This episode will focus on tracking model training results with TensorBoard.
When building a machine learning system, it is necessary to track results to make decisions that improve models. Sometimes, though, it isn't clear where to store results so that they are organized and accessible to the people who you want to see them. In this episode, you will see how to use the built-in versioning of the Metaflow datastore to organize TensorBoard logs by the Metaflow run that produced them.
1 Controlling TensorBoard Logs
With TensorBoard, you can control where results are logged by setting the log directory.
You may want to do this in cases like the train step of the TrainHandGestureClassifier flow, where we are writing TensorBoard logs from an ephemeral compute instance.
The goal is to write these logs to a persistent location that we can read from any computer with access to the S3 object.
The approach taken here is to use the existing Metaflow datastore, and its built-in versioning capabilities, to organize TensorBoard logs produced in Metaflow runs.
2 Store TensorBoard Results in S3
In our case, we are using the TensorBoard and PyTorch integration. We can set the log_dir location like:
import os
from torch.utils.tensorboard import SummaryWriter

log_dir = os.path.join(tensorboard_s3_prefix, experiment_path, "logs")
writer = SummaryWriter(log_dir=log_dir)
writer.add_scalar("loss/train", loss_value, step)
If an S3 prefix is used for the log_dir argument of SummaryWriter, then TensorBoard will write results to that S3 location.
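The logging call is typically made once per step inside the training loop. As a minimal sketch, assume a hypothetical helper `log_losses` and a `writer` object that exposes `add_scalar(tag, value, step)` the way `SummaryWriter` does. A small stand-in writer is used here so the snippet runs without PyTorch installed; it is not part of the tutorial's flow.

```python
class RecordingWriter:
    """Hypothetical stand-in for SummaryWriter that records calls in memory."""

    def __init__(self):
        self.records = []

    def add_scalar(self, tag, scalar_value, global_step=None):
        self.records.append((tag, scalar_value, global_step))


def log_losses(writer, losses, tag="loss/train", start_step=0):
    """Write one scalar per loss value, numbering steps from start_step."""
    for offset, loss_value in enumerate(losses):
        writer.add_scalar(tag, float(loss_value), start_step + offset)
    return start_step + len(losses)  # next unused step


writer = RecordingWriter()
next_step = log_losses(writer, [0.9, 0.7, 0.55])
```

In a real training step, `writer` would be a `SummaryWriter` created with the S3-backed `log_dir`. Returning the next unused step is one way to keep the step count continuous if training resumes from a checkpoint.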
We can use the Metaflow config to determine where we want to write the results.
For example, the flow uses the following logic to set the TensorBoard log storage location:
datastore = metaflow_config.METAFLOW_CONFIG['METAFLOW_DATASTORE_SYSROOT_S3']
self.experiment_storage_prefix = os.path.join(datastore, current.flow_name, current.run_id)
The train step will then write TensorBoard logs under this prefix.
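The path logic can also be sketched as a standalone helper. This is an illustration rather than the flow's actual code: `run_log_dir` is a hypothetical name, the bucket in the example is made up, and `posixpath.join` is swapped in for `os.path.join` so that the separator is always `/`, which S3 URIs expect even on Windows.

```python
import posixpath


def run_log_dir(datastore_root, flow_name, run_id, subdir="logs"):
    """Build a per-run TensorBoard log location under the datastore root."""
    # Strip a trailing slash so the result has no doubled separators.
    root = datastore_root.rstrip("/")
    return posixpath.join(root, flow_name, str(run_id), subdir)


# Hypothetical datastore root and run id, for illustration only.
example = run_log_dir("s3://my-bucket/metaflow/", "TrainHandGestureClassifier", 1234)
```

Because the flow name and run ID are part of the path, each run's logs land in their own directory, and logs from different runs never overwrite each other.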
3 View TensorBoard Results in S3
After running the TrainHandGestureClassifier flow, you will see a URI printed with the location where TensorBoard logs are stored.
You can then point TensorBoard's --logdir option at your path.
This can be run from the command line on your computer, assuming you have access to the S3 bucket, which will be in the AWS account where your Metaflow deployment is.
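If you prefer to start TensorBoard from Python rather than typing the command, a minimal sketch follows. `tensorboard_command` and `launch_tensorboard` are hypothetical helpers, and the S3 path in the example is a placeholder for the URI your run printed. `--logdir` and `--port` are standard TensorBoard command-line options. Depending on how TensorBoard is installed, reading from S3 may need extra packages.

```python
import subprocess


def tensorboard_command(logdir, port=6006):
    """Return the argument list for launching TensorBoard on logdir."""
    return ["tensorboard", "--logdir", logdir, "--port", str(port)]


def launch_tensorboard(logdir, port=6006):
    """Start TensorBoard as a child process; the caller should terminate it."""
    return subprocess.Popen(tensorboard_command(logdir, port))


# Placeholder path; substitute the URI printed by your run.
cmd = tensorboard_command("s3://my-bucket/path/to/logs")
```

Building the command as a list rather than a single string avoids shell quoting problems if the path contains unusual characters.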
Congratulations! You have completed all of the episodes in our Computer Vision Training in the Cloud tutorial. In these episodes, you have learned how to:
- Use a PyTorch DataLoader and a custom Dataset.
- Use Metaflow's S3 client to efficiently move data between your local machine, S3, and ephemeral compute instances that run Metaflow tasks.
- Create a flow that performs transfer learning on state-of-the-art computer vision models.
- Train models on GPUs.
- Set up model checkpoints to resume model state in flows and notebooks, saving costly progress.
- Use TensorBoard to track model training results, leveraging Metaflow's built-in versioning to organize results.
To keep progressing in your Metaflow journey you can: