
Intermediate Computer Vision: Episode 6

This episode will focus on tracking model training results with TensorBoard.

When building a machine learning system, you need to track results in order to make decisions that improve your models. It is not always clear, though, where to store results so that they stay organized and accessible to the people who need to see them. In this episode, you will see how to use the built-in versioning of the Metaflow datastore to organize TensorBoard logs by the Metaflow run that produced them.

1. Controlling TensorBoard Logs

With TensorBoard, you can control where results are logged using the log_dir parameter. This is useful in cases like the train step of the TrainHandGestureClassifier flow, where TensorBoard logs are written from an ephemeral compute instance. The goal is to write these logs to a persistent location that can be read from any computer with access to the S3 bucket. The approach taken here is to use the existing Metaflow datastore, and its built-in versioning capabilities, to organize the TensorBoard logs produced in Metaflow runs.

2. Store TensorBoard Results in S3

In our case, we are using the TensorBoard and PyTorch integration. We can set the log_dir location like:

import os
from torch.utils.tensorboard import SummaryWriter

log_dir = os.path.join(tensorboard_s3_prefix, experiment_path, "logs")
writer = SummaryWriter(log_dir=log_dir)
...
writer.add_scalar("loss/train", loss_value, step)

If an S3 prefix is used for the log_dir argument of SummaryWriter, then TensorBoard will write its logs to that S3 location. We can use the Metaflow config to determine where to write the results. For example, you will see the following logic in the TrainHandGestureClassifier code, which sets the TensorBoard log storage location:

datastore = metaflow_config.METAFLOW_CONFIG['METAFLOW_DATASTORE_SYSROOT_S3']
self.experiment_storage_prefix = os.path.join(datastore, current.flow_name, current.run_id)

The train step will then write TensorBoard logs to <experiment_storage_prefix>/experiments/logs.
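The path logic above can be sketched as a small standalone helper. This is a minimal sketch under stated assumptions, not the flow's actual code: the function name `tensorboard_log_dir`, the example bucket, and the run ID are hypothetical, and the `experiments` directory name is taken from the path shown above. The sketch uses `posixpath.join` rather than `os.path.join` so that the S3 URI keeps forward slashes on any operating system.

```python
import posixpath


def tensorboard_log_dir(datastore_root, flow_name, run_id, experiment_path="experiments"):
    """Build <datastore_root>/<flow_name>/<run_id>/<experiment_path>/logs.

    Hypothetical helper for illustration; it is not part of Metaflow.
    """
    # Drop a trailing slash so the result does not depend on how the root is written.
    prefix = posixpath.join(datastore_root.rstrip("/"), flow_name, str(run_id))
    return posixpath.join(prefix, experiment_path, "logs")


# Example values are made up; inside a flow they would come from the Metaflow
# config and from current.flow_name / current.run_id.
log_dir = tensorboard_log_dir("s3://my-bucket/metaflow", "TrainHandGestureClassifier", 1234)
print(log_dir)
```

Because the run ID is part of the path, each run writes to its own directory, so logs from earlier runs are not overwritten.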

3. View TensorBoard Results in S3

[Figure: a demonstration of accessing TensorBoard results using the URI that the TrainHandGestureClassifier flow writes to stdout.]

After running the TrainHandGestureClassifier flow, you will see a printed URI for the location where the TensorBoard logs are stored. You can run the following command, substituting your own path:

tensorboard --logdir=<tensorboard_s3_prefix>/experiments

You can run this from the command line on your own computer, provided you have access to the S3 bucket, which is in the AWS account that hosts your Metaflow deployment.
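If you want to script this step, the following is a minimal sketch that takes the printed prefix, extracts the bucket name (so you know which bucket you need access to), and assembles the tensorboard command shown above. The function name `describe_tensorboard_target` and the example URI are hypothetical and used only for illustration.

```python
import shlex
from urllib.parse import urlparse


def describe_tensorboard_target(tensorboard_s3_prefix):
    """Return (bucket name, tensorboard command) for a printed S3 prefix."""
    parsed = urlparse(tensorboard_s3_prefix)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3:// URI, got {tensorboard_s3_prefix!r}")
    # The tutorial's command points --logdir at the "experiments" directory.
    logdir = tensorboard_s3_prefix.rstrip("/") + "/experiments"
    command = ["tensorboard", f"--logdir={logdir}"]
    return parsed.netloc, command


# Made-up prefix; use the URI printed by your own run instead.
bucket, command = describe_tensorboard_target(
    "s3://my-bucket/metaflow/TrainHandGestureClassifier/1234"
)
print(bucket)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command)` on a machine where TensorBoard is installed and your AWS credentials grant read access to the bucket.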

Summary

Congratulations! You have completed all of the episodes in our Computer Vision Training in the Cloud tutorial. In these episodes, you have learned how to:

  • Use a PyTorch DataLoader and a custom Dataset.
  • Use Metaflow's S3 client to efficiently move data between your local machine, S3, and ephemeral compute instances that run Metaflow tasks.
  • Create a flow that performs transfer learning on state-of-the-art computer vision models.
  • Train models on GPUs.
  • Set up model checkpoints to resume model state in flows and notebooks, saving costly progress.
  • Use TensorBoard to track model training results, leveraging Metaflow's built-in versioning to organize results.

To keep progressing in your Metaflow journey you can: