Skip to main content

Intermediate Computer Vision: Episode 5

Now that you have the core elements for a robust computer vision training environment, how can you use the TrainHandGestureClassifier flow to iteratively find the best models for your use case?

For the rest of this tutorial, we will demonstrate two important elements of iterative model development: checkpointing model state and tracking experiment results.

To follow along with this page, you can access this Jupyter notebook.

1Checkpoint Models

Checkpointing in model development essentially means that you save the state of a model, so you can resume it at a later time. This way you can make sure you do not lose results, such as your trained model. It also ensures you have a process to load an already trained model in future training and production scenarios while avoiding duplication of costly computation.

In the PyTorch example used in the TrainHandGestureClassifier flow, a "checkpoint" refers to this code.

checkpoint_dict = {
'state_dict': model.cpu().state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'epoch': epoch,
'config': config_dict
}, checkpoint_path)

You can save any artifact of the training work you have done in a checkpoint_dict like this. Then, you can resume the model state from the checkpoint.

from models.mobilenetv3 import MobileNetV3
model = MobileNetV3(num_classes=num_classes, size='small', pretrained=pretrained, freezed=freezed)
checkpoint = torch.load(checkpoint, map_location=torch.device(device))["state_dict"]
model.load_state_dict(checkpoint, strict=False)
return model

Here are more general resources from the PyTorch documentation on checkpointing.

2Uploading the Best Model to the Cloud

Model checkpoints in this example are written to the best_model.pth location. But if we are running on a remote compute instance, how do we move this checkpoint to a cloud resource that will persist beyond the lifecycle of the compute task? Again, Metaflow's S3 client makes this easy!


There are many ways to structure model development workflows. You may not have to store model checkpoints in the cloud, for example. You also might prefer to use Metaflow's IncludeFile pattern to move data of this type onto remote compute instances.

After each time model performs better than the previous best one, it is checkpointed and the result is uploaded to the cloud using this snippet:

path_to_best_model = os.path.join(experiment_path, 'best_model.pth')
with S3(s3root = experiment_cloud_storage_path) as s3:
s3.put_files([(path_to_best_model, path_to_best_model)])

3Resuming the Best Model State

The payoff of checkpointing in this way is that now you can easily resume the model from this state.

In a notebook or Python script you can now evaluate the model, train it further, or iterate on the model architecture (PyTorch allows you to build dynamic graphs).

from import _initialize_model
from omegaconf import OmegaConf

model_path = 'best_model.pth'

model = _initialize_model(
conf = OmegaConf.load('hagrid/classifier/config/default.yaml'),
model_name = 'ResNet18',
checkpoint_path = model_path, # can be local or S3 URI.
device = 'cpu'
except FileNotFoundError:
print("Are you sure you trained a model and saved the file to {}".format(model_path))
    Building ResNet18
Building model from local checkpoint at: best_model.pth

Because of how the TrainHandGestureClassifier uses Metaflow's built-in versioning capabilities, we can also resume model training with the --checkpoint parameter when you run the TrainHandGestureClassifier defined in

Using model state at the end of run N of the TrainHandGestureClassifier flow as the initial state of the model in run N+1

This checkpoint parameter can either be a .pth file in an S3 bucket or a path to a local .pth file. In general, you can add arbitrary parameter types like this to your Metaflow flows in one line of code. This helps you write flexible code that can read in data from external systems in a variety of ways.

A demonstration of using a model checkpoint from a past run in the command to run the TrainHandGestureClassifier flow
python --package-suffixes '.yaml' run --epochs 1 --model 'ResNet18' --checkpoint 'best_model.pth'
     Workflow starting (run-id 187938):
[187938/start/1013688 (pid 28793)] Task is starting.
[187938/start/1013688 (pid 28793)] Training ResNet18 in flow TrainHandGestureClassifier
[187938/start/1013688 (pid 28793)] Task finished successfully.
[187938/train/1013689 (pid 28800)] Task is starting.
[187938/train/1013689 (pid 28800)] Downloading images...
[187938/train/1013689 (pid 28800)] Done!
[187938/train/1013689 (pid 28800)] Downloading annotations...
[187938/train/1013689 (pid 28800)] Done!
[187938/train/1013689 (pid 28800)] Building ResNet18
[187938/train/1013689 (pid 28800)] Building model from local checkpoint at: best_model.pth
[187938/train/1013689 (pid 28800)] Best model checkpoint saved at s3://oleg2-s3-mztdpcvj/metaflow/TrainHandGestureClassifier/187938/experiments/ResNet18/best_model.pth
[187938/train/1013689 (pid 28800)]
[187938/train/1013689 (pid 28800)] View Tensorboard results in your browser with this command:
[187938/train/1013689 (pid 28800)] tensorboard --logdir=s3://oleg2-s3-mztdpcvj/metaflow/TrainHandGestureClassifier/187938/experiments/ResNet18/logs
[187938/train/1013689 (pid 28800)]
[187938/train/1013689 (pid 28800)] Best model checkpoint saved at s3://oleg2-s3-mztdpcvj/metaflow/TrainHandGestureClassifier/187938/experiments/ResNet18/best_model.pth
[187938/train/1013689 (pid 28800)] Task finished successfully.
[187938/end/1013690 (pid 28927)] Task is starting.
[187938/end/1013690 (pid 28927)] Task finished successfully.

In this lesson, you saw how to ensure you don't lose progress as you iterate on your model using checkpoints. You learned how to store model checkpoints and resume that state from a notebook or as the starting point in a subsequent flow. In the next lesson, we will complete the tutorial by demonstrating the use of TensorBoard's experiment tracking solution with Metaflow.