Skip to main content

Use PyTorch with Metaflow


How do I build and fit a PyTorch model in a Metaflow flow?


PyTorch has many use cases and model types. It is important to think about designing your flow based on which steps you want to run in the same compute environment. For example, if you want to run training with a GPU and validation with a CPU you can separate these into steps and have them use different resources using Metaflow decorators like @batch and @kubernetes.

To structure the model as a flow you need to import Metaflow's FlowSpec object and step object. The flow in this example uses Metaflow's integration with conda for dependency management. In this case a single conda environment is created for all steps in.

The flow shows how to:

  • Read in hyperparameters of the training process with Metaflow's Parameter.
  • Fetch the MNIST dataset and store it in PyTorch DataLoader objects to be used in downstream flow tasks.
  • Train the CNN and save the model as a flow artifact.
  • Evaluate the model in a separate step.
    • Separating training and evaluation steps is useful when you want run different flow steps on different hardware.
from metaflow import (FlowSpec, step, Parameter, 
from torch_utilities import (train, test, Net,

class TorchFlow(FlowSpec):

lr = Parameter('lr', default=0.01)
epochs = Parameter('epochs', default=1)

def start(self):

def get_data(self):
import torch
train_dataset, train_args = get_data_loaders()
test_dataset, test_args = get_data_loaders(
self.train_loader =
train_dataset, **train_args)
self.test_loader =
test_dataset, **test_args)

def fit_model(self):
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
self.model = Net()
optimizer = optim.Adadelta(
scheduler = StepLR(optimizer, step_size=1)
for epoch in range(1, self.epochs + 1):
train(self.model, self.train_loader,
optimizer, epoch)
_ = test(self.model, self.test_loader)

def evaluate_model(self):
self.test_score = test(self.model,
print(f"Model scored {100*self.test_score}%")

def end(self):

if __name__ == "__main__":

In this flow you see a CNN example from the PyTorch examples that is trained on the MNIST digit classification task. Here is the script used in the flow. It contains PyTorch definitions for a convolutional neural network, a training function, a testing function, and a data loading function.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch
from torchvision import datasets, transforms

class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout(0.25)
self.dropout2 = nn.Dropout(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)

def forward(self, x):
x = self.conv1(x)
x = F.relu(x)
x = self.conv2(x)
x = F.relu(x)
x = F.max_pool2d(x, 2)
x = self.dropout1(x)
x = torch.flatten(x, 1)
x = self.fc1(x)
x = F.relu(x)
x = self.dropout2(x)
x = self.fc2(x)
output = F.log_softmax(x, dim=1)
return output

def train(model, train_loader, optimizer, epoch):
for idx, (data, target) in enumerate(train_loader):
output = model(data)
loss = F.nll_loss(output, target)
if idx * len(data) % 10000 == 0:
out = 'Train Epoch: ' + \
'{} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, idx * len(data), len(train_loader.dataset),
100. * idx / len(train_loader), loss.item())

def test(model, test_loader):
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
output = model(data)
test_loss += F.nll_loss(
output, target,
pred = output.argmax(dim=1, keepdim=True)
correct += pred.eq(
return correct / len(test_loader.dataset)

def get_data_loaders(name="train"):
transform = transforms.Compose([
if name=="train":
dataset = datasets.MNIST('../data',
train_args = {'batch_size': 32}
return dataset, train_args
elif name=="test":
dataset = datasets.MNIST('../data', train=False,
test_args = {'batch_size': 32}
return dataset, test_args

Run Flow

With the dependencies and flow defined in, you can run the script with the following command:

python --environment=conda run
Workflow starting (run-id 634):
[634/start/3294 (pid 3782)] Task is starting.
[634/start/3294 (pid 3782)] Task finished successfully.
[634/get_data/3295 (pid 4121)] Task is starting.
100.0%05-25 14:54:17.557 [634/get_data/3295 (pid 4121)] 0.0%
[634/get_data/3295 (pid 4121)]
102.8%05-25 14:54:18.083 [634/get_data/3295 (pid 4121)] 3.5%
[634/get_data/3295 (pid 4121)]
100.0%05-25 14:54:24.052 [634/get_data/3295 (pid 4121)] 0.1%
[634/get_data/3295 (pid 4121)]
112.7%05-25 14:54:24.307 [634/get_data/3295 (pid 4121)] 22.5%
[634/get_data/3295 (pid 4121)]
[634/get_data/3295 (pid 4121)] Task finished successfully.
[634/fit_model/3296 (pid 4267)] Task is starting.
[634/fit_model/3296 (pid 4267)] Train Epoch: 1 [0/60000 (0%)] Loss: 2.299288
[634/fit_model/3296 (pid 4267)] Train Epoch: 1 [20000/60000 (33%)] Loss: 0.907390
[634/fit_model/3296 (pid 4267)] Train Epoch: 1 [40000/60000 (67%)] Loss: 0.384034
[634/fit_model/3296 (pid 4267)] Task finished successfully.
[634/evaluate_model/3297 (pid 4295)] Task is starting.
[634/evaluate_model/3297 (pid 4295)] Model scored 92.52%
[634/evaluate_model/3297 (pid 4295)] Task finished successfully.
[634/end/3298 (pid 4316)] Task is starting.
[634/end/3298 (pid 4316)] Task finished successfully.

Further Reading

For help or feedback, please join Metaflow Slack. To suggest an article, you may open an issue on GitHub.

Join Slack