Share Local Data with S3


How do I load data from a local directory structure on AWS Batch using Metaflow's S3 client?


When a step runs under Metaflow's @batch decorator, there are several options for accessing data. This page shows how to:

  • Serialize data in a non-pickle format from a local step.
  • Upload it to S3 using Metaflow's client.
  • Read the data from a downstream step that runs on AWS Batch or Kubernetes.
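Serializing with the json module (rather than pickle) keeps the S3 object readable by consumers other than Python. As a minimal sketch, the round trip the flow performs looks like this (the CSV string is the sample data from this page):

```python
import json

# The CSV contents from this page, as a plain string
csv_text = "1, 2, 3\n4, 5, 6\n"

# Serialize to JSON text -- this is what gets uploaded to S3...
payload = json.dumps({'data': csv_text})

# ...and this is what the downstream Batch step does after downloading
restored = json.loads(payload)
```
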

1. Acquire Data

The example will access this CSV file:

1, 2, 3
4, 5, 6
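To follow along, you can create local_data.csv with these contents yourself; for example:

```python
# Write the sample CSV used by the flow below (contents shown above)
csv_contents = "1, 2, 3\n4, 5, 6\n"
with open('local_data.csv', 'w') as f:
    f.write(csv_contents)
```
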

2. Configure metaflow.S3

When using Metaflow's @batch decorator, you need to have an S3 bucket configured. When S3 is configured in ~/.metaflowconfig/config.json, artifacts defined like self.artifact_name are serialized and stored on S3 automatically. This means that in most cases you don't need to call Metaflow's S3 client directly. However, you may still want to access arbitrary S3 bucket contents, which is what the S3 client is for.
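A minimal config fragment might look like the following; the bucket name and prefix are placeholders, so substitute your own values:

```json
{
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-bucket/metaflow"
}
```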

3. Run Flow

This flow shows how to:

  • Read the contents of local_data.csv using IncludeFile.
  • Serialize the contents of the file using the json module.
  • Upload the data to AWS S3 and read it back from a step running on AWS Batch.
from metaflow import FlowSpec, step, IncludeFile, batch, S3
import json

class S3FileFlow(FlowSpec):

    # include the contents of the local CSV file in the flow
    data = IncludeFile('data', default='local_data.csv')

    @step
    def start(self):
        # serialize the file contents with json and upload to S3
        with S3(run=self) as s3:
            res = json.dumps({'data': self.data})
            url = s3.put('data', res)
            print(f"Data saved at {url}")
        self.next(self.read_from_batch)

    @batch
    @step
    def read_from_batch(self):
        # change `run=self` to any run to read another run's data
        with S3(run=self) as s3:
            data = s3.get('data').text
            print(f"File contents: {json.loads(data)}")
        self.next(self.end)

    @step
    def end(self):
        print('Finished reading the data!')

if __name__ == '__main__':
    S3FileFlow()
python s3_file_flow.py run
[467/end/2405 (pid 46565)] Task is starting.
[467/end/2405 (pid 46565)] Finished reading the data!
[467/end/2405 (pid 46565)] Task finished successfully.

Further Reading