
Share Local Data with S3

Question

How do I load data from a local directory structure on AWS Batch using Metaflow's S3 client?

Solution

When using Metaflow's @batch decorator to run a step on AWS Batch, there are several options for accessing data. This page shows how to:

  • Serialize data in a non-pickle format from a local step.
  • Upload it to S3 using Metaflow's client.
  • Read the data from a downstream step that runs on AWS Batch or Kubernetes.

1. Acquire Data

The example will access this CSV file:

local_data.csv
1, 2, 3
4, 5, 6

2. Configure metaflow.S3

When using Metaflow's @batch decorator you need an S3 bucket configured as the datastore. When S3 is configured in ~/.metaflowconfig/config.json, artifacts assigned like self.artifact_name are automatically serialized and stored on S3, so in most cases you don't need to call Metaflow's S3 client directly. However, there are situations where you want to read or write arbitrary S3 contents yourself, for example when the data should not be stored in pickle format.
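
As a point of reference, here is a minimal sketch of the default behavior (the flow name, file name, and artifact are hypothetical): with S3 configured as the datastore, any artifact assigned to self is pickled and stored on S3 without touching the S3 client.

artifact_flow.py
from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # with S3 configured as the datastore, this artifact is
        # pickled and stored on S3 automatically
        self.numbers = [1, 2, 3]
        self.next(self.end)

    @step
    def end(self):
        # the artifact is loaded back from S3 transparently
        print(self.numbers)

if __name__ == '__main__':
    ArtifactFlow()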

3. Run Flow

This flow shows how to:

  • Read the contents of local_data.csv using IncludeFile.
  • Serialize the contents of the file using the json module.
  • Put the serialized data on AWS S3 with Metaflow's S3 client.
  • Read the data back in a downstream step that runs on AWS Batch.
local_data_on_batch_s3.py
from metaflow import (FlowSpec, step, IncludeFile,
                      batch, S3)
import json


class S3FileFlow(FlowSpec):

    # include the local CSV file in the flow package
    data = IncludeFile('data',
                       default='./local_data.csv')

    @step
    def start(self):
        # serialize the file contents to JSON and upload it
        # to this run's S3 location
        with S3(run=self) as s3:
            res = json.dumps({'data': self.data})
            url = s3.put('data', res)
        self.next(self.read_from_batch)

    @batch(cpu=1)
    @step
    def read_from_batch(self):
        # change `run=self` to any run to read data stored by another run
        with S3(run=self) as s3:
            data = s3.get('data').text
            print(f"File contents: {json.loads(data)}")
        self.next(self.end)

    @step
    def end(self):
        print('Finished reading the data!')


if __name__ == '__main__':
    S3FileFlow()
python local_data_on_batch_s3.py run
    ...
[467/end/2405 (pid 46565)] Task is starting.
[467/end/2405 (pid 46565)] Finished reading the data!
[467/end/2405 (pid 46565)] Task finished successfully.
...
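
After the run finishes, the same object can be read from outside the flow with Metaflow's Client API and S3 client. This is a small sketch of a standalone script (the file name is hypothetical), using the run ID shown in the output above (467 in this example):

read_results.py
from metaflow import Run, S3
import json

# read the object that the `start` step uploaded to this run's S3 location
with S3(run=Run('S3FileFlow/467')) as s3:
    data = s3.get('data').text
    print(json.loads(data))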

Further Reading