
Share Local Data with S3

Question

How do I load data from a local directory structure on AWS Batch using Metaflow's S3 client?

Solution

When using Metaflow's @batch decorator to run a step on AWS Batch, there are several options for accessing data. This page shows how to:

  • Serialize data in a non-pickle format from a local step.
  • Upload it to S3 using Metaflow's client.
  • Read the data from a downstream step that runs on AWS Batch or Kubernetes.

1. Acquire Data

The example will access this CSV file:

local_data.csv
1, 2, 3
4, 5, 6

2. Configure metaflow.S3

When using Metaflow's @batch decorator you need an S3 bucket configured as the datastore. When S3 is configured in ~/.metaflowconfig/config.json, artifacts assigned like self.artifact_name are automatically serialized and stored on S3, so in most cases you don't need to call Metaflow's S3 client directly. However, there are situations where you want to read or write arbitrary S3 contents yourself, for example when the data should not be stored in pickle format.
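
As a point of reference, here is a minimal sketch of the default behavior (the flow name, file name, and artifact are hypothetical): with S3 configured as the datastore, any artifact assigned to self is pickled and stored on S3 without touching the S3 client.

artifact_flow.py
from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # with S3 configured as the datastore, this artifact is
        # pickled and stored on S3 automatically
        self.numbers = [1, 2, 3]
        self.next(self.end)

    @step
    def end(self):
        # the artifact is loaded back from S3 transparently
        print(self.numbers)

if __name__ == '__main__':
    ArtifactFlow()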

3. Run Flow

This flow shows how to:

  • Read the contents of local_data.csv using IncludeFile.
  • Serialize the contents of the file using the json module.
  • Put the serialized data on AWS S3 with Metaflow's S3 client.
  • Read the data back in a downstream step that runs on AWS Batch.
local_data_on_batch_s3.py
from metaflow import (FlowSpec, step, IncludeFile,
                      batch, S3)
import json


class S3FileFlow(FlowSpec):

    # include the local CSV file in the flow package
    data = IncludeFile('data',
                       default='./local_data.csv')

    @step
    def start(self):
        # serialize the file contents to JSON and upload it
        # to this run's S3 location
        with S3(run=self) as s3:
            res = json.dumps({'data': self.data})
            url = s3.put('data', res)
        self.next(self.read_from_batch)

    @batch(cpu=1)
    @step
    def read_from_batch(self):
        # change `run=self` to any run to read data stored by another run
        with S3(run=self) as s3:
            data = s3.get('data').text
            print(f"File contents: {json.loads(data)}")
        self.next(self.end)

    @step
    def end(self):
        print('Finished reading the data!')


if __name__ == '__main__':
    S3FileFlow()
python local_data_on_batch_s3.py run
    ...
[467/end/2405 (pid 46565)] Task is starting.
[467/end/2405 (pid 46565)] Finished reading the data!
[467/end/2405 (pid 46565)] Task finished successfully.
...
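
After the run finishes, the same object can be read from outside the flow with Metaflow's Client API and S3 client. This is a small sketch of a standalone script (the file name is hypothetical), using the run ID shown in the output above (467 in this example):

read_results.py
from metaflow import Run, S3
import json

# read the object that the `start` step uploaded to this run's S3 location
with S3(run=Run('S3FileFlow/467')) as s3:
    data = s3.get('data').text
    print(json.loads(data))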

Further Reading