Store Artifacts across Metaflow Steps

Question

How can I use Metaflow to save and version data artifacts such as NumPy arrays, pandas DataFrames, or other Python objects? How can I access and update artifacts throughout the steps of a flow?

Solution

This example shows how to save any picklable Python object as an artifact (called some_data here) by assigning it to an attribute of self. Downstream steps can then read and reassign the artifact through self to propagate changes.
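Because an artifact has to be picklable, it can help to check an object before assigning it to self. The sketch below uses only the standard library's pickle module; the helper names is_picklable and roundtrip are hypothetical and not part of Metaflow, and Metaflow's own serialization settings may differ from the defaults used here.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        # Lambdas, open file handles, and similar objects end up here.
        return False
    return True


def roundtrip(obj):
    """Serialize and deserialize obj, returning the restored copy."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    print(is_picklable([1, 2, 3]))      # a plain list can be stored
    print(is_picklable(lambda x: x))    # a lambda cannot
    print(roundtrip({"a": [1, 2, 3]}))  # the restored copy equals the original
```

Objects that fail this check, such as lambdas or open file handles, are better reduced to plain data (for example a path or a list of values) before they are stored.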

1. Run Flow

This flow shows how to

  • Store a flow artifact.
  • Update the artifact in a downstream step.
  • Watch how the artifacts change during the flow.
pass_artifacts_between_steps.py
from metaflow import FlowSpec, step

class ArtFlow(FlowSpec):

    @step
    def start(self):
        self.some_data = [1, 2, 3]  # define artifact state
        self.next(self.middle)

    @step
    def middle(self):
        print(f'the data artifact is: {self.some_data}')
        self.some_data = [1, 2, 4]  # update artifact state
        self.next(self.end)

    @step
    def end(self):
        print(f'the data artifact is: {self.some_data}')

if __name__ == '__main__':
    ArtFlow()

When you run the flow, each step sees the artifact as the previous step left it. This works the same way whether you run your flows locally or remotely (for example with @batch).

python pass_artifacts_between_steps.py run --run-id-file artifacts-run.txt
    ...
[1654221288112057/middle/2 (pid 71321)] Task is starting.
[1654221288112057/middle/2 (pid 71321)] the data artifact is: [1, 2, 3]
[1654221288112057/middle/2 (pid 71321)] Task finished successfully.
...
[1654221288112057/end/3 (pid 71343)] Task is starting.
[1654221288112057/end/3 (pid 71343)] the data artifact is: [1, 2, 4]
[1654221288112057/end/3 (pid 71343)] Task finished successfully.
...

2. Access Artifacts Outside of Flow

You can use the client API to access data artifacts after a run is complete. There are several ways to access this data; two examples are shown below.

You can reference Run('<FlowName>/<Run ID>') to access artifacts:

from metaflow import Run

# read the run ID that the previous run saved in artifacts-run.txt
with open('artifacts-run.txt') as f:
    run_id = f.read().strip()
some_data = Run(f'ArtFlow/{run_id}').data.some_data
print(some_data)
    [1, 2, 4]

You can also get the artifact from the latest run as demonstrated below:

from metaflow import Flow
assert Flow('ArtFlow').latest_run.data.some_data == [1,2,4]
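The two lookups above return the artifact as it stood when the run ended. To see how it changed along the way, you can walk the steps of a run and read the artifact from each step's task. The following is a minimal sketch: the helper name artifact_by_step is hypothetical, and it assumes each step has a single task, as in ArtFlow (steps inside a foreach have several tasks).

```python
def artifact_by_step(run, name):
    """Map each step name to the value of artifact `name` at the end of that step.

    `run` is expected to behave like a Metaflow client Run: iterating over it
    yields step objects with an `id` and a `task` whose `data` exposes
    artifacts as attributes. Steps that never stored `name` are skipped.
    """
    missing = object()  # sentinel, so an artifact stored as None is still reported
    values = {}
    for step in run:
        value = getattr(step.task.data, name, missing)
        if value is not missing:
            values[step.id] = value
    return values


# Usage against a real run (requires metaflow and a completed ArtFlow run):
#   from metaflow import Flow
#   print(artifact_by_step(Flow('ArtFlow').latest_run, 'some_data'))
```

For the run shown earlier, this should map start to [1, 2, 3] and both middle and end to [1, 2, 4], since end inherits the value that middle assigned.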

Further Reading