How to determine whether I should store data in a flow's self
This page discusses two considerations that help answer this question when writing Metaflow flows: first, whether the object you want to assign to self.variable_name can be serialized with pickle; second, what type of data it is.
1. Why Assign Data to the self Keyword?
In Metaflow, data can be assigned to variables with the flow object's self keyword, like self.variable_name. This makes the contents of self.variable_name accessible in downstream steps and outside of the flow's runtime environment.
Storing data with the self keyword in this way is referred to as storing flow artifacts.
2. The self Keyword and Serialization
It is important to know that when you use the self keyword, Metaflow uses Python's built-in pickle module to serialize artifacts. This lets Metaflow move artifacts so they are accessible in any downstream compute environment where you run tasks. Some popular machine learning libraries have objects that are incompatible with pickle; in those cases, the library typically provides its own serialization mechanism that you can use instead. One example is XGBoost, whose DMatrix dataset object cannot be serialized with pickle.
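The original XGBoost snippet is not reproduced here; the sketch below illustrates the same failure mode and workaround using only the standard library, with a lambda standing in for an unpicklable object such as a DMatrix:

```python
import pickle

# A lambda stands in for an object pickle can't handle
# (XGBoost's DMatrix is the real-world case described above).
transform = lambda x: x * 2

try:
    pickle.dumps(transform)
    picklable = True
except Exception:
    picklable = False

print("picklable:", picklable)  # the lambda cannot be pickled

# Workaround pattern: store the plain data the object was built from,
# then reconstruct the object in the downstream step. (Libraries also
# offer their own formats, e.g. XGBoost's DMatrix.save_binary.)
rows = [[1.0, 2.0], [3.0, 4.0]]   # e.g., the array a DMatrix wraps
payload = pickle.dumps(rows)       # plain Python data pickles fine
restored = pickle.loads(payload)
# downstream step (hypothetical): dtrain = xgboost.DMatrix(restored)
```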
3. What Type of Data to Assign to self
Generally, there are three types of data that flows read, create, and write:
- Input data
- Flow internal state
- Output data
The self keyword in a flow is meant to track internal flow state for objects that can be pickled. These artifacts track the state of variables that change throughout the flow lifecycle.
In a machine learning context, examples of data you might consider a flow artifact include:
- The distribution of a dataset's features.
- Hyperparameters and corresponding performance metric values.
- A URL to a new dataset version that was created during the flow.
4. What Type of Data Not to Assign to self
Of the three kinds of data listed above, you typically will not want to use self for input and output data.
Input datasets are typically stored in a data warehouse, so they don't need to be stored again by Metaflow. They are often large, and copying them into your Metaflow datastore duplicates storage and can be costly. Examples of input datasets include raw data and features for model training.
Similarly, output datasets are meant to be consumed by systems outside Metaflow, so it is better to store them in another database or at a known location, such as an S3 bucket or a similar solution that suits the downstream data access pattern. Examples of output datasets include transformed versions of raw datasets.
Instead of using self for these large datasets, you can load them efficiently with Metaflow's built-in cloud data integrations.