Whether to Use a Flow's self Keyword

Question

How do I determine whether I should store data in a flow's self keyword?

Solution

This page discusses two considerations to help you answer this question when writing Metaflow flows: first, whether the object you want to assign to self.variable_name can be serialized with pickle, and second, what type of data the object holds.

1 Why Assign Data to the self Keyword?

In Metaflow, data can be assigned to variables with the flow object's self keyword, like self.variable_name. This makes the contents of self.variable_name accessible in downstream steps and outside of the flow's runtime environment. Storing data with the self keyword in this way is referred to as storing flow artifacts.
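
Here is a minimal sketch of this pattern; the flow and variable names are hypothetical:

```python
from metaflow import FlowSpec, step

class ArtifactFlow(FlowSpec):

    @step
    def start(self):
        # Assigning to self makes this value a flow artifact.
        self.message = "hello"
        self.next(self.end)

    @step
    def end(self):
        # The artifact is automatically available in downstream steps.
        print(self.message)

if __name__ == "__main__":
    ArtifactFlow()
```

After a run completes, the same artifact can also be read outside the flow with Metaflow's Client API, for example via Run("ArtifactFlow/<run_id>").data.message.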

2 The self Keyword and Serialization

It is important to know that when you use the self keyword, Metaflow uses Python's built-in pickle module to serialize artifacts. This allows Metaflow to move artifacts so they are accessible in any downstream compute environment where you run tasks. Occasionally you may run into incompatibilities between pickle and popular machine learning libraries; in such cases, the library will typically provide its own serialization mechanism that you can use instead. One example is XGBoost, whose DMatrix dataset object cannot be serialized with pickle.
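
As a minimal sketch of the workaround (the flow structure, parameter values, and variable names are illustrative): keep the pickleable raw arrays as artifacts, rebuild the DMatrix inside the step that needs it, and persist the trained model with XGBoost's own serialization via Booster.save_raw, whose bytes pickle cleanly.

```python
from metaflow import FlowSpec, step

class XGBoostFlow(FlowSpec):

    @step
    def start(self):
        import numpy as np

        # Store the pickleable raw arrays as artifacts, not a DMatrix.
        self.X = np.random.rand(100, 4)
        self.y = np.random.randint(2, size=100)
        self.next(self.train)

    @step
    def train(self):
        import xgboost as xgb

        # Rebuild the DMatrix from the artifacts inside this step.
        dtrain = xgb.DMatrix(self.X, label=self.y)
        booster = xgb.train(
            {"objective": "binary:logistic"}, dtrain, num_boost_round=5
        )
        # Use XGBoost's own serialization; the resulting bytes pickle fine.
        self.model_bytes = booster.save_raw()
        self.next(self.end)

    @step
    def end(self):
        import xgboost as xgb

        # Restore the model from the serialized bytes.
        booster = xgb.Booster()
        booster.load_model(bytearray(self.model_bytes))
        print("model restored")

if __name__ == "__main__":
    XGBoostFlow()
```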

3 What Type of Data to Assign to self

Generally, there are three types of data that flows will read, create, and write.

  1. Input data
  2. Flow internal state
  3. Output data

The self keyword is meant for tracking a flow's internal state, restricted to objects that can be pickled. These artifacts capture the state of variables that change throughout the flow's lifecycle.

In a machine learning context, examples of data you might consider a flow artifact include (see the sketch after this list):

  • The distribution of a dataset's features.
  • Hyperparameters and corresponding performance metric values.
  • A URL to a new dataset version that was created during the flow.
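
For instance, a flow might record all three as small, pickle-friendly artifacts; the names, numbers, and URL below are purely illustrative:

```python
from metaflow import FlowSpec, step

class TrackingFlow(FlowSpec):

    @step
    def start(self):
        # Hyperparameters used for this run.
        self.params = {"max_depth": 3, "learning_rate": 0.1}
        self.next(self.train)

    @step
    def train(self):
        # In a real flow this value would come from model evaluation.
        self.auc = 0.91
        # A pointer to a dataset version created during the flow.
        self.dataset_url = "s3://my-bucket/datasets/v2/train.parquet"
        self.next(self.end)

    @step
    def end(self):
        print(self.params, self.auc, self.dataset_url)

if __name__ == "__main__":
    TrackingFlow()
```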

How do I?

  • Pass Artifacts through a Join Step
  • Save and Version State of Artifacts


4 What Type of Data Not to Assign to self

Of the three kinds of data listed above, you typically will not want to use self for input and output data.

Input datasets are typically stored in some data warehouse, so they don't need to be stored by Metaflow again. They are often large, and duplicating them in your Metaflow datastore can be costly. Examples of input datasets include raw data and features for model training.

Similarly, output datasets are meant to be consumed by systems outside Metaflow, so it is better to store them in another database or at a known location, such as an S3 bucket or a similar solution that suits the downstream data access pattern. Examples of output datasets include transformed versions of raw datasets.

Instead of using self for these large datasets, you can load them efficiently using Metaflow's built-in cloud data integrations.
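
As a sketch of this pattern with Metaflow's S3 client (the bucket paths are hypothetical): read the large input directly from S3, write the transformed output back to a known external location, and keep only small pointers and summaries as artifacts.

```python
from io import BytesIO

import pandas as pd
from metaflow import FlowSpec, S3, step

class ExternalDataFlow(FlowSpec):

    @step
    def start(self):
        # Load a large input dataset directly from S3 instead of pickling it.
        with S3(s3root="s3://my-bucket/raw/") as s3:
            df = pd.read_csv(BytesIO(s3.get("train.csv").blob))
        # Write the transformed output to a known external location.
        out = df.dropna()
        with S3(s3root="s3://my-bucket/clean/") as s3:
            self.output_url = s3.put("train_clean.csv", out.to_csv(index=False))
        # Only a small pointer and a summary are stored as artifacts.
        self.num_rows = len(out)
        self.next(self.end)

    @step
    def end(self):
        print(self.num_rows, self.output_url)

if __name__ == "__main__":
    ExternalDataFlow()
```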

How do I?

  • Load CSV Data in Metaflow Steps
  • Load from S3 to pandas
  • Chunk a DataFrame using Foreach