New in Metaflow: The Long-Awaited @pypi Decorator


Today, we are releasing a new @pypi decorator in Metaflow. It looks like a step-level pip install but with superpowers under the hood. Read the docs to get started quickly.

Careful management of software dependencies is one of the most underrated parts of ML and AI systems, despite being critically important for the stability of production deployments as well as for the speed of development.

Metaflow has always taken this challenge seriously. Over the past several months, we have worked with Netflix and the wider Python package management community to improve the surprisingly sophisticated dependency management machinery inside Metaflow, culminating in today's release of a new @pypi decorator and an improved @conda.

Consider the key benefits of using the new decorators for dependency management:

  • Projects execute in stable, reproducible environments without having to manage dependencies or environments manually.
  • You can execute flows at scale in the cloud with all their dependencies included, without having to create Docker images manually.
  • You can ensure the stability of production deployments by isolating deployments from any dependency-related issues, thanks to snapshotting of packages.
  • You can audit the full software supply chain of flows, guaranteeing that no external packages are used without explicit action.

Much Ado About Dependencies

Our investment in dependency management is motivated by the daily pain and friction experienced by countless ML/AI teams. To highlight this, let's reenact a typical dialogue between our usual cast of characters: a data scientist 🧙 and an infrastructure engineer 👷.

🧙 My new PyTorch model works! Let's ship it!

👷 That's great! Btw, how did you install pytorch?

🧙 Hmm, I just ran pip install torch on my laptop.

👷 It'd be a good idea to pin the version so that your model doesn't break when PyTorch releases a new version.

🧙 Good idea! I'll do pip install torch==1.13.0

👷 Actually that's not enough. PyTorch may have transitive dependencies that are floating. You may get a different set of packages when you run that command tomorrow, causing subtle changes in your model.

It'd be better to pin all dependencies, including transitive dependencies. Unfortunately, pip doesn't support this easily. Maybe you can try Poetry, which supports lockfiles.

🧙 Ok, I'll look into using Poetry.

👷 Great. You can use it to manage virtual environments too. Make sure to test in a clean environment to ensure reproducibility.

🧙 Ugh. I have too many environments already...

👷 I hear you. Anyways, the production model isn't going to run on your laptop. Have you tried to run it in the cloud in a production-like environment?

🧙 Good point! I will do it. It shouldn't be hard, right?

👷 Well, your laptop has an M2 CPU and runs macOS. The production systems run Linux on Intel CPUs. The same lockfile isn't going to work.

🧙 Ugh. I'll move my dev environment to a cloud workstation, create a lockfile there, and install it on the fly on containers running the code.

👷 No no, you can't do that. We once had a project doing hyperparameter tuning on 200 containers in parallel, each trying to install packages on the fly. It's like a DDoS attack against the upstream repositories: they started failing randomly and throttling installs. It's too fragile in a production setting.

🧙 Argh. What should I do then?

👷 I'll create a Docker image for you.

And so they deploy the system in production with a hand-curated Docker image.

A month later

🧙 PyTorch 2.0 was just released! Can we upgrade the image to use it?

👷 Did you test it in staging?

🧙 No. I can't do it without you upgrading the image first.

👷 We can't just upgrade the production image without testing. I guess I need to create a separate testing image for you.

🧙 Cool. While you are at it, can you install tensorflow too? I want to test it as well.

👷 Ugh, the package is 500MB. Are you sure you need it? The image is getting huge.

🧙 I just need it for testing. Oh, and add fancyimpute.

👷 Fancy what? I don't have time to vet every random package.

🧙 I'd be happy to do it myself. I just want to avoid all the issues you pointed out.

👷 Right... We really need a better way to manage dependencies.

Dependency management with Metaflow

The new @pypi decorator and an improved @conda decorator are designed to address the above pain points. From the point of view of a developer 🧙, using them is as easy as running pip install locally. Just specify what packages you need at the step level, like pandas in this example:

from metaflow import FlowSpec, step, pypi, card

class PandasFlow(FlowSpec):

    @pypi(python='3.10.11', packages={'pandas': '2.1.1'})
    @card
    @step
    def start(self):
        import pandas as pd
        data = {
            "x": list(range(10)),
            "y": [x**2 for x in range(10)]
        }
        self.df = pd.DataFrame(data)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    PandasFlow()

Install the latest version of Metaflow and run the code as follows:

python pandasflow.py --environment=pypi run
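
If you don't have the latest Metaflow yet, a plain pip install (or upgrade) is all you need:

pip install -U metaflow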

(Don't forget to take a look at the Metaflow card created by the snippet!)
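
The improved @conda decorator works in the same spirit for environments resolved from Conda channels rather than PyPI. As a rough sketch (the pandas version is simply carried over from the example above; check the documentation for the exact arguments), the start step could instead be declared as:

    # a sketch: the same step, resolved from Conda channels instead of PyPI
    # (requires adding conda to the metaflow imports)
    @conda(python='3.10.11', packages={'pandas': '2.1.1'})
    @card
    @step
    def start(self):
        ...

and the flow would then be run with --environment=conda instead of --environment=pypi.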

When using @pypi or @conda, Metaflow frees the developer 🧙 from having to manage dependencies manually. Here, Metaflow

  1. Sets up isolated virtual environments for every step automatically.
  2. Creates implicit lockfiles, freezing the complete dependency graphs.
  3. Provides a consistent way to create reproducible projects that can be shared across colleagues without friction.
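
To make the first point concrete, here is a minimal sketch of a flow in which two steps get two separate, isolated environments (the package names and versions are purely illustrative):

from metaflow import FlowSpec, step, pypi

class TwoEnvsFlow(FlowSpec):

    # this step executes in its own virtual environment containing only pandas
    @pypi(python='3.10.11', packages={'pandas': '2.1.1'})
    @step
    def start(self):
        import pandas as pd
        self.rows = len(pd.DataFrame({'x': range(5)}))
        self.next(self.end)

    # this step executes in a separate environment with a different package set
    @pypi(python='3.10.11', packages={'requests': '2.31.0'})
    @step
    def end(self):
        import requests
        print('rows:', self.rows, '- requests', requests.__version__)

if __name__ == '__main__':
    TwoEnvsFlow()

Run it with --environment=pypi as before: each step is resolved, locked, and executed against its own environment.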

However, the decorators are not just syntactic sugar over pip or poetry. Under the hood, they take care of many core concerns related to scalable compute and reliable production systems.

Building reliable AI/ML systems

The superpower of @pypi and @conda stems from the fact that they create and manage relocatable, stable execution environments under the hood.

From the point of view of a platform engineer 👷, the decorators take care of many concerns that naive ad hoc approaches to dependency management tend to create, allowing you to control and audit the whole software supply chain of your ML and AI systems.

The decorators allow you to run flows at arbitrary scale in the cloud:

python pandasflow.py --environment=pypi run --with kubernetes
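
The same pattern extends to Metaflow's other compute integrations. For instance, assuming your stack uses AWS Batch rather than Kubernetes, the equivalent would be:

python pandasflow.py --environment=pypi run --with batch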

Crucially, Metaflow saves you from having to bake and manage hundreds of custom Docker images for different use cases, or, even worse, from the headache of having to maintain a single universal image.

You can also deploy production workflows that run in stable environments that are fully isolated from any dependency issues:

python pandasflow.py --environment=pypi argo-workflows create

Magically, this can even work with your existing workflow orchestrator:

python pandasflow.py --environment=pypi airflow create

If you are curious to know how all this works under the hood, take a look at the internals of Metaflow's dependency management subsystem:

Pypi internals

Acknowledgements

Romain Cledat of Netflix has been leading the development of the new dependency management subsystem in Metaflow. Today's release includes only a part of the improvements that Netflix has been working on. If you are curious, you can test the bleeding-edge versions of the @pypi and @conda decorators by installing Netflix's extensions for Metaflow.

Wolf Vollprecht of Prefix.dev is a core contributor to the Python dependency management ecosystem, a Conda Forge veteran, and the creator of the mamba package manager. He and his team helped us build the mamba-based machinery that Metaflow uses under the hood.

Try it out!

You can start using @pypi and @conda today:

  1. Take a look at the all-new documentation about dependency management with Metaflow.
  2. Make sure you have the latest version of Metaflow, version 2.10.0.
  3. Add @pypi in your flows and start running!

PS. If the topics of dependency management and software supply-chain security pique the interest of your inner 👷, see our recent Fireside Chat on the topic.