Metaflow 2.11: Building Observable ML/AI Systems

Today, we are releasing three major enhancements in open-source Metaflow Cards:

  1. Cards can now update live while a task is executing, so you can monitor progress and eyeball results in real time.
  2. New card components: observe task progress with built-in progress bars and visualize results using versatile charts based on Vega-Lite.
  3. A local card viewer that lets you view live cards without installing the Metaflow UI, which is useful for getting a quick feel for cards and developing them rapidly.

To get a taste of what you can do with the new dynamic cards, take a look at this teaser video:

If you are eager to test the new features hands-on, you can do so easily in the Metaflow Sandbox, which provides a new playground for dynamic cards. For more background on why all this matters, keep on reading.

Background

Metaflow helps you access data, run compute at scale, and orchestrate complex sets of ML and AI workflows while keeping track of all results automatically. In other words, it helps you build real-life data, ML, and AI systems.

It is one thing to build systems and another to operate them reliably. Operations are hard because systems can misbehave or fail in innumerable ways; we need to quickly understand why and proceed to resolve the issue. The challenge is especially pronounced in data, ML, and AI systems, which can exhibit a cornucopia of failure patterns: some related to models, some to code, some to infrastructure, and many related to data.

Monitoring models vs. observing systems

You may have heard of the term model monitoring in the context of ML systems. It is not a bad idea to keep an eye on models, but monitoring models captures only a sliver of operational challenges. Consider this excellent blog post by Cindy Sridharan, which contrasts the concepts of monitoring and observability:

“Monitoring” is best suited to report the overall health of systems. Aiming to “monitor everything” can prove to be an anti-pattern. Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes, as well as black box tests.

“Observability,” on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes.

We released Metaflow Cards in 2022 to make it easy to create highly granular, customizable views into the behavior of your ML/AI workflows:

From the observability point of view, cards provide three key features:

  1. They are easily customizable with a few lines of Python and live right alongside your code and models, so you can focus on the things that matter at arbitrary levels of detail. This means that you can integrate observability in your workflows from the beginning instead of adding it in a separate tool as an afterthought (typically only after the first painful failures).
  2. They are tightly coupled with their execution context (a @step), versioned, and retained for posterity, in contrast to dashboards that can go out-of-sync with reality quickly.
  3. Using them requires adding no new dependencies, no new infrastructure to maintain, no new data pipelines to build, and no finicky integrations to manage. Just add @card!

Over the past few years, cards have grown into a core feature of Metaflow. They have been used to create model reports, share results, and provide insights into the state of countless workflows, and they have been featured in an academic paper. Cards complement, rather than replace, other tools like notebooks, which are great for interactive ad-hoc exploration, and dashboards, which can work well for monitoring established metrics.

The original cards came with a limitation though: They were generated only after a task had completed, limiting their usefulness in observing the state of long-running tasks like model training or a demanding data processing job.

Today, with the release of the new dynamic cards, we are happy to announce that this limitation is lifted, making cards an even more versatile tool for observability!

New in Metaflow: Cards that update in real time

The baseline of all software observability is the print statement. It is perfectly robust and easy to use but limited in its output format. We wanted to make status updates and rich visualizations as easy as adding a print line.

For instance, take a look at this fully functional Metaflow flow that shows a progress bar and a timer:

Basic Progress Bar

from metaflow import step, FlowSpec, current, card
from metaflow.cards import Markdown, ProgressBar

class ClockFlow(FlowSpec):
    @card(type="blank", refresh_interval=1)
    @step
    def start(self):
        from datetime import datetime
        import time

        m = Markdown("# Clock is starting 🕒")
        p = ProgressBar(max=30, label="Seconds passed")
        current.card.append(m)
        current.card.append(p)
        current.card.refresh()

        for i in range(31):
            t = datetime.now().strftime("%H:%M:%S")
            m.update(f"# Time is {t}")
            p.update(i)
            current.card.refresh()
            time.sleep(1)

        m.update("# ⏰ ring ring!")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ClockFlow()

You can test the code in Metaflow Sandbox or on your laptop as follows:

  1. Start a card viewer: python progress.py card server
  2. Run the flow: python progress.py run
  3. Point your browser at localhost:8324 to see this:

The new card server command also makes cards much more approachable for new Metaflow users who might not have the Metaflow UI installed. It comes in handy during rapid development as well, since you can see the results with minimal latency.

Rich visualizations

Instead of providing a zillionth new visualization library for Python, our new VegaChart component relies on the popular and versatile Vega-Lite specification. Here are some examples of live charts that you can produce:

A key benefit of Vega-Lite is that it is a JSON specification, so you can use it without any external dependencies. As before, cards produced by Metaflow are self-contained HTML files, so you can use them in security-conscious environments.
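Because a chart is just JSON, a spec can be assembled from plain Python dictionaries. The following is a minimal sketch rather than an excerpt from the Metaflow documentation: the helper name `make_line_spec`, the `step` and `loss` field names, and the sample values are hypothetical. The commented-out lines show how the resulting dict could be handed to the `VegaChart` component inside a step decorated with `@card`, assuming the component accepts a spec dict directly.

```python
import json


def make_line_spec(values, title="Training loss"):
    """Return a minimal Vega-Lite line-chart spec as a plain dict.

    The helper name and the "step"/"loss" fields are illustrative only.
    """
    rows = [{"step": i, "loss": v} for i, v in enumerate(values)]
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": title,
        "width": 400,
        "height": 200,
        "data": {"values": rows},
        "mark": "line",
        "encoding": {
            "x": {"field": "step", "type": "quantitative"},
            "y": {"field": "loss", "type": "quantitative"},
        },
    }


spec = make_line_spec([1.0, 0.6, 0.4])

# The spec round-trips through JSON, so no plotting library is required.
assert json.loads(json.dumps(spec)) == spec

# Assumed usage inside a step decorated with @card(type="blank"):
#     from metaflow import current
#     from metaflow.cards import VegaChart
#     chart = VegaChart(spec)
#     current.card.append(chart)
#     # later, to redraw a live chart with fresh data:
#     chart.update(make_line_spec(new_values))
#     current.card.refresh()
```

Keeping the spec in a small builder function like this makes it easy to regenerate the chart with new data on every refresh.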

If you want a convenient Python wrapper for Vega-Lite, you can use the Vega-Altair library, as in this example:

Basic Altair Example

from metaflow import step, FlowSpec, current, card, pypi
from metaflow.cards import VegaChart

class AltairFlow(FlowSpec):
    @pypi(packages={"altair": "5.2.0", "vega-datasets": "0.9.0"}, python="3.11.7")
    @card(type="blank")
    @step
    def start(self):
        import altair as alt
        from vega_datasets import data

        source = data.cars()
        brush = alt.selection_interval()
        points = (
            alt.Chart(source, width=500, height=400)
            .mark_point()
            .encode(
                x="Horsepower:Q",
                y="Miles_per_Gallon:Q",
                color=alt.condition(brush, "Origin:N", alt.value("lightgray")),
            )
            .add_params(brush)
        )
        bars = (
            alt.Chart(source)
            .mark_bar()
            .encode(y="Origin:N", color="Origin:N", x="count(Origin):Q")
            .transform_filter(brush)
        )

        chart = VegaChart.from_altair_chart(points & bars)
        current.card.append(chart)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    AltairFlow()

This code produces a beautiful, interactive chart:

You can find many more examples in the Dynamic Cards gallery on GitHub.

Infinitely customizable

You should be able to go far with the new live Markdown, ProgressBar, and VegaChart components, which require only a few lines of Python to use. Should you hit any limitations of these built-in components, you are not blocked: you can also create a fully custom card that can use any JavaScript libraries wrapped in arbitrary HTML.
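At its core, a custom card is a piece of Python that returns an HTML string. The sketch below shows only that rendering step under stated assumptions: the function name `render_points_html` and the point format are hypothetical, and the wiring into Metaflow (as we understand it, a `MetaflowCard` subclass whose `render` method returns the HTML) is covered by the custom cards documentation rather than reproduced here.

```python
import html
import json


def render_points_html(points, title="Custom card"):
    """Return a self-contained HTML page with the data embedded as JSON.

    Hypothetical helper: a real custom card would return a string like
    this from its render method and could load any JavaScript library.
    """
    # Escape "</" so the payload cannot terminate the <script> tag early.
    payload = json.dumps(points).replace("</", "<\\/")
    return f"""<!DOCTYPE html>
<html>
  <head><meta charset="utf-8"><title>{html.escape(title)}</title></head>
  <body>
    <h1>{html.escape(title)}</h1>
    <pre id="out"></pre>
    <script>
      const points = {payload};
      document.getElementById("out").textContent =
        points.length + " points loaded";
    </script>
  </body>
</html>"""


page = render_points_html([{"x": 1, "y": 2, "z": 3}], title="Scatter <demo>")
```

Embedding the data directly in the page keeps the card a single self-contained HTML file, in line with how the built-in cards are delivered.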

To showcase what’s possible with this approach, here’s an example of a WebGL-powered 3D scatter plot, rendered by a scatter3d custom card:

If you are curious, you can read more about creating custom cards and use this example as a starting point for your own cards. When you are done, you can share and install custom cards like this one just like any other Python package:

pip install metaflow-card-scatter3d

If you have created a card that you want to share with the community or you are interested in learning how to do it, please join the Metaflow Slack and let everyone know on #ask-metaflow.

Always-on observability

Observing experiments is nice, but observing production systems is critical. The new dynamic cards are seamlessly integrated with all existing Metaflow features aimed at building production-grade ML/AI systems. For instance, you can run the above code in the cloud:

python progress.py run --with kubernetes

You can use dynamic cards to monitor demanding compute tasks, such as distributed training, which Metaflow makes approachable with the new @ray, @mpi, @pytorch, and @deepspeed decorators.

Once you are happy with the results, you can deploy the flow to production, where it is scheduled by a highly available production orchestrator:

python progress.py argo-workflows create

The same few lines of code work from prototyping to production without having to set up any additional tooling or infrastructure, providing you always-on, in-depth observability with minimal extra effort.

Try live in Metaflow Sandbox

You can start using dynamic cards today! Take a look at a gallery of dynamic cards and updated documentation.

To see dynamic cards in action with the full Metaflow stack, head to the new dynamic cards playground in Metaflow Sandbox, which contains live versions of all the examples in the gallery that you can modify and hack at will. If you have any questions, we are happy to help you on Metaflow Slack.

If you are interested in building production-grade, secure, scalable ML/AI systems with fine-grained observability powered by dynamic cards, you can start doing it on Outerbounds today. You can get started for free - deploying a full stack of ML and AI infrastructure in your cloud account in only 15 minutes!