Metaflow 2.9: Building Reactive Machine Learning Systems with Event Triggering

We are excited to announce open-source Metaflow 2.9, which allows you to compose systems from multiple interrelated workflows, triggered by external events in real time. Use the new feature to build sequences of workflows that pass data between teams, all the way from ETL and data warehouses to final ML outputs.

As highlighted by a tsunami of new AI demos and applications released over the past six months, we are able to use machines to solve increasingly deep and nuanced real-world problems thanks to data, ML, and AI. However, building production-grade solutions for such advanced use cases is not a simple undertaking.

The exciting new opportunities come with a familiar engineering challenge: Beyond prototypes, how can we ensure that we are able to improve our systems rapidly, and maintain them without too many headaches, recognizing the amount of real-world complexity involved? While the question is not new, it is exacerbated to a new level by the presence of data and rapidly evolving ML and AI.

As a representative example, consider a modern recommendation system.

The system consists of three distinct components, modeled as workflows:

  1. Based on fresh data, updated hourly for example, a set of deep learning models produces embeddings that capture aspects of user behavior and context.
  2. The embeddings, together with items, are used to produce personalized recommendations.
  3. The recommendations are distributed to a global caching layer, so they can be served in the product with minimal latency and maximum availability.

Each of these components may be a sizable engineering system by itself, substantial enough to be built and maintained by a separate team.

Looking at the components, one might suggest simplifying the whole setup. Sometimes that is an option, but often dynamic real-world phenomena are complex by nature, and we need to learn to deal with their inherent complexity. Fortunately, complex systems don’t need to be complicated.

Engineering systems with Metaflow

There is a tried and true way to engineer maintainable, complex systems: Build them from robust, composable components iteratively, adding complexity only when necessary. Each component is a functional unit with a clearly defined boundary, so they can be developed and tested independently, and finally, assembled together in creative ways.

Metaflow supports three levels of composition:

  • Modular code level: You can implement logical units of modeling, data processing, and business logic as individual steps. Steps can leverage the full toolbox of easy-to-use, well-documented abstractions provided by Python: Functions, classes, modules, and packages. You can share common modules across steps and leverage third-party libraries with Metaflow’s built-in dependency management.
  • Workflow level: Concurrently with step development, you can compose a workflow that determines where and when the steps are executed. Metaflow provides a clear way to pass data between steps, allocate resources, and leverage parallelism. One of Metaflow’s superpowers is making it easy to test workflows locally, so there isn’t much overhead in moving between Python code and workflows. A minimal sketch after this list illustrates these first two levels.
  • System level: Once you have one or more functional workflows, you may want to start composing a larger system from them. A workflow is a tightly coupled unit of execution: All of its steps either fail or succeed together. To complement this pattern, larger systems benefit from decoupled, asynchronous workflow composition, where a workflow or an external event triggers an independent execution.
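
To make the first two levels concrete, here is a minimal, runnable sketch; HelloFlow and its greeting artifact are illustrative examples, not part of the release:

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Modular code level: ordinary Python logic lives inside a step.
        self.greeting = 'hello'
        self.next(self.end)

    @step
    def end(self):
        # Workflow level: artifacts like self.greeting flow between steps.
        print(self.greeting, 'world')

if __name__ == '__main__':
    HelloFlow()

Running python helloflow.py run executes the workflow locally, which is what keeps the gap between plain Python code and workflows small.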

New in Metaflow 2.9: Event Triggering

The new event triggering feature, now available in Metaflow 2.9, enables the third pattern of decoupled, system-level composition. For years, this pattern has been used at Netflix to power essentially all Metaflow flows in production, so we are very excited to finally support it in open source!

Following the human-centric philosophy of Metaflow, powerful features come in a simple package. You don’t need to be an expert in systems engineering to leverage event triggering, as demonstrated by the examples below. For a more comprehensive overview, see our new documentation for event triggering.

Trigger based on an external event

Let’s start with a common scenario: Triggering a workflow whenever new data appears in the data warehouse.

To trigger a workflow based on an external event, just add a @trigger decorator to your flow:

from metaflow import FlowSpec, trigger

@trigger(event='data_updated')
class FreshDataFlow(FlowSpec):
    ...

This decorator instructs Metaflow to start the workflow whenever an instance of the event data_updated appears in the event queue. To push an event into the queue, say, in an ETL pipeline, use the following two lines:

from metaflow.integrations import ArgoEvent

# Publishing the event starts every deployed flow that waits for it.
ArgoEvent(name='data_updated').publish()

For instance, you can call these lines at the end of an ETL pipeline, bridging data engineering and data science.
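
For illustration, a hypothetical ETL script could end like this; load_new_data is a placeholder for your own pipeline logic:

from metaflow.integrations import ArgoEvent

def load_new_data():
    # Placeholder for your actual ETL work, e.g. loading fresh rows
    # into the data warehouse.
    print('loading fresh data...')

if __name__ == '__main__':
    load_new_data()
    # Once the data has landed, signal downstream flows: every deployed
    # flow decorated with @trigger(event='data_updated') starts automatically.
    ArgoEvent(name='data_updated').publish()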

Flows of flows

The recommendation example above highlighted a core pattern: Chaining workflows together so that the completion of one workflow triggers another automatically.

Given the importance of this pattern, Metaflow provides dedicated support for it in the form of a decorator, @trigger_on_finish. Here is how to make SecondFlow trigger automatically upon completion of FirstFlow:

from metaflow import FlowSpec, trigger_on_finish

@trigger_on_finish(flow='FirstFlow')
class SecondFlow(FlowSpec):
    ...

You don’t need to create events explicitly to chain workflows this way, as Metaflow creates completion events automatically.

Crucially, the owner of FirstFlow doesn’t need to know that SecondFlow depends on it. In other words, a workflow, like the embedding flow mentioned earlier in this article, can have multiple consumers that are fully decoupled from it. Workflow owners may develop and iterate on their workflows autonomously without excessive coordination and communication overhead.
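
For instance, additional consumers can subscribe to the same flow without any coordination; the flow names ReportingFlow and MonitoringFlow below are hypothetical:

from metaflow import FlowSpec, trigger_on_finish

@trigger_on_finish(flow='FirstFlow')
class ReportingFlow(FlowSpec):
    ...

@trigger_on_finish(flow='FirstFlow')
class MonitoringFlow(FlowSpec):
    ...

Either consumer can be added, modified, or removed without touching FirstFlow itself.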

Passing data across flows

Besides triggering execution, you need to be able to pass data and results fluidly across flows. Consider this scenario: You want to refresh a model with a snapshot of historical data whenever new data is available. After this, you want to use the fresh model to produce predictions for the latest data.

It is natural to split the training and inference part into two separate flows. Organizationally, it allows the flows to be developed by separate people. Also, it provides a clean separation between outside-facing inference flows (like in the recsys example above) and internal training flows, which may run at their own cadence.

To produce predictions, InferenceFlow needs to access the latest model version produced by TrainingFlow.

TrainingFlow can use @trigger to start whenever new data is available, and InferenceFlow can use @trigger_on_finish to start upon the completion of TrainingFlow, as sketched below.
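
Here is a minimal sketch of the pair, shown in one snippet for brevity; in practice, each flow would live in its own file, owned by its respective team:

from metaflow import FlowSpec, trigger, trigger_on_finish

@trigger(event='data_updated')
class TrainingFlow(FlowSpec):
    ...

@trigger_on_finish(flow='TrainingFlow')
class InferenceFlow(FlowSpec):
    ...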

Metaflow makes it easy to access data across flows. In this case, InferenceFlow can access the model trained by TrainingFlow simply by referring to it as follows:

# Inside a step of InferenceFlow:
self.model = current.trigger.run.data.model
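
In context, a minimal runnable sketch of InferenceFlow could look like the following, assuming TrainingFlow stores its trained model in an artifact named model:

from metaflow import FlowSpec, current, step, trigger_on_finish

@trigger_on_finish(flow='TrainingFlow')
class InferenceFlow(FlowSpec):

    @step
    def start(self):
        # current.trigger.run is the TrainingFlow run that triggered
        # this execution; its artifacts are accessible via .data.
        self.model = current.trigger.run.data.model
        self.next(self.end)

    @step
    def end(self):
        # Produce predictions with self.model here (placeholder).
        pass

if __name__ == '__main__':
    InferenceFlow()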

Experiment with confidence

Imagine that you set up a chain of TrainingFlow and InferenceFlow as outlined above. The chain works well in production, but eventually you may want to experiment with a new model architecture.

If this were a business-critical application, you would not dare to deploy the new challenger model in production without a comprehensive evaluation. Optimally, you would run the new model side by side with the existing one, conducting A/B tests on real production data to get an accurate read.

Managing this level of complexity causes headaches on many platforms, as you need to support multiple concurrent, isolated variants of chains of workflows.

Metaflow supports this in a straightforward manner. Simply add the @project decorator to every flow:

from metaflow import FlowSpec, project, trigger_on_finish

@trigger_on_finish(flow='FirstFlow')
@project(name='variant_demo')
class SecondFlow(FlowSpec):
    ...

and deploy the challenger chain in a separate branch:

python secondflow.py --branch new_model argo-workflows deploy

The @project decorator sets up namespaces so that data, models, and triggering events are fully isolated in each variant, making sure that the experimental version can’t interfere with the production environment.
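
Under @project, the production chain and any number of challenger branches live side by side. As a sketch of the two deployments (the --production flag deploys the blessed production namespace, while --branch creates an isolated variant):

# Deploy the production variant of the chain:
python secondflow.py --production argo-workflows deploy

# Deploy an isolated challenger variant alongside it:
python secondflow.py --branch new_model argo-workflows deploy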

Summary

As demonstrated above, event triggering provides a principled, human-friendly way to compose advanced ML and data systems from multiple interrelated workflows that react to events in real time. The feature allows you to manage complexity, providing a well-defined way to divide work across teams and to compose systems from independent components.

The end result is a fully functional system with a clear operational runbook. You can deploy and test variants of workflows concurrently, each variant safely managed by Metaflow. You can operate the system with a high level of observability, thanks to the end-to-end lineage of artifacts and events, which Metaflow tracks automatically.

Next steps

You can start using event triggering right away! Here’s what to do next:

  1. Sign up for a Metaflow sandbox to see and test event triggering in action without having to install anything locally. We have updated the onboarding lessons to include a new section that showcases event triggering.
  2. Learn more in our new documentation for event triggering.
  3. Once you are ready to start using the feature in production in your own environment, you can get going with our standard deployment templates for Kubernetes.
  4. For business-critical use cases, consider adopting the Outerbounds Platform that provides the complete Metaflow stack as a managed service, including a highly available event backend.

If you have questions or need help with anything, you can join us and thousands of other data scientists and engineers on Metaflow Slack!
