
Better Airflow with Metaflow

You can now schedule and orchestrate workflows developed with Metaflow on Apache Airflow.


In data engineering, Apache Airflow is a household name. It is easy to see why many companies get started with it: it is readily familiar to most data engineers, it is quick to set up, and, as proven by the millions of data pipelines it has powered since 2014, it can clearly keep DAGs running.

The same data pipelines that contribute to Airflow’s popularity have also contributed to countless hours of debugging and missed SLAs, revealing fundamental issues in Airflow’s design. The widely documented issues fall into two categories: a suboptimal developer experience and operational headaches, both discussed below.

Today, we are releasing support for orchestrating Metaflow workflows using Airflow. The integration is motivated by our human-centric approach to data science: Still today, many data scientists, ML engineers, and data engineers are required to use Airflow. We want to provide them with a better user experience and a stable API, which allows them to develop projects faster and start future-proofing their projects with minimal operational disruption.

Can we fix Airflow?

While walking through a DAG seems like a textbook exercise, it is easy to underestimate the number of engineering-years and battle scars it takes to build a real-world, production-grade workflow orchestrator. These challenges are not limited to large companies: in a modern, experimentation-driven culture every variant counts, so even smaller companies can accumulate a surprising number of workflows quickly. In 2023, developing and deploying a new workflow variant should be as easy as opening a pull request.

Each workflow can spawn thousands of tasks – imagine conducting a hyperparameter search as part of a nightly model training workflow. And every workflow and task needs to be executed in a highly available manner while reacting to a torrent of external events in real time. We talked about these topics at length when we released our integration with another production-grade orchestrator, AWS Step Functions.

Herein lies the root cause of many issues in Airflow: It served its original use cases well but its design and architecture are not suitable for the increasing demands of modern data (science) stacks. Airflow is perfectly capable of orchestrating a set of basic data pipelines, but with the increasing demands of ML and data science – and more modern data organizations – its cracks are becoming visible.

Fixing these issues while maintaining backward compatibility with the millions of existing Airflow pipelines is nigh impossible. Airflow will surely keep improving, as it did with the major release of Airflow 2.0, but migrating existing pipelines to new untried APIs is not necessarily easier than migrating to another, more modern orchestrator.

As a result, many companies find themselves in a pickle: They have a hairball of business-critical data pipelines orchestrated by Airflow, encapsulating years of accumulated business logic. At the same time, they are becoming increasingly aware that the system is slowing down their development velocity and causing avoidable operational overhead.

Develop with Metaflow, deploy on Airflow

We want to provide a new path for teams that find themselves in this situation. Our new Airflow integration allows you to develop workflows in Metaflow, using its data scientist-friendly, productivity-boosting APIs, and deploy them on your existing Airflow server, as shown in the video below:


Deploying a Metaflow flow to Airflow requires no changes to the code, just a single command:

python flow.py airflow create airflow_dag.py

Operationally, nothing changes. The resulting workflows get scheduled like any other Airflow workflows, and they live happily side by side with your existing Airflow-native workflows. Under the hood, Metaflow translates its flows to Airflow-compatible DAGs automatically, so the operational concerns are invisible to data scientists, who can benefit from the features of Metaflow.
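For illustration, here is a minimal sketch of the kind of flow you might develop and deploy this way, saved as flow.py so the command above can translate it into an Airflow DAG. The flow name and steps are hypothetical, not part of the integration itself:

from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Prepare inputs; anything assigned to self is versioned as an artifact.
        self.data = [1, 2, 3]
        self.next(self.train)

    @step
    def train(self):
        # Placeholder for real training logic.
        self.model = sum(self.data)
        self.next(self.end)

    @step
    def end(self):
        print("model:", self.model)

if __name__ == "__main__":
    TrainingFlow()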

Consider the benefits of using Metaflow for workflow development compared to Airflow:

[Comparison table: Metaflow vs. Airflow for workflow development]

All these features are critically important for data science and ML projects, and they also pair well with the modern data stack.

Coexisting flows

The Airflow integration in Metaflow allows you to benefit from nearly all features of Metaflow, while using a central Airflow server to orchestrate all workflows. This allows data and platform engineers to schedule, monitor, and operate all DAGs on your existing Airflow deployment without changes, while allowing data scientists to benefit from a modern toolchain that matches their needs.

Crucially, the integration doesn’t try to provide complete coverage of all Airflow features. Airflow comes with a myriad of APIs, operators, and sensors, many of which are not relevant for most data scientists and new projects. A key feature of the integration is that you can run native Airflow pipelines and newly authored Metaflow projects side-by-side, so if a feature is missing in the integration, you can implement the workflow natively for Airflow as before.

We are excited to support basic foreaches, a.k.a. dynamic task mapping in Airflow, as they are a key construct in Metaflow (you need Airflow 2.3 or newer to use them). However, you will get a clear error message if you try to deploy a Metaflow flow containing nested foreaches, as Airflow doesn’t support them yet. Also, @kubernetes is currently the only supported compute layer; it maps to Airflow’s KubernetesPodOperator.
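As a sketch of what this looks like in practice, the flow below fans out a training step over a small hyperparameter grid with a basic foreach and runs each branch on Kubernetes; the flow name, grid, and resource requests are illustrative only:

from metaflow import FlowSpec, kubernetes, step

class SweepFlow(FlowSpec):

    @step
    def start(self):
        # A basic foreach fans out one task per item in this list,
        # which the integration maps to dynamic task mapping in Airflow.
        self.learning_rates = [0.01, 0.05, 0.1]
        self.next(self.train, foreach="learning_rates")

    @kubernetes(cpu=2, memory=4096)  # each branch runs as a KubernetesPodOperator task
    @step
    def train(self):
        # self.input is the learning rate assigned to this branch.
        self.score = 1.0 - self.input  # placeholder for real training
        self.next(self.join)

    @step
    def join(self, inputs):
        # Join the branches and pick the best result.
        self.best = max(i.score for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best)

if __name__ == "__main__":
    SweepFlow()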

Since Metaflow works with all the major clouds (AWS, Azure, and Google Cloud), its Airflow integration works out of the box in each of them when you use your own self-managed Airflow server.

Besides self-managed Airflow, we have tested the integration with AWS’ managed Airflow, Amazon MWAA, which works well with Metaflow too. If you don’t have legacy reasons to use Airflow, Metaflow’s existing integrations with AWS Step Functions or Argo Workflows will likely provide a better experience, as they come with fewer limitations (in particular, nested foreaches are supported).

Develop with Metaflow, choose where to deploy

At some point, you may want to explore whether an alternative workflow orchestrator would provide better scalability characteristics, higher availability, and better developer experience than Airflow in your environment.

Workflows written using Airflow’s APIs are naturally only compatible with Airflow, so orchestrating them elsewhere is not an option. However, workflows developed with Metaflow can be orchestrated by any system supported by Metaflow, like AWS Step Functions or Argo Workflows, without any changes in your code, as shown in the video below:


This means that you can start authoring workflows with Metaflow today and deploy them initially on Airflow, avoiding a massive migration project upfront. Over time, as more workflows are written with Metaflow, you can start testing them on, say, Argo Workflows without any migration tax.
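For example, assuming the corresponding backends are configured in your Metaflow deployment, the same flow.py can be sent to a different orchestrator just by changing the deployment command:

# deploy to Airflow
python flow.py airflow create airflow_dag.py

# deploy the same flow to Argo Workflows
python flow.py argo-workflows create

# or to AWS Step Functions
python flow.py step-functions create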

If the results are promising, you can run Argo Workflows and Airflow side by side – soon you will be able to trigger Argo workflows from Airflow via events – again minimizing the need for abrupt changes. You can keep operating with this pattern as long as needed, for instance, if data engineers want to keep using Airflow, or you can use this as a smooth migration path to a modern workflow orchestrator.

Get started today

It is easy to get started with the new Airflow integration. Simply install Metaflow as usual:

pip install metaflow

write a flow, save it as flow.py, and execute:

python flow.py airflow create airflow_dag.py

and deploy the resulting Airflow DAG to your Airflow server!

For more information, take a look at our user-facing documentation for Airflow and Using Airflow with Metaflow for engineers.

If you have any questions or you want to share your thoughts or feedback about the integration, join us and over 2000 data scientists and engineers in the Metaflow community Slack workspace!
