You can now schedule and orchestrate workflows developed with Metaflow on Apache Airflow.
In data engineering, Apache Airflow is a household name. It is easy to see why so many companies start with it: it is familiar to most data engineers, it is quick to set up, and, as proven by the millions of data pipelines it has powered since 2014, it can clearly keep DAGs running.
The same data pipelines that contribute to Airflow’s popularity have also contributed to countless hours of debugging and missed SLAs, revealing fundamental issues in Airflow’s design. These widely documented issues fall into two categories, discussed below: a suboptimal developer experience and operational headaches.
Today, we are releasing support for orchestrating Metaflow workflows with Airflow. The integration is motivated by our human-centric approach to data science: even today, many data scientists, ML engineers, and data engineers are required to use Airflow. We want to provide them with a better user experience and a stable API, allowing them to develop projects faster and start future-proofing their projects with minimal operational disruption.
Can we fix Airflow?
While walking through a DAG seems like a textbook exercise, it is easy to underestimate the engineering-years and battle scars it takes to build a real-world, production-grade workflow orchestrator. These challenges are not limited to large companies: in a modern, experimentation-driven culture every variant counts, so even smaller companies can accumulate a surprising number of workflows quickly. In 2023, developing and deploying a new workflow variant should be as easy as opening a pull request.
Each workflow can spawn thousands of tasks – imagine conducting a hyperparameter search as a part of a nightly model training workflow. And every workflow and task needs to be executed in a highly available manner while reacting to a torrent of external events in real time. We talked about these topics at length when we released our integration with another production-grade orchestrator, AWS Step Functions.
Herein lies the root cause of many of Airflow’s issues: it served its original use cases well, but its design and architecture are not suited to the increasing demands of modern data (science) stacks. Airflow is perfectly capable of orchestrating a set of basic data pipelines, but under the growing demands of ML, data science, and more modern data organizations, its cracks are becoming visible.
Fixing these issues while maintaining backward compatibility with the millions of existing Airflow pipelines is nigh impossible. Airflow will surely keep improving, as it did with the major release of Airflow 2.0, but migrating existing pipelines to new untried APIs is not necessarily easier than migrating to another, more modern orchestrator.
As a result, many companies find themselves in a pickle: They have a hairball of business-critical data pipelines orchestrated by Airflow, encapsulating years of accumulated business logic. At the same time, they are becoming increasingly aware that the system is slowing down their development velocity and causing avoidable operational overhead.
Develop with Metaflow, deploy on Airflow
We want to provide a new path for teams that find themselves in this situation. Our new Airflow integration allows you to develop workflows in Metaflow, using its data scientist-friendly, productivity-boosting APIs, and deploy them on your existing Airflow server, as shown in the video below:
Deploying a Metaflow flow to Airflow requires no changes to the code, just a single command:
python flow.py airflow create airflow_dag.py
Operationally, nothing changes. The resulting workflows are scheduled like any other Airflow workflows and live happily side by side with your existing Airflow-native workflows. Under the hood, Metaflow translates its flows to Airflow-compatible DAGs automatically, so the operational concerns stay invisible to data scientists, who can simply benefit from the features of Metaflow.
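In practice, deploying a compiled flow can be as simple as dropping the generated file into your scheduler’s DAG folder. A minimal sketch, assuming a flow file named flow.py and the conventional $AIRFLOW_HOME/dags location (both are example paths, not requirements of the integration):

```shell
# Compile the Metaflow flow into a standard Airflow DAG file.
python flow.py airflow create airflow_dag.py

# The output is a regular Airflow DAG definition, so it is deployed
# like any other DAG: copy it into the scheduler's dags folder.
cp airflow_dag.py "$AIRFLOW_HOME/dags/"
```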
Consider the benefits of using Metaflow for workflow development compared to Airflow:
- Faster development speed thanks to local testing and debugging experience: You can develop and execute Metaflow workflows locally, resume executions at any point, and use idiomatic Python for structuring your projects. This is a key reason why companies choose Metaflow.
- Scalable compute – Airflow was never designed for high-performance, large-scale compute, which is critical for data science and ML workloads. In contrast, Metaflow treats both horizontal and vertical scalability as a first-class concern.
- Seamless data flow – It is not very convenient to maintain state across steps through Airflow’s XCom mechanism, so steps tend to be rather isolated units of execution. In contrast, data artifacts are a key construct in Metaflow, making it more natural to express stateful, data-centric business logic as a workflow.
- Built-in versioning and tracking – Metaflow tracks and versions all executions, artifacts, and code automatically, enabling experiment tracking, model monitoring, model registries, and versatile experimentation out of the box.
All these features are critically important for data science and ML projects and they also pair well with the modern data stack.
The Airflow integration in Metaflow allows you to benefit from nearly all features of Metaflow, while using a central Airflow server to orchestrate all workflows. This allows data and platform engineers to schedule, monitor, and operate all DAGs on your existing Airflow deployment without changes, while allowing data scientists to benefit from a modern toolchain that matches their needs.
Crucially, the integration doesn’t try to provide complete coverage of all Airflow features. Airflow comes with a myriad of APIs, operators, and sensors, many of which are not relevant for most data scientists and new projects. A key feature of the integration is that you can run native Airflow pipelines and newly authored Metaflow projects side-by-side, so if a feature is missing in the integration, you can implement the workflow natively for Airflow as before.
We are excited to support basic foreaches – known as dynamic task mapping in Airflow – as they are a key construct in Metaflow (you need Airflow 2.3 or newer to use them). However, you will get a clear error message if you try to deploy a Metaflow flow containing nested foreaches, as Airflow doesn’t support them yet. Also, at the moment only @kubernetes is supported as a compute layer, mapping to KubernetesPodOperator in Airflow.
Since Metaflow works with all the major clouds – AWS, Azure, and Google Cloud – its Airflow integration works in all of them out of the box when you use your own self-managed Airflow server.
Besides self-managed Airflow, we have tested the integration with Amazon MWAA, AWS’s managed Airflow, which works well with Metaflow too. If you don’t have legacy reasons to use Airflow, Metaflow’s existing integrations with AWS Step Functions or Argo Workflows will likely provide a better experience, as they come with fewer limitations (in particular, nested foreaches are supported).
Develop with Metaflow, choose where to deploy
At some point, you may want to explore whether an alternative workflow orchestrator would provide better scalability characteristics, higher availability, and better developer experience than Airflow in your environment.
Workflows written using Airflow’s APIs are naturally only compatible with Airflow, so orchestrating them elsewhere is not an option. However, workflows developed with Metaflow can be orchestrated by any system supported by Metaflow, like AWS Step Functions or Argo Workflows, without any changes in your code, as shown in the video below:
This means that you can start authoring workflows with Metaflow today and deploy them initially on Airflow, avoiding a massive migration project upfront. Over time, as more workflows are written with Metaflow, you are able to start testing them on, say, Argo Workflows without any migration tax.
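Because the flow code is orchestrator-agnostic, switching targets is just a matter of running a different deploy command against the same file. A sketch, again assuming an example flow file named flow.py:

```shell
# One flow definition, three possible deployment targets.

# Deploy on an existing Airflow server:
python flow.py airflow create airflow_dag.py

# ...or on AWS Step Functions:
python flow.py step-functions create

# ...or on Argo Workflows:
python flow.py argo-workflows create
```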
If the results are promising, you can run Argo Workflows and Airflow side by side – soon you will be able to trigger Argo workflows from Airflow via events – again minimizing the need for abrupt changes. You can keep operating with this pattern as long as needed, for instance, if data engineers want to keep using Airflow, or you can use this as a smooth migration path to a modern workflow orchestrator.
Get started today
It is easy to get started with the new Airflow integration. Simply install Metaflow as usual:

pip install metaflow

write a flow, save it as flow.py, and execute:

python flow.py airflow create airflow_dag.py

Then deploy the resulting airflow_dag.py to your Airflow server!
If you have any questions or you want to share your thoughts or feedback about the integration, join us and over 2000 data scientists and engineers in the Metaflow community Slack workspace!