Metaflow Now Supports All the Major Clouds: Google Cloud, Azure, and AWS

Use Metaflow on any of the major clouds – Google Cloud, Azure, or AWS – without changing your code, as shown in this 90-second video (watch on YouTube):

Training a decision tree model with parallelized hyperparameter search on GCP, Azure, and AWS

Today, we are releasing support for open-source Metaflow on Google Cloud. This completes our journey to support Metaflow on all the major clouds: it started with AWS-native integrations in 2019, continued with Kubernetes support in 2021, which unlocked Azure a few months back, and now reaches Google Cloud.

The journey has been motivated by our original vision for Metaflow: Data scientists and ML engineers are asked to develop and deploy models for a diverse set of ML-powered use cases. In order to do this, they need to leverage foundational infrastructure – data access, scalable compute, workflow orchestration, all tracked and versioned to allow efficient collaboration and incremental improvements.

Rather than using six separate tools for the job, a more effective solution is to use a cohesive toolchain with a human-friendly API, providing support throughout the project lifecycle.

The full stack of data science and ML infrastructure from the point of view of a data scientist and an engineer

The Cloud-Native Stack for ML

Data science and ML can’t live on an island of their own. To produce sustainable business value, ML infrastructure must integrate with the same cloud environment that is already used by the rest of the organization.

Fortunately, all the major clouds – Azure, AWS, and GCP – provide mature, cost-effective, functionally equivalent solutions for the foundations of this stack, so companies can choose their cloud vendor based on other business considerations, including pricing. In particular, scalable data storage and compute have become commodities, which is a boon to resource-hungry ML workloads.

The low-level APIs provided by the clouds are not suitable for direct human consumption. However, when wrapped in a human-friendly framework like Metaflow, the cloud frees data scientists from the confines of a laptop or of notebooks running on a single instance. From the productivity perspective, it is highly beneficial to convert cheap CPU cycles to human happiness and effectiveness.

A main goal of Metaflow is to make projects more cloud-native and hence more production-ready by design, so they can be deployed to production quickly and frequently. Data scientists can execute more experiments concurrently and develop flows that are scalable and robust by default, avoiding expensive rewrites and tedious handovers from data scientists to engineers.

Metaflow on Google Cloud

For an idea of how Metaflow works on Google Cloud, take a look at this short video:

Behind the scenes, Metaflow uses an auto-scaling GKE cluster to execute your models and business logic. You can use any Google Compute instances in your GKE cluster, including GPU instances. The workflow state and its results are persisted in Google Cloud Storage.

By design, these technical details are hidden from data scientists focusing on models and data, as illustrated below. Also by design, the technical details are not heavy-handedly abstracted away from engineers who want to stay in control and who need to make sure that ML applications adhere to the company’s security and other policies and integrate seamlessly with surrounding systems.

Workflow orchestration is handled by Metaflow’s integration with Argo Workflows and, soon, Airflow. Versioning and metadata tracking are handled by Metaflow’s built-in metadata service. All of these services are deployed on GKE automatically as part of the stack.

Thanks to Metaflow’s openness and extensibility, you can easily use it with other Google services, such as Cloud TPUs for deep learning, Vision API for image recognition, or BigQuery for data processing. Importantly, you can develop on Google managed notebooks or locally in your favorite IDE.

To get started with Metaflow on GCP, simply apply our open-source Terraform templates following the instructions at Metaflow Resources for Engineers. Deploying the full stack on your GCP account takes about 15-30 minutes.

Metaflow vs. Bundled ML Platforms

We get this question often: Why should I use Metaflow instead of SageMaker on AWS, Azure ML Studio on Azure, or Vertex AI on Google? All these platforms provide a full stack of infrastructure for data science and ML, conveniently bundled in your existing relationship with a cloud vendor.

Consider the following reasons:

  • Human-friendly, Pythonic user experience – Metaflow provides a coherently designed, concise Python API that is easy to understand and learn for anyone who has used Python in notebooks. The bundled platforms tend to be complex portfolios of loosely coupled and sometimes quickly evolving components, each with its own bespoke paradigms, configuration style, and SDKs.
  • Speed of iteration – Metaflow treats prototyping and the local development experience as a first-class concern. You can edit, run, and debug code locally on your laptop or favorite development environment and scale to the cloud only when needed.
  • Extensibility – Proprietary ML platforms are opaque black boxes. They can be great when they work but if they don’t, your options are limited. An open-source framework like Metaflow is fully transparent and infinitely extensible based on your bespoke needs and business requirements. Teams can build their own company-specific abstractions on top of Metaflow, as Netflix, Zillow, and SAP have done, for instance.
  • Stability – Metaflow provides a strong backwards compatibility guarantee: Metaflow flows written in 2018 run without changes four years later, and they will continue doing so in the future. This is a critical feature of Metaflow as it powers business-critical workflows at hundreds of companies. In general, companies prefer open-source as their application backbone, as it allows them to avoid vendor-controlled systems that change out of their control, causing unnecessary migration tax.
  • Understandable costs – Vertex AI and SageMaker impose an additional surcharge of 10-20% over standard instance pricing for ML workloads. In addition, every component of the stack incurs additional costs, so estimating the total cost of ownership of bundled platforms can be tricky. With Metaflow, you only pay for the basic compute and data, which are likely to be line items in your cloud bill already. Furthermore, you can optimize your costs freely, e.g. by using spot/ephemeral instances or by leveraging volume discounts.
  • Avoid cloud lock-in – The abstractions provided by each cloud are highly bespoke. For instance, migrating an application built on, say, SageMaker would require a nearly complete rewrite to run on Vertex AI. Migrating an application to run on your own infrastructure, say, a Kubernetes cluster, would be even harder.
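The surcharge mentioned in the cost bullet above is easy to put in numbers. The sketch below uses a hypothetical base price of $1.00 per instance-hour and 1,000 instance-hours of compute, both made up for illustration and not actual prices of any cloud. It only models the 10-20% surcharge, not the per-component costs of a bundled platform.

```python
from decimal import Decimal


def cost_with_surcharge(base_hourly_rate, hours, surcharge):
    """Total compute cost when a fractional surcharge is added to the base price."""
    return base_hourly_rate * hours * (1 + surcharge)


# Hypothetical inputs for illustration only, not real cloud prices.
base_rate = Decimal("1.00")  # dollars per instance-hour
hours = Decimal("1000")      # instance-hours of ML compute

plain = cost_with_surcharge(base_rate, hours, Decimal("0"))
low = cost_with_surcharge(base_rate, hours, Decimal("0.10"))
high = cost_with_surcharge(base_rate, hours, Decimal("0.20"))

print(f"standard instances:  ${plain:.2f}")
print(f"with 10% surcharge:  ${low:.2f} (extra ${low - plain:.2f})")
print(f"with 20% surcharge:  ${high:.2f} (extra ${high - plain:.2f})")
```

Under these assumed numbers, the surcharge alone adds between $100 and $200 to a $1,000 compute bill, and the gap grows linearly with usage.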

Thankfully, the modern clouds provide a level playing field when it comes to the foundational layers, data and compute. The bundled platforms rely on the same underlying services, such as EC2 and S3, as everyone else, so open-source frameworks like Metaflow can match their high availability and scalability, addressing the most critical technical consideration.

Also, the choice doesn’t have to be either-or: you can use specific bundled services, say, Vertex AI Data Labeling or SageMaker for model hosting, in conjunction with Metaflow and evolve the service mix over time as your needs grow.

Get Started Easily

To get started with Metaflow on Google Cloud – or any other major cloud – follow the instructions here. To learn more about Metaflow, see our resources for data scientists or try Metaflow in the browser!

Most importantly, join our welcoming community of 1800+ data scientists, ML and platform engineers on Slack. We look forward to hearing your questions and feedback there!
