Today, we are releasing a major new capability in Metaflow: You can now use Metaflow for distributed high-performance computing in general, and large-scale AI training in particular. We do this by providing infrastructure primitives for easy-to-use, cost-efficient compute in various clouds, which we have integrated with popular frameworks like PyTorch, Ray, TensorFlow, DeepSpeed, and MPI. Read this article for a high-level overview and the second part for technical details.
As every leader of a data organization knows, data sitting in a data warehouse or a lake has rather limited value. Data becomes valuable when you do something with it: Transform, query, analyze, and expose it to power products and processes.
Inevitably, doing things with data requires compute cycles, which is why compute is a foundational layer of the ML/AI stack:
Existing data platforms are very capable at doing things with data, in particular when you can express the desired operations in SQL, which is the lingua franca for transforming, querying, and projecting facets of data. Alongside SQL, machine learning workloads have existed as another modality for “doing things with data”, but at most places they have been a marginal activity compared to bread-and-butter data engineering, business intelligence, and analytics.
The Next Era of Data and Compute
With the advent of large language models, for the first time in decades, we have a whole new paradigm for leveraging data. Although in 2023 we have only scratched the surface of what will be possible, early signs suggest that the new techniques have the potential to produce more value over time than traditional SQL-based workloads.
The upside of this is hard to overestimate in the long term. A downside is that ML, and the latest AI techniques in particular, are extremely compute-hungry. Building an index takes a fraction of the compute required to train a model, and running an average SQL query is trivially cheap compared to inference.
To make the situation more challenging, the new requirements can’t be addressed simply by adding a few GPUs behind a data warehouse. The new workloads are more diverse and fragmented, ML/AI frameworks evolve faster than anything in the old data world, and the underlying hardware landscape will experience tectonic shifts as companies want to benefit from ML and AI without exploding their R&D budgets.
In short, the infrastructure stack for data and compute is changing. Now is a good time to start thinking about how you are going to handle the new ML/AI workloads, alongside the existing SQL-based ones.
Compute Everything with Metaflow
Since its inception, a killer feature of Metaflow has been its straightforward attitude towards compute. It doesn’t require you to learn a new paradigm like Hadoop or Spark, although you can keep using them with Metaflow. Rather, it makes it trivially easy to run Python functions, first locally on your laptop, and then in the cloud without fuss: Just
run --with kubernetes
Crucially, Metaflow recognizes the fact that compute never exists in isolation: Functions are not islands - they form workflows of interdependent steps. They don’t spin cycles in a void; they process data that needs to flow through the steps effortlessly. And, you want to track, debug, and observe everything on the way.
As a result, you can architect and operate a new class of compute and data-intensive applications leveraging the massive Python ecosystem of AI, ML, and data tools without headaches. You can stay focused on data, models, and business logic, avoiding finicky abstractions and constrained environments.
Choose a toolbox, not a hammer
There can’t be a one-size-fits-all approach to compute. You should be able to choose the right approach based on the stage of the project, the size of data and models, and other non-functional requirements like cost, SLA, latency, and throughput.
Consider the journey of a typical Metaflow project:
You can start with local compute on your laptop or a cloud workstation (such as the ergonomic and secure ones included in Outerbounds) providing the fastest iteration speed during development:
Once you hit the limits of your local environment, you can simply request more resources for your functions, that is, scale vertically.
The power of this straightforward approach shouldn't be underestimated these days, as this allows you to access the latest and greatest GPUs in the cloud of your choosing or on-premise - without changing anything in the code.
Or, you can trivially execute functions performing different operations on the same data in parallel, aka task parallelism, each with their own resource requirements:
When the job requires larger-scale parallelism, say, to perform the same operation on many shards of data, aka data parallelism, or to train many models in parallel, scaling horizontally, you can express it simply by giving Metaflow a list to process.
Metaflow will spawn independent tasks in the cloud, even thousands of them, to execute the job:
New in Metaflow: Compute clusters on the fly
Our vision for compute turned out to be prescient: the explosion of generative AI in 2023, large language models in particular, skyrocketed demand for fine-tuning on GPUs. For the past few years, we have been developing support for even more advanced compute patterns, to augment the compute toolbox provided by Metaflow.
Autodesk is a great example of a company that sees massive opportunities in applying generative AI techniques in real-world problems at a very large scale. We collaborated with them to enable support for a new class of compute in Metaflow: Distributed compute involving interconnected tasks, based on an ephemeral, tightly connected compute cluster, created on the fly:
This pattern has been used in high-performance computing (HPC) for decades, often by custom-built software based on MPI, the Message Passing Interface standard that Metaflow now supports too! It has seen a resurgence with large AI models, like the GPT and Dall-E models trained by OpenAI, which can’t fit on a single server.
Metaflow for AI
To learn more about today’s announcements, why Autodesk chose Metaflow, and how Metaflow and Outerbounds can help you build compute-intensive systems, join a live release event on November 6th, 2023 at 2pm PT or watch a recording after the event:
To give you a quick idea of the benefits, here are some examples of how Metaflow helps you ship production-ready AI and ML projects faster:
If you use proprietary APIs like OpenAI’s, e.g. with the Retrieval-Augmented Generation (RAG) pattern, you can leverage Metaflow to build a system that produces embeddings on the fly using GPUs, populating a vector database.
If your use cases involve image generation, e.g. using variants of Stable Diffusion, Metaflow helps you generate images at scale cost-effectively.
If you want to fine-tune an open-source foundation model with your own data, securely on your own infrastructure, Metaflow helps provision GPUs for the task - either many GPUs on a single node, or a cluster of smaller, more widely available GPUs using the new distributed training functionality.
If you want to train your own foundation models from scratch using a fleet of GPUs, e.g. using Ray or PyTorch Distributed, the new support for distributed training can make that happen.
Or, outside generative AI, you may want to modernize your existing high-performance computing workflows powered by MPI, integrating them with modern cloud environments, data warehouses, and experiment tracking layers.
Notably, the new distributed training functionality is relevant for smaller use cases, too, as it allows you to leverage cheaper, more widely available GPUs, making LLM fine-tuning more accessible for all.
New Metaflow extensions for distributed training
In all these cases, Metaflow provides easy access to compute, but it doesn’t dictate the software libraries you use, which are likely to evolve quickly over the coming years. Developers require flexibility at the top of the stack:
Today, we release new Metaflow extensions for Ray, DeepSpeed, PyTorch, TensorFlow, and MPI, so data scientists and ML engineers can use their favorite tools for the job without having to focus on infrastructure. For more information about the new extensions, see this article that dives into technical details.
Freedom to compute
Metaflow delights infrastructure engineers too. It allows them to configure access to various compute providers while providing a single, cloud-agnostic interface to their ML/AI teams:
Due to the heterogeneous nature of the next era of compute, sometimes it is cost-efficient to leverage on-premise compute resources, sometimes Azure, AWS or GCP, and sometimes specialized GPU clouds like CoreWeave that we have been partnering with for large-scale GPU workloads. One size doesn’t fit all.
By taking this approach, you can radically decrease your compute costs, as you can avoid using compute resources with hefty markups, like those supplied by many data warehouses, AI vendors, and services like AWS SageMaker or Google Vertex AI.
Importantly, flexibility doesn't mean that you have to compromise on user experience or enterprise concerns like governance and security: With Outerbounds, you can mix, match, and leverage the lowest-cost compute directly, without markup. Our vision is to treat compute as a low-level commodity which you should be able to use freely to fuel innovation. Humans are expensive, computers are not.
Start developing full-stack AI systems today
While compute matters, it is only one layer of the stack.
You need to feed data to compute quickly and securely, and orchestrate multiple steps of compute, while tracking all work that happens. Furthermore, the systems you build are not islands - they need to be deployed and integrated with your products and services. Metaflow provides the easiest way to develop data- and compute-intensive systems with a full-stack, developer-friendly API, which you can then run reliably on Outerbounds, seamlessly integrating with your infrastructure.