Delightfully easy, large-scale compute across environments with @slurm and @kubernetes

You can now start using Slurm with Metaflow by installing the new metaflow-slurm extension. Learn how it fits into the compute platform provided by Outerbounds, which allows you to move workloads across environments easily.

Open-source Metaflow has been powering large-scale cloud computing at Netflix and other companies for years. As of today, Netflix alone has persisted over 10 PB of Metaflow artifacts, spanning thousands of projects and millions of executions.

A key factor behind Metaflow's success is its fanatic focus on the developer experience across the entire stack of common needs, empowering data scientists and ML/AI developers to scale their compute workloads without having to learn new paradigms or abandon their favorite frameworks.

No-nonsense compute, often at scale 

Rather than proposing a new, all-encompassing paradigm for distributed computing, we've adopted a pragmatic approach by providing the fundamental building blocks for scalability — specifically, provisioning cloud instances and clusters — while managing the underlying technical complexities.

On top of this foundation, we provide core services such as fault tolerance and checkpointing, adapters for distributed frameworks like PyTorch, DeepSpeed, and MPI, first-class support for dependency management, including automatic containerization, and high-throughput IO paths.

As a result, companies like Artera are able to run demanding HPC workloads over thousands of large GPU instances. Others leverage the distributed map support in AWS Step Functions to orchestrate massive-scale parallelism with AWS Batch, which Metaflow supports natively.

Expanding the compute footprint

While cloud-native companies like Netflix and many others have been able to maintain most of their compute footprint in a single cloud, we have been increasingly hearing from companies that want to access compute capacity across clouds, including on-premise datacenters.

While the design of Metaflow lends itself naturally to multi-cloud and hybrid compute (the user can just request the @resources they need without worrying about the underlying infrastructure), computing across environments presents a number of non-trivial technical challenges. Namely, on the backend side, you need to provide a unified, cross-cloud mechanism for

  • Security, authentication, and authorization,
  • Observability,
  • Workflow orchestration,
  • Queuing and workload management

to ensure that the end-user has a frictionless experience. We have been building these features into Outerbounds, backing Metaflow’s delightful developer APIs with a robust and scalable compute platform.

Outerbounds allows you to leverage your existing compute clusters, as well as burst to multiple clouds easily, through a unified set of APIs. Should you need special hardware resources that are not readily available on AWS, Azure, or GCP, we provide seamless integrations to GPU providers like CoreWeave and NVIDIA DGX Cloud.

Introducing @slurm

Our main workhorse for workload management is highly optimized Kubernetes, securely deployed in your cloud account. Over the past years, we have worked with many organizations that leverage (or would like to leverage) hybrid compute environments, including their existing Slurm clusters. These clusters can run either in the cloud, using managed Slurm services such as AWS Parallel Computing Service or Azure CycleCloud, or on on-premise infrastructure.

You can now connect these clusters in Outerbounds, embedding them in the unified compute fabric. This allows you to:

  • Provide a delightful, modern developer experience to all users, including the ones who are not HPC or distributed systems experts by training.
  • Speed up development by making it easy to develop and test code locally and then run it at scale without any code changes, sparing resources in the main Slurm cluster.
  • Unify compute across environments, joining on-premise and cloud compute, as well as clusters managed by Slurm and Kubernetes.
  • Enhance your Slurm clusters with additional capabilities, such as observability, workflow orchestration, and versioning and tracking provided by Metaflow out of the box.
  • Ensure security and compliance by relying on the centralized authentication, authorization, and security layer included in Outerbounds.

When it comes to serious HPC workloads, the unified approach to compute allows you to leverage MPI, gang scheduling, and various queuing options built into Outerbounds.

Slurm made delightful 

It is hard to overstate the importance of a delightful user experience. It feels magical to be able to run compute across Slurm and other cloud resources with simple code like this:

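from metaflow import FlowSpec, step, kubernetes, resources, card, current
from metaflow import slurm  # available once metaflow-slurm is installed
from metaflow.cards import ProgressBar

# CONFIG is a dict of Slurm connection settings, assumed to be defined elsewhere

class SlurmFlow(FlowSpec):

    # Preprocess on Kubernetes in Azure, then fan out one task per batch
    @kubernetes(cpu=16, memory=64000, target="azure")
    @step
    def preprocess_data(self):
        self.batches = self.preprocess(self.load_batches())
        self.next(self.process_on_slurm, foreach='batches')

    # Each batch runs as its own Slurm job and reports progress on a card
    @slurm(**CONFIG)
    @resources(memory=16000, cpu=2)
    @card(type='blank', refresh_interval=1)
    @step
    def process_on_slurm(self):
        # self.num (the number of files in this batch) is assumed to be set elsewhere
        p = ProgressBar(max=self.num - 1, label='Number of files processed')
        current.card.append(p)
        for i, fname in enumerate(self.input):
            self.process(fname)
            p.update(i)
            current.card.refresh()
        self.next(self.join)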

This example runs a step, preprocess_data, on Azure utilizing 16 CPU cores and 64 GB of RAM to preprocess a dataset to be processed on Slurm. The subsequent step, process_on_slurm, is executed on a Slurm cluster, which may be on-prem, parallelizing processing over multiple concurrent Slurm jobs, thanks to Metaflow’s foreach construct.

The developer doesn’t have to worry about packaging the code or moving data across environments, as Metaflow takes care of it under the hood. The example also demonstrates observability in action by showing a ProgressBar that updates in real time as data is being processed on Slurm.
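To make the fan-out and join pattern behind foreach more concrete, here is a minimal sketch in plain Python. It is not Metaflow code and does not talk to Slurm: it is only an analogy, in which a thread pool stands in for the concurrent jobs. The function names (make_batches, process_batch, run_fan_out) and the batch size are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(files, batch_size):
    """Split a list of file names into consecutive batches (the fan-out input)."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def process_batch(batch):
    """Stand-in for one foreach task: handle every file in a single batch."""
    return [name.upper() for name in batch]


def run_fan_out(files, batch_size=2, max_workers=4):
    """Fan out over batches, process them concurrently, then join the results."""
    batches = make_batches(files, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input batches
        per_batch = list(pool.map(process_batch, batches))
    # The "join": flatten the per-batch results into a single list
    return [item for batch in per_batch for item in batch]


if __name__ == "__main__":
    print(run_fan_out(["a.csv", "b.csv", "c.csv"]))
```

In the flow above, Metaflow plays the role of the pool: each element of batches becomes its own task, and a later join step collects the results.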

Running the code couldn’t be more straightforward: the flow is launched like any other Metaflow flow, with the standard run command.

Try it out in your own environment

To get a taste of Metaflow-on-Slurm, you can simply install the new @slurm extension:

pip install metaflow-slurm

Follow the instructions on the extension’s PyPI page to manually set up access to your Slurm cluster and authentication to the resources used by Metaflow. If you need any help with the setup, we are happy to help you on the Metaflow community Slack.

Once you are ready to get more serious about the setup and/or you have more sophisticated needs, it is easy to deploy Outerbounds in your cloud account, which provides all the benefits outlined above, a proper authentication layer in particular. We are happy to show you how; just send us a note to get started.
