Yesterday, we released two major new features that help you develop ML/AI faster. Today, we are releasing a set of foundational features focused on scaling:
- @huggingface_hub for loading and caching large (foundation) models,
- @model for storing and registering models,
- @checkpoint for robust training jobs,
- improved support for distributed training with @torchrun and other framework-specific decorators, and
- @slurm for integrating existing HPC clusters into your compute pools.
The first three decorators are now available in open source, benefiting all Metaflow users. To get started, take a look at this repository and pip install metaflow-checkpoint.
The new features are seamlessly integrated into Outerbounds, enhancing our unified approach to compute. For a broader overview and more details on the decorators, keep reading.
The nature of AI/ML compute
Much has been written about the compute needs of AI. Much of the discussion has focused on the massive capital investments in (GPU) hardware, driven by the needs of pretraining large foundation models.
While these topics grab flashy headlines, they represent only a small fraction of real-world ML/AI projects. Most companies don’t train - and won’t need to train - large foundation models. In terms of t-shirt sizes, instead of an XXXL cluster, they might need an XL, M, or S.
Instead of indiscriminately crunching everything published online, these companies train sophisticated, domain-specific models, such as those needed to optimize energy efficiency or to cure cancer. The use cases are countless and varied.
Most importantly, companies like these need a compute stack that allows them to experiment quickly, operate the stack with minimal overhead, and avoid breaking the bank.
The full stack of compute
To fulfill these requirements, it is beneficial to consider the full stack of hardware and software involved in model training and data processing in general:
- The hardware layer at the bottom used to be rather homogeneous, but that is changing quickly. After decades of x86 CPU dominance, the diversity of compute hardware is surging: From ARM, NVIDIA GPUs, and cloud-specific hardware such as AWS Trainium and Google TPUs to a long tail of specialized AI chips.
- The infrastructure layer is largely occupied by the cloud hyperscalers, AWS, GCP, Azure, and increasingly by challengers such as Oracle, CoreWeave, Lambda Labs and many others.
- To run your workloads, you need to containerize them for execution. This includes device drivers such as CUDA as well as third-party library dependencies such as PyTorch.
- On top of the stack, you have your custom logic and models which leverage the full stack underneath.
Outerbounds helps you manage and operate the purple-shaded area. By taking care of the middle parts, it allows you to focus on custom models and workloads, leveraging state-of-the-art hardware. You are able to
- unify infrastructure providers into a single compute platform, smoothly moving compute between clouds in a cost-efficient manner. In addition, you can mix in specialized providers such as GPUs directly from the NVIDIA cloud, and even on-prem resources.
- leverage automatic containerization using Metaflow’s dependency management APIs and the new Fast Bakery feature released yesterday.
- and finally, define your workloads using Metaflow’s human-friendly APIs, which let you access compute in various shapes and sizes, as sketched below.
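As an illustration of that last point, here is a minimal sketch of what requesting compute and dependencies looks like in a flow. The flow name, resource figures, and package versions are illustrative placeholders:

from metaflow import FlowSpec, step, resources, pypi

class ResourceDemoFlow(FlowSpec):

    # Request a containerized environment with pinned dependencies and a GPU.
    @pypi(python="3.10", packages={"torch": "2.3.0"})
    @resources(cpu=8, memory=32000, gpu=1)
    @step
    def start(self):
        import torch  # resolved from the packages declared above
        print("CUDA available:", torch.cuda.is_available())
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ResourceDemoFlow()

Running the same flow with --with kubernetes (or --with batch) sends the step to remote compute without changing the code.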
New: Finetune foundation models with @model and @huggingface_hub
A whole new class of workloads has emerged over the past few years. Instead of training a model from scratch, you start with an existing foundation model, e.g. a large language model (LLM) or a multimodal model, and finetune it for your own needs.
By far the most popular repository for public foundation models is HuggingFace. Hence, it is convenient to have native support for it in Metaflow, now available through a new decorator, @huggingface_hub, that integrates seamlessly with another new decorator, @model.
Consider this snippet that loads a video generation model, Stable Video Diffusion by Stability AI (which we have demonstrated in the past), and proceeds to finetune it in the @nvidia cloud using 4 H100 GPUs:
@huggingface_hub
@step
def start(self):
    self.video_model = current.huggingface_hub.snapshot_download(
        repo_id='stabilityai/stable-video-diffusion-img2vid',
    )
    self.next(self.finetune)

@model(load=["video_model"])
@nvidia(instance_type='H100', gpu=4)
@step
def finetune(self):
    process(current.model.loaded(['video_model']))
At first glance, this may not seem like much, but there's a lot happening behind the scenes:
- The foundation model is cached in Metaflow’s datastore, which is optimized for high throughput, allowing you to access the foundation model much faster than from HuggingFace directly.
- The model is registered and versioned automatically with a reference stored as a Metaflow artifact.
- You can load the model in any environment, across clouds, using the new @model decorator, which makes the model available for further processing.
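Because the reference is stored as a regular Metaflow artifact, you can also inspect it afterwards with the standard Client API. Here is a minimal sketch, assuming the flow above is named VideoFinetuneFlow (a placeholder) and has completed at least one run:

from metaflow import Flow

# Fetch the latest successful run and read the artifact written in the start step.
run = Flow("VideoFinetuneFlow").latest_successful_run
reference = run.data.video_model
print(run.pathspec, reference)

Any downstream step, or an entirely separate flow, can then attach @model(load=[...]) to materialize the model locally, just like the finetune step above.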
To see and test the decorators in practice, take a look at the following examples:
- a bare-bones example that demonstrates basic functionality,
- a parameter-efficient (LoRA) fine-tuning example, and
- a stable diffusion example.
We will dive into details about managing models with @model in future posts.
New: Train and finetune confidently with @checkpoint
Imagine running a more realistic version of the above snippet, fine-tuning a model on four H100 GPUs for several hours. Even a relatively modest job like this can cost several hundred dollars, and it doesn’t take much for the training costs to scale into the four-figure range.
A training job can fail for multiple reasons: transient infrastructure issues, logic errors, or bad data. Unless you persist snapshots of the model during training, that is, @checkpoint it, you risk losing time and money if the job fails.
To make it easier to checkpoint training jobs, we now provide a new @checkpoint decorator that takes care of much of the boilerplate related to loading and saving checkpoints. Consider this simple example that trains a model with PyTorch, saving the model state as a checkpoint at every epoch using PyTorch’s standard model serialization mechanism (as shown in the train module):
checkpoint_path = None  # no checkpoint to resume from by default
start_epoch = 0
if current.checkpoint.is_loaded:
    checkpoint_path = os.path.join(
        current.checkpoint.directory, "best_model.pth"
    )
    start_epoch = int(
        open(
            os.path.join(
                current.checkpoint.directory, "checkpoint_epoch_number"
            )
        ).read()
    )
self.best_loss, self.best_acc, self.latest_checkpoint = train(
    checkpoint_path=checkpoint_path,
    num_epochs=self.epochs,
    model_save_dir="./model_checkpoints/" + current.task_id,
    start_epoch=start_epoch,
)
Just by adding @checkpoint, you can ensure that the training will automatically resume from the latest checkpoint, should the task fail for any reason. For transparency, @checkpoint produces a card that visualizes checkpoints persisted over the course of task execution:
For a more interesting example, take a look at an example flow that fine-tunes a large Llama 3 model to be used with the NVIDIA NIM inference service, integrated in Outerbounds. Running the flow takes about three hours on an instance with eight H100 GPUs, which warrants the peace of mind offered by @checkpoint.
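For context, the saving side inside the train helper might look roughly like the sketch below. The file layout mirrors the loading snippet above, but the current.checkpoint.save call and its signature are assumptions about the extension's API rather than a documented interface:

import os
import torch
from metaflow import current

def save_epoch(model, save_dir, epoch):
    # Write the model weights and the epoch marker that the loading code expects.
    os.makedirs(save_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(save_dir, "best_model.pth"))
    with open(os.path.join(save_dir, "checkpoint_epoch_number"), "w") as f:
        f.write(str(epoch))
    # Assumption: persist the directory to the datastore so a retried task can
    # resume from it via current.checkpoint.is_loaded, as shown earlier.
    current.checkpoint.save(save_dir)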
Stay tuned for more details about @checkpoint behavior in future posts and documentation!
New: Distributed training on Outerbounds
In most training scenarios, you can get far with a single large GPU instance. However, in certain cases there is so much training data - imagine, say, terabytes of video - that training on a single instance is not feasible. Or the model may simply be too large to fit on a single instance.
In cases like this you must resort to distributing training over a cluster of (multi-GPU) instances. In terms of the compute stack presented above, distributed training requires coordination between the infrastructure and the workload layer, as a whole set of instances needs to be gang-scheduled to execute in parallel.
A year ago, Metaflow gained support for a number of framework-specific decorators, like @torchrun and @metaflow_ray, which can be used to create ephemeral training clusters as a part of a Metaflow flow. In the initial release, the decorators supported distributed training on AWS Batch only.
After working with a number of demanding customers who have stress-tested distributed training on Outerbounds, we are happy to launch it as a generally available feature! This feature in particular benefits from the new @checkpoint and @model functionality. Under the hood, @checkpoint is tuned to work well in distributed environments, e.g. when using Distributed Data Parallel (DDP) training with PyTorch.
Our distributed training example, which trains an image recognition model on the CIFAR-10 dataset, neatly demonstrates the new decorators in action:
@step
def start(self):
    self.next(self.train, num_parallel=self.cluster_size)

@parallel_card
@retry(times=4)
@metrics_logger
@pypi(python="3.10", packages=packages)
@gpu_profile(interval=1)
@kubernetes(
    cpu=8,
    memory=16000,
    gpu=NUM_GPUS_PER_NODE,
    shared_memory=8000,
)
@model
@checkpoint
@torchrun(all_nodes_started_timeout=60 * 60)
@step
def train(self):
    ...
You can observe training in real time in the Outerbounds UI, in this case on a cluster with eight nodes: the red square in the screenshot below represents the main control node and the rest are workers.
While there are a number of alternative ways to set up a distributed training cluster, a major benefit of Outerbounds is the set of features it offers around distributed training:
- you can define containers for training on the fly using Fast Bakery,
- you can embed (distributed) training as a part of a Metaflow flow, benefiting from all the other parts of Metaflow and Outerbounds, such as cards, event-triggering, tracking through artifacts, and managed production orchestration, and
- clusters are subject to the same security policies, CI/CD workflows, and authentication as all the other parts of the system.
This means that you can build real-world ML/AI systems that include distributed training as a seamless component, instead of treating it as an exotic island.
New: Support for Slurm clusters with @slurm
Last but not least, our infrastructure support is expanding! You can now target Slurm clusters simply by adding @slurm in your flows, as shown in this quick video:
This feature is especially useful for pharmaceutical companies, research labs, and other HPC users who want to modernize their compute environment without costly migrations.
Simply add your existing Slurm cluster as an additional compute pool in Outerbounds. You can then leverage the full suite of Outerbounds features to modernize your HPC setup instantly, enabling seamless hybrid computing between on-premise Slurm clusters and cloud resources.
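As a rough sketch of what this looks like in a flow, the step below targets a Slurm compute pool with the bare decorator. The import location and the lack of configuration arguments are illustrative assumptions, not the documented interface:

import socket
from metaflow import FlowSpec, step, slurm  # assumption: the decorator is importable like this

class HybridFlow(FlowSpec):

    @slurm  # runs this step on the Slurm compute pool registered in Outerbounds
    @step
    def start(self):
        # Placeholder workload: record which Slurm node executed the step.
        self.node = socket.gethostname()
        self.next(self.end)

    @step  # subsequent steps can run on cloud compute pools as usual
    def end(self):
        print("Ran on", self.node)

if __name__ == "__main__":
    HybridFlow()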
Start building today
The @huggingface_hub, @model, and @checkpoint decorators are available in open source as Metaflow extensions. To get started, take a look at this repository. Note that we are actively working on these features, so the APIs may evolve over time. We would love to hear your feedback and questions on the Metaflow community Slack.
To experience all these features and the full stack of compute in your cloud environment without engineering effort, get started with Outerbounds in 15 minutes.
For more announcements, come back tomorrow! Also, if you're interested in building systems with GenAI specifically, don't miss our webinar with NVIDIA on Friday.