New: Faster Cloud Compute

Most cloud compute environments today use a generic stack - standard container images, off-the-shelf registries, and unoptimized Kubernetes setups - leading to suboptimal performance for ML/AI workloads. In this post, we share benchmarks from our latest optimizations that cut task startup latency in your cloud account by 4x or more.

Imagine you are an ML/AI developer working on a computationally intensive project - maybe data processing, model training, or a cutting-edge AI agent that demands significant compute for reasoning. Building any of these workflows involves countless cycles of trial and error, so each iteration should run quickly to minimize frustration and maximize development speed.

To illustrate this scenario, watch the video below, which demonstrates a typical use case - running a model built with PyTorch and comparing its startup time across three different environments:

In all three cases, the workload runs on a fresh instance without any cached container images, reflecting a common scenario where new instances are launched on demand. The differences are stark:

  1. In the first case, we use Outerbounds’ automatic containerization mechanism, Fast Bakery, to bake a custom image including PyTorch, optimized for the workload at hand. The task completes in 23 seconds.
  2. In the second case, we use Outerbounds’ optimized compute pools to pull a large, off-the-shelf image from ECR which includes PyTorch and other ML/AI libraries. The task completes in 33 seconds.
  3. The last case represents the baseline of using open-source Metaflow with AWS Batch, again pulling the same image from ECR. The task completes in 83 seconds.

Crucially, all workloads run entirely within your own cloud account, ensuring that no data or compute resources ever leave your environment. If you're looking to reduce task startup latency by 4x or more in your cloud setup, read on!

Running cloud tasks fast (without idling instances)

We want to make tasks executing in the cloud feel as responsive as running code on your laptop or in a local notebook. Since Outerbounds is always deployed within your own cloud environment, maintaining pools of pre-warmed instances - while effective at reducing latency - is often impractical due to significantly increased cloud costs. We still prioritize running workloads cost-efficiently.

To optimize the system end-to-end, it is beneficial to look at the big picture. As of today, practically all container-based compute environments, such as AWS Batch and Kubernetes, involve three core components - an image builder, a container registry, and a container runtime - which prepare and serve standard container images for execution.

These components are typically operated as independent services that you can mix and match - depicted as gray boxes below. While flexible, this makes it harder to optimize performance across the full pipeline. Contrast this to the streamlined fast path, integrated in Outerbounds, which spans the entire container lifecycle, from task definition to packaging and fast execution in the cloud.

Let’s walk through the container lifecycle:

  1. First, your task and its execution environment need to be packaged as a container image. This image building step is typically handled through a CI/CD pipeline, such as GitHub Actions. Outerbounds integrates the step seamlessly into the platform with Fast Bakery, so developers don’t have to interact with separate systems to execute their code in the cloud - and we can create images in seconds, optimized for fast execution.
  2. The resulting image is pushed to a container registry, either to a cloud-provided one such as Amazon Elastic Container Registry (ECR), or a separate service such as Docker Hub. Outerbounds ships with a built-in Fast Registry that is tightly integrated with our new high-throughput container runtime, as highlighted in the benchmarks below.
  3. The image is pulled for execution in a compute environment such as Kubernetes or AWS Batch. Since Kubernetes was originally designed to orchestrate long-running microservices, it doesn't natively optimize for low startup latency. As a result, platforms like Outerbounds and others implementing Function-as-a-Service (FaaS) patterns on Kubernetes must introduce their own latency optimizations - such as our new Fast Container Runtime.

New: Fast Container Runtime

We started our optimization efforts with Fast Bakery, which makes it trivial for developers to build and push images to a container registry on the fly - without having to write Dockerfiles or think much about it at all. It is magical to see giant images - often 10GB or larger - bake in seconds.

As our optimization efforts progressed, the container runtime emerged as the major remaining bottleneck. This core component is responsible for pulling images and preparing them for low-latency execution. If you’ve never thought about container runtimes before, chances are you’ve been using containerd - the default in Kubernetes.

Our new Fast Container Runtime replaces containerd. The new runtime is fully transparent by design - you can use it with any existing container image, from any registry, without changing your workloads. As shown below, it performs impressively well with unmodified images and often even better with those built using Fast Bakery. Importantly, it also lays the groundwork for future optimizations.

To add to the magic: if you are an existing Outerbounds customer, it is likely that you are already using the Fast Container Runtime, which we have been rolling out with zero downtime.

How fast is Fast Container Runtime?

Let’s take a look at the impact of the new container runtime on real-world workloads. As shown in the video above, we ran benchmarks on a fresh instance with no cached data - capturing cold start latency, a scenario frequently encountered in auto-scaling systems. Once images are cached on the instance, subsequent startups become significantly faster.

We benchmarked four typical ML/AI workloads:

  1. AWS Deep Learning Image (9.93GB)
  2. The official PyTorch image (3.1GB)
  3. A minimal PyTorch environment created on the fly with Metaflow’s @pypi, including numpy and pytorch (2.7GB)
  4. A typical ML environment created on the fly with @pypi, including altair, xgboost, and scikit-learn (0.65GB)

The chart below compares the performance of Outerbounds (yellow) to open-source Metaflow running @kubernetes on EKS (blue) and Metaflow running on AWS @batch (purple):

Task launch times

Unsurprisingly, task launch latency correlates strongly with image size. Across all workloads, Kubernetes has a slight edge over Batch; both use stock containerd. The difference between the two services is minor compared to the typical 4x-10x speedup on Outerbounds, powered by the Fast Container Runtime.

Interestingly, the relative speedup is highest - about 10x - with the smallest image. In this case, Outerbounds benefits from a minimal image built by Fast Bakery compared to Metaflow setting up an environment on the fly on EKS and Batch. This shows that the benefits apply not only to large ML/AI environments but also to bread-and-butter tasks that use common small Python libraries.

Consider a nightly flow running 2,000 small tasks like this - a typical Metaflow workload. Shaving 30 seconds off each task saves roughly 500 instance-hours monthly, or about $4,500 annually for tasks occupying an m5.4xlarge.
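As a back-of-the-envelope check of that figure (assuming the flow runs every night, and using the us-east-1 on-demand rate for an m5.4xlarge, roughly $0.77/hour at the time of writing):

```python
# Back-of-the-envelope savings estimate; the instance-hour rate is
# illustrative (m5.4xlarge on-demand, us-east-1).
tasks_per_night = 2000
seconds_saved_per_task = 30
nights_per_month = 30
usd_per_instance_hour = 0.768

hours_saved_monthly = (tasks_per_night * seconds_saved_per_task
                       * nights_per_month) / 3600
annual_savings = hours_saved_monthly * 12 * usd_per_instance_hour

print(round(hours_saved_monthly), round(annual_savings))  # ~500 hours, ~$4,600
```

The savings scale linearly with task count, so wider foreach fan-outs benefit even more.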

Much time is wasted in downloading images

Key culprits for slowness in standard setups, e.g. on vanilla EKS, are container registries and containerd, neither of which is optimized for maximum throughput. Below, we compare three images hosted on ECR, DockerHub, and Outerbounds:

From a container registry to a runtime

The Bert and PyTorch Vision images are typical ML/AI images based on the official PyTorch image. The SciKit image adds numpy, scikit-learn, and pandas on top of the official (slim) Python image.

In our earlier Fast Bakery article, we highlighted how traditional registries can be sluggish when uploading images. This benchmark completes the other side of the equation: they are also slow at downloading images and handing them to the container runtime, which is arguably an even bigger issue, since the cost compounds with every new instance.

We are not alone in this observation. Initially, we were excited about the “ultra-fast boot times” featured in the linked article, provided by eStargz, an optimized image format. Sadly, for typical ML/AI workloads, eStargz poses two problems:

  1. It requires a costly image conversion - more than 10 minutes for the AWS Deep Learning Image.
  2. While boot times improved in our tests - thanks to lazy loading of image layers - the total task execution times still didn’t show a significant improvement. The cost was merely distributed over a longer time period. We didn’t observe significant benefits compared to Fast Container Runtime which is able to set up full images quickly.

Outerbounds has the advantage of managing the entire task lifecycle - from image building to execution. This end-to-end control has allowed us to systematically eliminate bottlenecks, accelerating both existing images and newly baked environments. And we're not done yet - stay tuned for more optimizations.

Seeing is believing

Don’t take our word for it. You can deploy Outerbounds in your cloud account in a few clicks (with a 14-day free trial) and test performance with your own images, packages, and workloads.

Start building today

Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions - nothing is off limits.

