Default Kubernetes configuration can be suboptimal for ML workloads, resulting in wasted human and computer time. We show how a round of troubleshooting and a two-line change in the code yields a nine-fold speedup in the total execution time.
MLOps for Foundation Models with Metaflow
OpenAI’s Whisper is a powerful new multitasking model that can perform multilingual speech recognition, speech translation, and language identification. In a previous blog post titled MLOps for Foundation Models: Whisper and Metaflow, Eddie Mattia discussed using Metaflow to run OpenAI Whisper for transcribing YouTube videos. It covered the basics of Whisper, how to write a Metaflow flow, and briefly touched upon how to run such flows at scale in the cloud. Although here we’re focusing on Whisper, all of this work is generalizable to many types of foundation models (see our work on Parallelizing Stable Diffusion for Production Use Cases, for example).
To deploy and run their production workloads, many enterprises look to Kubernetes as it has become the de-facto way of running applications in the cloud. Its cloud-agnostic APIs (to the user), declarative syntax for resources, and a huge, open-source ecosystem make it a very attractive platform for a number of use cases. On this note, we decided to run OpenAI Whisper using Metaflow on Kubernetes.
However, out of the box, the performance of the system can be quite bad for ML workloads. But, with a little troubleshooting, we were able to identify the bottlenecks and increase performance dramatically. The result is a production-ready, sufficiently performant ML workflow that has the capability to scale out across multiple dimensions.
Whisper Model Types
OpenAI Whisper has multiple machine learning models of varying sizes. These were created using different sets of parameters and support different languages.
Consider the case where a team wants to evaluate the relationship between the size of the ML model, the time for transcription, and the actual quality of the output. For a few different inputs, the team would want to run Whisper using different model sizes and compare the results. Metaflow makes it easy to populate results across these dimensions, so you can make informed decisions about this tradeoff without writing extra code.
For three different YouTube URLs and the tiny and large models, the evaluation would look something like the following:
Each of the circles is a step in Metaflow. transcribe is called for each URL in parallel using Metaflow’s foreach construct. Each transcribe does another foreach for tiny and large models.
The actual transcription happens in the tiny and large steps which use Whisper’s tiny and large machine learning models respectively.
The join step performs any post-processing specific to a given URL. The post-join step can perform the actual evaluation of the quality of the output, the time taken, etc.
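The nested fan-out described above can be sketched in plain Python, without Metaflow, to see how many transcription tasks it produces. This is only an illustration: the URLs are placeholders and `expand_tasks` is a hypothetical helper, not part of the flow's source code.

```python
from itertools import product

# Placeholder inputs; the real flow uses three YouTube URLs.
URLS = [
    "https://example.com/video-1",
    "https://example.com/video-2",
    "https://example.com/video-3",
]
MODELS = ["tiny", "large"]


def expand_tasks(urls, models):
    """Return one (url, model) pair per transcription task.

    The outer foreach fans out over URLs, the inner one over model sizes,
    so the task list is the Cartesian product of the two.
    """
    return list(product(urls, models))


tasks = expand_tasks(URLS, MODELS)
for url, model in tasks:
    print(f"transcribe {url} with the {model} model")
print(len(tasks))  # 3 URLs x 2 models = 6 tasks
```

Each of these six pairs corresponds to one task of the tiny or large step, which is why six transcriptions can run in parallel.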
The source code for this can be found here: https://github.com/outerbounds/whisper-metaflow-k8s
For this particular post, we used 3 videos from the YouTube channel BBC Learning English for evaluation.
To begin with, the flow was run locally on an M1 MacBook Pro with 64GB of memory. This was mainly to confirm that the code executes correctly and the output is in the expected format.
Thanks to Metaflow, running the flow is as simple as executing:
python3 sixminvideos.py run
This run completed in approx. 10 minutes (see below) which is impressive considering that 6 different transcriptions happened in parallel during this time. Note that OpenAI Whisper downloads the machine learning model and caches it locally. In this case, these models were already cached locally. Otherwise, the time would have been longer.
Whisper Models with Metaflow on Kubernetes
Let’s run the same flow on Kubernetes. See these instructions for setup details. In our setup, the Kubernetes cluster ran on AWS and comprised two m5.8xlarge EC2 instances.
The simplest way to run the flow on Kubernetes is to run the same CLI command as above followed by --with kubernetes. Behind the scenes, Metaflow connects to the Kubernetes cluster and runs each step in the cluster. So, running the flow --with kubernetes resulted in each step running in the cloud.
The smallest executable unit in Kubernetes is a pod. In this case, Metaflow takes the code from each step, puts it into a Linux container, and runs every task corresponding to the step as a Kubernetes pod (a step may yield multiple tasks with different inputs in a foreach).
Here’s what an execution on Kubernetes looks like:
python3 sixminvideos.py run --with kubernetes
The flow completed successfully. However, the image above shows that it took 48m15s. That’s almost five times slower than the time it took to run this flow on a local laptop. The promise of running flows in the cloud is to get a performance boost, not a performance penalty!
The source code of the flow was the same in both runs. Looking at the start and end times of the steps, it is clear that Metaflow orchestrated the steps correctly in both runs. The step order was maintained, and steps that were supposed to run in parallel did indeed run in parallel.
We know that the comparison between the local run on the M1 Mac and the run on Kubernetes is not entirely fair, as each pod in Kubernetes lacks shared storage. As a result, models were re-downloaded every time the pods started, adding some overhead. Also, if our Docker image that contains dependencies such as ffmpeg is not cached on the node, it needs to be downloaded, adding further overhead.
Still, the fivefold slowdown doesn’t feel right. Could there be some misconfiguration in Kubernetes that caused the flow to take this long? Let’s find out!
Analyzing workload performance in Kubernetes
Let’s dig deeper into what might be the bottleneck for this performance degradation.
CPU and Memory consumption
When creating a pod, Kubernetes allows the user to request a certain amount of memory and CPU resources. These resources are reserved for the pod at the time of creation. The actual amount of memory and CPU used could vary over the lifetime of the pod. Also, depending on your Kubernetes configuration, a pod could actually use more than the requested amount of resources. However, there is no guarantee that the pod will actually get the resources when its usage exceeds the requested amount.
The following charts show the relationship between the resources requested and used by each pod in this flow (these charts were obtained using this Grafana dashboard running in the cluster).
In the image below, the green line indicates the pod's resource request and the yellow line indicates the actual resource usage. The Y axis shows the number of CPU cores fully utilized by the pod.
It turns out that all the pods in this flow requested 1 CPU but actually ended up using much more, almost 13 cores.
In this chart, the Y axis shows the memory request (the green line) vs. the actual memory usage (the yellow line).
In the case of memory, the pods requested 4GB but ended up using almost 12GB.
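The gap between requests and usage in the two charts can be put into numbers. The sketch below is plain arithmetic on the approximate figures quoted above (1 CPU requested versus almost 13 cores used, 4GB requested versus almost 12GB used); the helper name `burst_ratio` is hypothetical.

```python
def burst_ratio(requested, used):
    """Return how many times larger the observed usage is than the request."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return used / requested


# Approximate peak values read from the charts described in the text.
cpu_ratio = burst_ratio(requested=1, used=13)     # CPU cores
memory_ratio = burst_ratio(requested=4, used=12)  # GB

print(f"CPU usage is about {cpu_ratio:.0f}x the request")
print(f"Memory usage is about {memory_ratio:.0f}x the request")
```

Anything above a ratio of 1 is capacity the pod borrows rather than reserves, so it can disappear as soon as neighbors on the same node get busy.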
A Kubernetes cluster comprises multiple servers (physical server machines, VMs, EC2 instances, etc.). These are called nodes. When a pod is submitted to Kubernetes, it chooses one of the nodes in the cluster to run the pod.
As you can see in the following screenshot of stdout, all pods for this flow ran on the same node: ip-10-10-27-207.us-west-2.compute.internal
Observations
The key problem is that while the run is able to complete successfully, Kubernetes decided to schedule all tasks on the same node, resulting in suboptimal performance.
In the chart below, we can see the two nodes of our cluster. The node depicted by the yellow line has its CPUs fully utilized at ~100% while the green node is idling. Correspondingly, no memory is used on the green node.
In this case, tasks were able to burst above their requested amount of CPU, but due to co-location on the same node, they were competing for the same scarce resources. This is a clear demonstration of the noisy neighbor problem that often plagues multi-tenant systems.
Confusingly, the scheduling behavior depends on the overall load on the cluster, so you may experience a high variance in the total execution time, depending on what other workloads happen to execute on the system at the same time. Variability like this is hard to debug and clearly undesirable, especially for production workloads that should complete within a predictable timeframe.
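To see why small requests allow this kind of co-location, here is a deliberately simplified placement model. It is a sketch under stated assumptions, not the Kubernetes scheduler: the real scheduler filters nodes by whether a pod's requests fit and then scores the remaining candidates, whereas this model simply takes the first node that fits. The node size (32 vCPUs, 128 GiB) is the published m5.8xlarge size; the allocatable amount on a real node is somewhat lower. The names `Node`, `first_fit`, and `make_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: float       # CPU cores still unreserved
    memory_mb: int   # memory still unreserved, in MB
    pods: list = field(default_factory=list)


def first_fit(pods, nodes):
    """Place each pod on the first node whose unreserved capacity covers its
    requests. Actual usage is never consulted, only the requests."""
    pending = []
    for name, cpu, memory_mb in pods:
        for node in nodes:
            if cpu <= node.cpu and memory_mb <= node.memory_mb:
                node.cpu -= cpu
                node.memory_mb -= memory_mb
                node.pods.append(name)
                break
        else:
            pending.append(name)  # no node fits; the pod has to wait
    return pending


def make_nodes():
    # Assumption: two nodes with the nominal m5.8xlarge capacity.
    return [Node("node-a", 32, 128 * 1024), Node("node-b", 32, 128 * 1024)]


# Six tasks, each requesting 1 CPU and 4GB as in the first Kubernetes run.
small_requests = [(f"task-{i}", 1, 4096) for i in range(6)]
nodes = make_nodes()
pending_small = first_fit(small_requests, nodes)
small_layout = {node.name: list(node.pods) for node in nodes}

# The same six tasks with requests close to their observed usage.
large_requests = []
for i in range(3):
    large_requests.append((f"tiny-{i}", 8, 2048))
    large_requests.append((f"large-{i}", 22, 12288))
nodes = make_nodes()
pending_large = first_fit(large_requests, nodes)
large_layout = {node.name: list(node.pods) for node in nodes}

print(small_layout, pending_small)
print(large_layout, pending_large)
```

With 1-CPU requests, all six tasks fit on one node as far as reservations are concerned, even though their combined usage exceeds what the node can deliver. With larger requests, the same model can no longer pack them together: tasks spread across both nodes, and any that do not fit wait until capacity frees up.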
Issues like this are not immediately obvious on the surface, as the run completes eventually without errors. These types of issues can easily go unnoticed, resulting in wasted human and computer time and increased cloud costs.
Fixing the Performance Issues
Metaflow supports providing specific resource requirements to individual steps. These requirements are written in the code as decorators. They are translated into CPU and memory requests in the Kubernetes pod specification. Instead of relying on opportunistic bursting, we fix the resource requirements to reflect the actual usage.
- The tiny steps can use 8 vCPUs and 2GB of memory
- The large steps can use 22 vCPUs and 12GB of memory.
So, the Metaflow flow was set up accordingly and run again.
...
    # Reserve 8 vCPUs and 2GB (2048 MB) for every task of the tiny step.
    @kubernetes(cpu=8, memory=2048, image="https://public.ecr.aws/outerbounds/whisper-metaflow:latest")
    @step
    def tiny(self):
        print(f"*** transcribing {self.input} with Whisper tiny model ***")
        cmd = 'python3 /youtube_video_transcriber.py ' + self.input + " tiny"
        # Run the transcriber script inside the container and capture its JSON output.
        p = subprocess.run(shlex.split(cmd), capture_output=True)
        json_result = p.stdout.decode()
        print(json_result)
        self.result = json.loads(json_result)
        self.next(self.join)

    # Reserve 22 vCPUs and 12GB (12288 MB) for every task of the large step.
    @kubernetes(cpu=22, memory=12288, image="https://public.ecr.aws/outerbounds/whisper-metaflow:latest")
    @step
    def large(self):
        print(f"*** transcribing {self.input} with Whisper large model ***")
        cmd = 'python3 /youtube_video_transcriber.py ' + self.input + " large"
        # Run the transcriber script inside the container and capture its JSON output.
        p = subprocess.run(shlex.split(cmd), capture_output=True)
        json_result = p.stdout.decode()
        print(json_result)
        self.result = json.loads(json_result)
        self.next(self.join)
...
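As a rough illustration of what "translated into CPU and memory requests" means, the sketch below builds the `resources.requests` fragment of a container spec as a plain dictionary. The helper name `requests_fragment` is hypothetical, and the exact quantity strings Metaflow emits are an assumption; only the `resources.requests.cpu` and `resources.requests.memory` field names come from the Kubernetes pod specification.

```python
def requests_fragment(cpu, memory_mb):
    """Build a container-level resources fragment from decorator-style values.

    Assumption: memory is given in MB and rendered with the "M" suffix;
    the real translation inside Metaflow may format quantities differently.
    """
    return {
        "resources": {
            "requests": {
                "cpu": str(cpu),
                "memory": f"{memory_mb}M",
            }
        }
    }


tiny_spec = requests_fragment(cpu=8, memory_mb=2048)
large_spec = requests_fragment(cpu=22, memory_mb=12288)
print(tiny_spec)
print(large_spec)
```

Because the scheduler places pods according to these requests, stating them honestly is what lets Kubernetes avoid stacking all the heavy tasks on one node.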
With this setup, the flow completed in 4m 58s.
Looking at the node CPU utilization, it can be seen that this time, tasks were run on multiple nodes. Notably, the CPU was no longer pegged at 100%.
Conclusions
After peeking under the hood, we were able to reduce the previous completion time of 48 minutes to approximately 5 minutes, resulting in an impressive 9x improvement in performance, simply by adjusting two resource specifications.
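The speedup figure can be checked directly from the two run times reported above (48m15s and 4m 58s). The helper `to_seconds` is a hypothetical name introduced only for this calculation.

```python
import re


def to_seconds(text):
    """Convert a duration such as '48m15s' or '4m 58s' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*(?:(\d+)s)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


before = to_seconds("48m15s")  # first Kubernetes run: 2895 seconds
after = to_seconds("4m 58s")   # run with adjusted requests: 298 seconds
speedup = before / after

print(f"{speedup:.1f}x")  # prints 9.7x
```

The ratio comes out slightly above nine, consistent with the nine-fold speedup quoted in this post.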
The optimization of the run has numerous benefits, including the ability to conduct more experiments. Instead of being limited to testing only tiny and large models, it will now be possible to test various model sizes by adding more parallel steps, each of which will run as separate pods with their own dedicated resources. Additionally, the flow can now transcribe a larger number of videos, instead of being limited to just 3.
Next Steps
If you want to deploy Metaflow and Kubernetes in your own environment, you can do so easily using our open-source deployment templates. You can customize the templates as needed and hook them up in your existing monitoring solutions, so you are able to debug issues as illustrated above. We are happy to help you on our Slack channel, should you have any questions.
If you want to play with the Whisper model with Metaflow by yourself, you can do it easily for free in our hosted Metaflow Sandbox, which comes with a managed Kubernetes cluster – no infrastructure expertise required! We provide a pre-made workspace specifically for Whisper, containing all dependencies needed. Simply click the Whisper workspace at the Metaflow sandbox homepage.
Finally, if you need a full-stack environment for serious ML workloads but you would rather avoid debugging Kubernetes and optimizing workloads manually altogether, take a look at Outerbounds Platform. It provides a customized Kubernetes cluster, optimized to avoid issues like the one described here, deployed on your cloud account, fully managed by us.