
Training a Large Language Model With Metaflow, Featuring Dolly

We show how you can train your own large language model (LLM) easily with Metaflow, using Databricks’ Dolly as a representative example. By using Metaflow to handle LLMs, you can incorporate them into your existing production projects, moving beyond proofs of concept.

In our previous article, Large Language Models and the future of the ML infrastructure stack, we presented this illustrative schematic about the future of ML-powered applications which seamlessly mix foundation models and traditional ML:

This article dives deeper into a specific technical question: how can you fine-tune an open-source foundation model using open-source Metaflow today? As a representative example, we will use the Dolly model that was recently published by Databricks, but the approach presented here applies to other current and many future models as well.

The source code for training Dolly is freely available as a set of straightforward Python scripts, so what’s the point of using Metaflow to orchestrate the training? The reason is simple: In a world where LLMs are treated as models amongst other models powering complex applications, as depicted above, you want to leverage the usual benefits of Metaflow:

  • The ability to use the best modeling approaches and libraries for the job, supporting the quickly evolving library ecosystem around LLMs, but using them in a stable, secure, and production-ready manner.
  • Flexible deployment patterns to support various applications.
  • Built-in versioning of data, code, and models, which is especially critical with complex models like LLMs whose behavior needs to be tracked closely.
  • Easy orchestration of sequences of workflows both locally and in production, making them the backbone of the overall system architecture.
  • Seamless support for scalable cloud compute, including powerful GPU instances.
  • Consistent and fast access to data, even large amounts of it.

With this motivation in mind, let’s dive into technical details.

A brief history of open-source foundation models

Over the past six months, a diverse garden of variously fine-tuned LLMs has emerged based on a few foundation models. This rapid evolution is driven by the fact that training a new foundation model from scratch can cost anywhere between $300k and $1M using cloud TPUs or GPUs. In contrast, fine-tuning a specialized version of a foundation model is orders of magnitude cheaper. For instance, the Dolly model used in this article can be trained in a few hours, costing around $500-$1k.

The following diagram illustrates the sprawling lineage of these models:

Since the original Dolly, Databricks has already followed up with Dolly 2.0, which is based on a different model and is commercially usable thanks to an internally curated fine-tuning dataset. Both Dolly versions are derived from source models built by the team at EleutherAI. The first Dolly is derived from a 6 billion parameter model called GPT-J, whereas Dolly 2.0 is derived from a 12 billion parameter model called Pythia.

Even more recently, Stability AI released StableLM, another foundation language model that fits a similar mold to the GPT-J model that Dolly leverages. There are already several models you can get started with today that fine-tune StableLM for more specific tasks, just as Dolly fine-tunes GPT-J.

These examples highlight the latest waves in LLM workflows, which have revolved around the instruction tuning approach to make the models increasingly suitable for prompting, as popularized by ChatGPT. Naturally, this is not the only use case for LLMs, but it is proving useful for applications based on question answering.

Improvements in instruction tuning models are in a wild-west era, as people explore new techniques and approaches for scaling these workflows. In any case, the high turnover rate of these models points to the importance of understanding where these models come from and what infrastructure choices organizations can make now that will support any of these new modeling approaches.  

Dolly’s supply chain

The user experience of Dolly demonstrates how far the machine learning community, and in particular services like HuggingFace, has come in making sophisticated modeling APIs accessible to broad swaths of developers. We start by loading the pretrained source model:

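    # Checkpoint name assumed here; the text identifies GPT-J 6B as the first Dolly's source model
    model = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")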

Then, Dolly uses the HuggingFace Trainer API to train on an instruction tuning dataset called Alpaca, which was curated by a team at Stanford’s Tatsu Lab. The most important thing to realize for this post is that the dataset was generated using OpenAI APIs, by prompting the text-davinci-003 model. This means that we start from the original state of the GPT-J model and teach it to distill the behavior of OpenAI's larger text-davinci-003 model, mimicking its instruction-following capabilities in our smaller Dolly model. The final dataset contains 52K instruction-tuning examples.
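
To make the shape of an instruction-tuning example concrete, here is a minimal sketch of how a single record might be rendered into training text. The field names (instruction, input, output) follow the published Alpaca dataset; the template wording, the function name, and the sample record are illustrative assumptions, not the exact strings used by the Dolly scripts.

```python
# Sketch: turn one Alpaca-style record into a single training string.
# The template text below is an assumption made for illustration.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def format_record(record: dict) -> str:
    """Render an instruction/input/output record as one text example."""
    # Records with an empty or missing "input" use the shorter template.
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(**record)


# Hypothetical record, not taken from the real dataset.
example = {
    "instruction": "List two uses of a decision tree.",
    "input": "",
    "output": "Classification and regression.",
}
text = format_record(example)
print(text)
```

During fine-tuning, each of the 52K records would be converted to text in a similar way before tokenization.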

Note that due to the dependency on the Alpaca dataset, which has a Creative Commons NonCommercial (CC BY-NC 4.0) license, you cannot use this example or Dolly version 1 as-is in commercial applications. You will need to replace the Alpaca data dependency with your own instruction-tuning dataset or with one whose license permits commercial use.

Training Dolly with Metaflow

To reproduce the Databricks team’s implementation, and to demonstrate Metaflow’s applicability to LLM use cases, we trained Dolly in a variety of contexts. You can find the source code in this repository.

We created a simple Metaflow flow that consists of three steps, start, train, and end. One of the superpowers of Metaflow is that we were able to rapidly iterate on the flow locally on a personal GPU workstation. We followed the existing Dolly scripts and used Microsoft’s deepspeed library to distribute training across many GPUs. Deepspeed integrates tightly with PyTorch and HuggingFace code, so this is a general pattern you can use to train large deep learning models.

Our local workstation didn’t have multiple GPUs, so to test distributed training, we annotated the step with @kubernetes(gpu=4) to execute the train step on a cloud instance with multiple GPUs. Conceptually, the flow looks like this:

In the source code, the core parts of the yellow box look like this:

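    @kubernetes(cpu=32, gpu=4, memory=128000)
    @step
    def train(self):
        # Launch the training module with the deepspeed CLI (subprocess wrapper assumed)
        subprocess.run(
            [
                "deepspeed",
                "--num_gpus=%d" % N_GPU,
                "--module", MODEL_TRAINING_SCRIPT,
                "--local-output-dir", self.local_output_dir,
                "--per-device-train-batch-size", str(self.batch_size),
                "--per-device-eval-batch-size", str(self.batch_size),
                "--lr", str(self.learning_rate),
            ],
            check=True,
        )
        # push model to S3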

When it comes to the code, this is all it takes to train the model!
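
Because subprocess arguments must all be strings, it can help to assemble the launcher command in a small helper and inspect it before submitting a long GPU job. The sketch below mirrors the flags shown in the listing above; the helper name, the module path, and the example values are hypothetical.

```python
from typing import List


def build_deepspeed_command(
    n_gpu: int,
    training_module: str,
    local_output_dir: str,
    batch_size: int,
    learning_rate: float,
) -> List[str]:
    """Assemble the deepspeed launcher arguments as a list of strings."""
    return [
        "deepspeed",
        "--num_gpus=%d" % n_gpu,
        "--module", training_module,
        "--local-output-dir", local_output_dir,
        # Numeric values are converted explicitly; subprocess rejects non-strings.
        "--per-device-train-batch-size", str(batch_size),
        "--per-device-eval-batch-size", str(batch_size),
        "--lr", str(learning_rate),
    ]


# Hypothetical values for illustration only.
cmd = build_deepspeed_command(4, "training.trainer", "/tmp/dolly", 8, 1e-5)
print(" ".join(cmd))
```

Inside the train step, a list like this could then be passed to subprocess.run(cmd, check=True), so that a failed training run also fails the Metaflow task.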

But does it actually work?

When you start running the flow, it will happily start crunching data on a large GPU instance – and go silent for hours as the training is in progress.

Especially with a new experimental project like this, we weren’t sure whether the deepspeed setup was working correctly and utilizing all GPUs efficiently. To address common situations like this, we recently created a Metaflow card for profiling GPU usage. To use it, drop the profiler module next to your flow file and add the following line to your code:

    @gpu_profile()
    @kubernetes(cpu=X, gpu=Y, memory=Z)
    @step
    def train(self):

The Metaflow task GPU profiler shows us things such as which NVIDIA devices are visible on the instance, how they are connected, and most importantly how they are being utilized throughout the lifecycle of the model training job.
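
For a sense of the raw measurements behind such a view, the sketch below polls nvidia-smi once and parses per-GPU utilization and memory. This is not the profiler's implementation, only an illustration of the kind of data involved; the function names are hypothetical, and it assumes nvidia-smi is available on the PATH.

```python
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(csv_text: str) -> List[Dict[str, float]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [float(v.strip()) for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def sample_gpus() -> List[Dict[str, float]]:
    """Take one reading from every visible NVIDIA device."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling sample_gpus() in a loop on the training instance gives a time series of utilization (percent) and memory (MiB) per device, which is the kind of signal that indicates whether all GPUs are busy.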

The GPU profiler automatically logs results to a Metaflow card, so you can organize and version these results with modeling metrics and other experiment tracking data in reports that are conveniently viewable in the Metaflow UI (or in a notebook). Cards are particularly useful during experimentation, as they are permanently attached to the run that produced them, so you can easily observe past performance in the context of automatically versioned code and data associated with the experiment.

Here is an example card showing GPU processor and memory utilization from an epoch of training Dolly:

Seeing these charts gave us enough confidence that the GPUs were performing useful work.

The usual infrastructure headaches

You can run the above Metaflow flow to train Dolly yourself, assuming that you have infrastructure set up for Metaflow. You can reproduce the above results using either AWS Batch or Kubernetes, the latter of which conveniently works with all major clouds. Or, if you would rather avoid infrastructure headaches altogether, you can rely on the Outerbounds Platform.

In our initial tests, we used an AWS p3dn.24xlarge EC2 instance to train Dolly for one epoch. The instance has 8 V100 GPUs, each with 32GB of memory. We used the AWS Deep Learning AMI and were able to train the model for one epoch in 3 hours and 15 minutes, at a cost of roughly $100. Due to the unavailability of p4 instances on AWS, we also ran the same software setup on a CoreWeave node with 3 A100 GPUs, each with 80GB of memory. We trained this model for 10 epochs, which took roughly 30 hours and cost roughly $200.
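
As a quick sanity check on these figures, the implied hourly and per-epoch costs can be worked out from the numbers quoted above. The sketch uses only the rounded costs and durations given in the text, so the results are rough estimates rather than provider list prices.

```python
def hourly_rate(total_cost_usd: float, hours: float) -> float:
    """Implied cost per hour for a training run."""
    return total_cost_usd / hours


def cost_per_epoch(total_cost_usd: float, epochs: int) -> float:
    """Implied cost per training epoch."""
    return total_cost_usd / epochs


# Figures quoted in the text (approximate).
v100_rate = hourly_rate(100, 3.25)      # V100 node: 1 epoch in 3h15m
a100_rate = hourly_rate(200, 30)        # A100 node: 10 epochs in ~30h
v100_per_epoch = cost_per_epoch(100, 1)
a100_per_epoch = cost_per_epoch(200, 10)

print(f"V100 node: ~${v100_rate:.2f}/hour, ~${v100_per_epoch:.0f}/epoch")
print(f"A100 node: ~${a100_rate:.2f}/hour, ~${a100_per_epoch:.0f}/epoch")
```

On these numbers, an epoch took about as long on either setup (roughly 3 hours), while the A100 run came to about a fifth of the per-epoch cost.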

Besides the headaches of finding large enough GPU instances to run the code, it is crucial to manage dependencies carefully, as dependency problems are a silent productivity killer in deep learning. For example, when using NVIDIA GPUs, misalignments between CUDA drivers and CUDA toolkit components such as the NVCC compiler can lead to flow-breaking debugging cycles when working at the data science layer. We found that Dolly worked with the following combinations:

  • Ubuntu 20.04 or 22.04
  • NVIDIA driver version 515.105.01
  • CUDA Version >= 11.1 and <= 11.8
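
A check like the following can fail fast when an environment falls outside the CUDA range listed above. It is a minimal sketch: the function names are hypothetical, and the version string is passed in by the caller rather than detected from the system.

```python
from typing import Tuple

# CUDA range reported to work in the list above (inclusive on both ends).
MIN_CUDA = (11, 1)
MAX_CUDA = (11, 8)


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '11.7' or '11.7.1' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def cuda_is_supported(version: str) -> bool:
    """True if the CUDA version is within the tested range."""
    # Tuples compare element by element, so (11, 7) sits between the bounds.
    return MIN_CUDA <= parse_version(version) <= MAX_CUDA


for candidate in ["11.0", "11.7", "12.0"]:
    print(candidate, "supported" if cuda_is_supported(candidate) else "untested")
```

Running such a check at the top of the train step turns a confusing mid-training failure into an immediate, readable error.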

Thankfully, both the open-source Metaflow and the Outerbounds Platform make it easy to version environments, whether the task requiring the environment runs locally on your workstation or remotely in your compute cluster.


Finally, with all our infrastructure in place, we were able to start exploring the bounds of Dolly’s language and reasoning abilities. Dolly can perform a surprisingly broad set of tasks, including text generation, brainstorming, and open Q&A.

As Databricks notes in the Dolly v1 HuggingFace model card, “the qualitative behavior of Dolly is materially different from the underlying model” despite “Dolly [scoring] only marginally better compared to its base model GPT-J-6B [on common LLM evaluation benchmarks]”. It was instructive to validate Databricks’ claim by observing how Dolly’s responses changed as we asked the model to respond to a common set of instructions at various stages of its training.

For example, we asked the model to “Write a story about a person who cannot see the random forest because they are caught up in the decision trees” to see how it handles creative storytelling-style prompts. Notice how the model seems to get better at interpreting a somewhat open-ended and vague prompt, as it starts to emphasize the “Write a story” part of the instruction in its responses as it trains more and more. 

Epoch | Sample Response
1 | When people think of random forests, they often think of the trees being randomly selected from a dataset and being used to make predictions. What they may not know is that when a random forest is built, each tree is created with a rule to determine the best split for a given feature.
5 | Anil was lost in a thick forest of decision trees. Every step he took felt like he was forcing his way through an endless series of obstacles. He had been walking for hours, and he was still no closer to finding the trail he had been following. His only hope was that eventually he would stumble upon the clearing that had caused his vision to fail him earlier.
10 | John was an engineer who couldn’t get past the first layer of decision trees. He had worked so hard to understand the data, but something was just wrong with the random forest he was seeing. He worked his way deeper and deeper into the forest, but there were just too many branches and too many combinations of variables that he didn’t understand. John was completely lost, unable to progress any further and in need of a boost to get him out. That’s when he discovered the random forest again, and it gave him the inspiration he needed to understand the data.

This example illustrates one of our main qualitative observations: with more training, the model begins to write more verbose and detailed responses, more similar to those of the GPT models many are familiar with from products like ChatGPT. This should come as no surprise, as we already discussed that the Alpaca dataset we are training with was generated by querying a GPT model from OpenAI! This is the distillation process in action.

What next

To train Dolly with Metaflow, you can find our repository here. You can also use the new @gpu_profile decorator in any GPU workflow. We would love to hear your feedback on it, so we can develop the feature further.

If you liked this article, you may enjoy our previous posts related to foundation models: image generation using Stable Diffusion and speech-to-text transcription using Whisper, as well as the infrastructure headaches that you may experience with them.

If you need help in getting started with Metaflow and/or experimenting with foundation models and LLMs, join thousands of other ML engineers, data scientists, and platform engineers in the Metaflow Slack. If you want to experiment with Metaflow and test fun ML use cases without having to install anything locally, sign up for a free Metaflow Sandbox.

PS. To learn more about LLMs in the context of business use cases, join our next fireside chat on May 4th!
