
LLMs, RAG, and Fine-Tuning: A Hands-On Guided Tour

We recently presented a hands-on workshop at the Generative AI Summit in Austin on "LLMs in Practice: A Guide to Recent Trends and Techniques". We heard that many attendees found the session valuable, so we wanted to share it more widely. Grab some 🍿 and enjoy the full two-hour session here:

In case you don't have two hours to spare, this article gives an overview of the content covered, pointing at specific short clips so you can focus on topics that pique your interest. You can code along in the browser using the Metaflow Sandbox!

We start with easy-to-use but generic AI API calls and progress to increasingly sophisticated approaches that allow you to build more unique and delightful AI-powered systems. We cover:

  1. Various proprietary off-the-shelf models and APIs.
  2. Open-source LLMs.
  3. Prompt engineering.
  4. The Retrieval Augmented Generation (RAG) pattern.
  5. Vector databases.
  6. Building a constantly updating, production RAG system with Metaflow.

At the end, we scratch the surface of fine-tuning models, a topic we have covered in a number of earlier posts. Let’s get started!

The Easy Way: Hitting LLM Vendor APIs

First, Set Up Your Sandbox

You can code along by using the Metaflow sandbox and following these instructions:

  1. Create a Metaflow sandbox here
  2. In the terminal, execute cd .. && git clone https://github.com/outerbounds/generative-ai-summit-austin-2023.git and authenticate with GitHub when prompted.
  3. Re-open your sandbox with this link.
  4. When VSCode prompts you to activate a conda environment, choose the pre-selected sandbox-tutorials environment as shown in the picture below.
  5. Follow along with the lessons in the left-hand side navigation. Suggestion: use the terminal instead of notebooks to run the code to see more informative printouts!

As always, if you have trouble setting up the sandbox, join the Metaflow Community Slack and post your question in #ask-metaflow. We are happy to help!

Hitting LLM APIs

This section shows you how to get started with the leading commercial APIs as of October 2023:

  • OpenAI
  • Cohere
  • Jurassic-2 from AI21 Labs
  • Claude from Anthropic

It’s as easy as getting and setting an API key and executing the following code for OpenAI:

import openai

PROMPT = "How is generative AI affecting the infrastructure machine learning developers need access to?"
openai.api_key = ...  # your key

# Request a single-turn chat completion from GPT-3.5
gpt35_completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
)

# Pull the generated text out of the response object
gpt35_text_response = gpt35_completion.to_dict()['choices'][0]['message']['content'].strip()
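
The other vendor APIs follow the same pattern. For instance, here is a minimal sketch using Cohere's Python SDK as of late 2023 (the model name and max_tokens value are illustrative choices):

import cohere

co = cohere.Client(...)  # your key

# Generate a completion for the same prompt via Cohere's generate endpoint
cohere_response = co.generate(model="command", prompt=PROMPT, max_tokens=300)
cohere_text_response = cohere_response.generations[0].text.strip()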

In addition to showing you how to use the APIs, we talk about the relative benefits of the various vendor APIs (although do note that you can coax each of these models into doing almost anything, which muddies any clean comparison):

| Endpoint / API              | OpenAI | Cohere | Claude | AI21 |
|-----------------------------|--------|--------|--------|------|
| Prompt-to-response          | ✅     | ✅     | ✅     | ✅   |
| Chat-to-response            | ✅     | ✅     | ✅     | ✅   |
| Text embeddings             | ✅     | ✅     | ❌     | ✅   |
| Fine-tuning                 | ✅     | ✅     | ❌     | ✅   |
| Language detection          | ❌     | ✅     | ❌     | ❌   |
| Raw document processing     | ✅     | ✅     | ❌     | ✅   |
| Rerank / document relevance | ❌     | ✅     | ❌     | ✅   |
| Text/image to image         | ✅     | ❌     | ❌     | ❌   |
| Audio-to-text               | ✅     | ❌     | ❌     | ❌   |
| Moderations / toxicity      | ✅     | ✅     | ❌     | ❌   |

Working with Open-Source LLMs

In this section, we get started with OSS models:

  • Why would you want to use open-source LLMs?
  • Will they ever really be competitive?
  • What drives the competition if OpenAI's models are 10x bigger and performance keeps scaling with model size?

We then get hands-on and

  • Explore a collection of HuggingFace models,
  • Load a pre-trained model from the HuggingFace Hub, and
  • Use these models for text classification and text generation, which mirrors the core mechanism behind the commercial APIs you saw above (see the sketch below).
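
For a flavor of what this looks like, here is a minimal sketch with the transformers library; the two model names are example choices from the Hub, not the only options:

from transformers import pipeline

# Text classification with a small pre-trained sentiment model
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Metaflow makes ML infrastructure pleasant to work with."))

# Text generation with GPT-2: the same next-token prediction mechanism
# that powers the commercial APIs above, just at a much smaller scale
generator = pipeline("text-generation", model="gpt2")
print(generator("Generative AI is changing ML infrastructure by",
                max_new_tokens=40)[0]["generated_text"])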

Prompt Engineering

As a quick follow-up, we explore ways to make LLM responses more relevant to end users through prompt engineering.

To set the stage, consider the analog in traditional search engines: Google provides a variety of ways to make queries more precise. LLMs are similar in that the specific way you write a prompt shapes the behavior of LLM APIs and the products built on top of them. We cover:

  • how to modify your ChatGPT interface to get it to do more of what you want,
  • how to implement the same approach programmatically with a trending framework called LangChain (sketched below), and
  • a basic introduction to different fields of research around prompting, centered on chain-of-thought reasoning.
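
As a rough sketch of the programmatic approach, here is what a simple LangChain chain looked like with the 2023-era API (the template text is our own example):

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A reusable prompt template; the chain fills in the placeholder at run time
template = PromptTemplate(
    input_variables=["question"],
    template=("You are an expert in ML infrastructure. Answer concisely.\n\n"
              "Question: {question}\nAnswer:"),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=template)
print(chain.run(question="How does retrieval augmentation make LLM responses more relevant?"))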

Better Relevancy with RAG

In this section, we jump into Retrieval Augmented Generation (RAG) and extend the prompt engineering techniques above by

  • searching through your own data,
  • returning the most relevant results using a vector database, and
  • conditioning LLM continuations with these relevant facts.

Using the Metaflow documentation as an illustrative example, we show how you can use your own unstructured data, like blog posts, documentation, and video collections, to inform LLMs using the following recipe (sketched in code after the list):

  • Chunk the unstructured data
  • Compute embeddings on the chunks using a model
  • Index the embeddings
  • Based on user queries, run vector similarity searches against the embeddings
  • Return the top K most similar vectors
  • Decode the vectors into the original data format
  • Use the "similar" data in prompts
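
Here is a condensed sketch of this recipe, using the sentence-transformers library for embeddings and a brute-force cosine-similarity search standing in for a vector database (the chunk size, model name, and example documents are all illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Chunk the unstructured data (naive fixed-size chunks for illustration)
docs = ["Metaflow is a human-friendly framework for real-life ML...", "..."]
chunks = [d[i:i + 500] for d in docs for i in range(0, len(d), 500)]

# 2.-3. Compute embeddings on the chunks and keep them as the index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# 4.-6. Embed the query and return the top-K most similar chunks;
# with normalized vectors, cosine similarity is just a dot product
def top_k(query, k=3):
    q = model.encode([query], normalize_embeddings=True)[0]
    return [chunks[i] for i in np.argsort(-(embeddings @ q))[:k]]

# 7. Splice the "similar" data into the prompt
context = "\n".join(top_k("How do I schedule a Metaflow flow?"))
prompt = f"Answer using this context:\n{context}\n\nQuestion: How do I schedule a flow?"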

Building Live Production Systems with RAG

Many RAG examples and demos you can find online show RAG in isolation. In the real world, these systems need to be connected to the surrounding infrastructure and data, running 24/7.

In this section, we show how you can build a production-ready RAG system using Metaflow. The system reacts to new data, preprocesses it, produces embeddings in parallel, and updates a vector database:
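
The details are in the clip, but the skeleton of such a flow might look like this in Metaflow (the step names, hourly schedule, and helper functions are hypothetical placeholders):

from metaflow import FlowSpec, schedule, step

@schedule(hourly=True)  # re-run regularly so the index tracks new data
class RagIndexFlow(FlowSpec):

    @step
    def start(self):
        # Load and chunk whatever documents arrived since the last run
        self.batches = load_and_chunk_new_documents()  # hypothetical helper
        self.next(self.embed, foreach="batches")

    @step
    def embed(self):
        # Each parallel branch embeds one batch of chunks
        self.embeddings = compute_embeddings(self.input)  # hypothetical helper
        self.next(self.join)

    @step
    def join(self, inputs):
        # Gather the parallel branches and update the vector database
        vectors = [v for inp in inputs for v in inp.embeddings]
        update_vector_db(vectors)  # hypothetical helper
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    RagIndexFlow()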

For more details about building AI-powered systems, take a look at our earlier posts about an end-to-end RAG system and the big picture of the ML stack for LLMs.

Fine-Tuning Custom LLMs

Finally, we give a sneak peek at how you can fine-tune a custom LLM, with the goal of getting more relevant responses. We show you how to

  • Fine-tune a 7-billion-parameter Llama2 model (and, yes, we explain what all of these things mean!), and
  • Do so efficiently using Quantized Low-Rank Adaptation (QLoRA, which we also explain!); see the configuration sketch after this list.
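
To give a flavor of the moving pieces, here is a minimal QLoRA setup sketch with the transformers and peft libraries (the hyperparameters and target modules are illustrative, not the workshop's exact configuration):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the base model to 4 bits so the 7B model fits on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train small low-rank adapter matrices instead of the full weights
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the 7B weights train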

We also provide insight into how to think through hardware considerations. The free sandbox environment is not powerful enough to run fine-tuning, but you can refer to our previous articles, such as this starter template and this more advanced example, to learn how you can do this with Metaflow and the Outerbounds Platform.


If you want to start building production systems like this today, we can get you started with a complete platform for AI and ML in 15 minutes!
