Large Language Models, Business Value, and Risk

We recently had a fireside chat with Federico Bianchi (Stanford) about where the field of large language models is today and what we can expect to see in terms of business value, once the hype dies down.

We covered a lot of ground so we wanted to share some highlights here. The following are rough transcripts edited for legibility. We’ve included clips of the relevant parts of the conversation and, if you’re interested, you can watch the entire conversation here:

What even are Large Language Models?

Hugo: What are large language models and what are they capable of?

Federico: There are different kinds of language models: we have large and even larger language models. The easiest way to explain a language model is as a tool that does next-word prediction. Given a sentence, or even a question, the model adds new words that are predicted on the basis of the text you started with. So the model outputs the most probable continuation of the sentence, or of the question. This is a very simple process, but you can do many things with this kind of next-word prediction task.

You can think, if you ever tried one of these Chat-GPTs, for example, the idea behind it is that when you ask a question, the model comes out with new words that make sense. And people have built tons of cool things. I think there are many applications of this kind of technology… question answering is, indeed, a very cool application: you can use this kind of tool to ask questions and get answers on many different topics, and you can use these models as sparring partners to learn about new things.

So you can use them to prepare for interviews or to discover new topics. But there are also different kinds of applications. For example, one large language model that I use a lot is GitHub Copilot, which does next-word prediction for, let’s say, Python code, at least for me, so it helps you to write code. It actually increases my productivity a lot.

Hugo: It was nice that you talked through why we can get things like Chat-GPT because, if you have next-word prediction, it isn't initially obvious why you can then have Question and Answer pairs, right?

Federico: Yeah. It requires a bit of thinking to understand why this next-word prediction thing is the foundational part of the question-and-answer pipeline. But asking a model to write emails, or asking a model to summarize documents, is still this kind of next-word prediction task. There is obviously some light modification to the training pipeline of these models, but the basic component is still next-word prediction.
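To make the link between next-word prediction and question answering concrete, here is a minimal sketch. It is not how Chat-GPT works internally: it is a toy count-based model with a three-word context, and the corpus, the `<end>` marker, and all function names are assumptions made up for illustration. The point is only that "answering" falls out of continuing the text one word at a time.

```python
from collections import Counter, defaultdict

def train(corpus, max_context=3):
    """Count which word follows each context of 1..max_context words."""
    counts = defaultdict(Counter)
    for text in corpus:
        words = text.split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n >= 0:
                    counts[tuple(words[i - n:i])][words[i]] += 1
    return counts

def predict_next(counts, words, max_context=3):
    """Return the most frequent continuation, backing off to shorter contexts."""
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # nothing in the corpus ever followed this context

def generate(counts, prompt, max_new_words=3, stop="<end>"):
    """Append predicted words one at a time until the stop marker appears."""
    words = prompt.split()
    for _ in range(max_new_words):
        nxt = predict_next(counts, words)
        if nxt is None or nxt == stop:
            break
        words.append(nxt)
    return " ".join(words)

# Toy training text written as question/answer pairs (made up for this sketch).
corpus = [
    "Q: capital of Italy ? A: Rome <end>",
    "Q: capital of France ? A: Paris <end>",
]
counts = train(corpus)
answer = generate(counts, "Q: capital of Italy ? A:")
print(answer)
```

A real LLM replaces the count table with a neural network and a far longer context, but the generation loop has the same shape: predict a word, append it, repeat.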

Hugo: Digging a bit deeper, maybe you could say a bit about what's happening in the backend. I mean, we have the neural networks and deep learning of a particular form, we hear the term transformer kind of thrown around…

Federico: Of course. The transformer is an architecture that was introduced in 2017, originally for machine translation. This was a broad shift in the NLP community. The general idea is that the transformer takes input sentences and, through the use of self-attention, a component that helps the model look at all the different words in a sentence across many layers, the model can build up a kind of understanding of how all the words relate to each other.

Through this kind of pipeline, the model looks at the different words in the sentence to understand what comes next. It's a very efficient computational unit that is very useful for this kind of task. I think one of the biggest contributions of the transformer architecture is its efficiency in doing computation. At least, this is the most valuable contribution to the field: we had LSTMs before, towards the late 2010s, but they were limited by the recurrent, rolling nature of the network. Because they were sequential, the words in a sentence could not be processed in parallel. So transformers bring an efficiency to the computation over sentences that completely changed the way we do NLP.
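As a rough illustration of the self-attention idea described above, here is a minimal sketch of scaled dot-product attention in plain Python. It makes a large simplifying assumption: a real transformer applies learned query, key, and value projections and uses several attention heads and layers, whereas this sketch uses the word vectors directly. The toy vectors and function names are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (a simplification)."""
    d = len(vectors[0])
    outputs = []
    # Each position only reads the inputs, never another position's output,
    # so every iteration of this loop could run at the same time.
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(mixed)
    return outputs

# Three toy 2-dimensional "word vectors" (values chosen arbitrarily).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = self_attention(sentence)
for row in contextualized:
    print([round(x, 3) for x in row])
```

The contrast with an LSTM is in the loop: a recurrent network needs the hidden state from the previous word before it can process the next one, while here no position depends on another position's result.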

Managing Expectations of LLMs

Hugo: What aren't LLMs capable of?

Federico: They cannot be trusted, in general, for the answers they give. So there is a concern regarding how factual the information they generate is. For example, when people first asked Chat-GPT for references about topics they were interested in, Chat-GPT started making up references. That's a huge problem, because it gives a sense of confidence in the answer by saying you can look up this reference to trust what I just said, but that reference does not exist. So there are issues with the factuality of what the model says, and that's a big issue for many use cases.

If you can't be sure about the output of the model, you cannot trust it in most settings. Complex reasoning tasks might also still be out of reach. We will need to see a few more things when the entire GPT-4 is released, for example the version that also includes the multimodal part, which is not yet out. Some things are also unclear, at least for GPT. There has been some concern about the actual evaluation: GPT was published with some evaluation of, for example, its ability to solve coding problems, but there have been concerns that the model was trained on the test set that was used to evaluate it.

So it's sometimes not clear how much we can trust the evaluation. There might be some concerns about how well it does on specific tasks. So the biggest question I'm getting is: how do we deal with failure cases? The other limit, which is not really an issue of capability but something we are probably going to talk about, is that there are cases in which you cannot consider the language model safe. People have found ways to hack into the language model and make it output things that the original developers tried to prevent, such as toxicity.

Hugo: So just to recap: generating factual information is incredibly difficult. Complex reasoning is tough. How to handle failure cases is an open question, and we shouldn't consider these models safe… there's the serious issue of toxicity.

How to think about the space of LLMs, given there are so many!

Hugo: We've been talking about Chat-GPT but there are so many other LLMs out there. Can you give us a rundown of the space of LLMs and how to think about it?

Federico: The first thing to say is that most of these things came out in the last three months.

Hugo: That's one of the wildest parts of it, right?

Federico: I'm also not able to keep up with all these things. It's very difficult to keep track. That is why I think we need to wait a couple of months, probably, to understand what is going to stick. There are different ways to think about the entire space. One is in terms of open and closed models. Chat-GPT, for example, is a closed model: you cannot access the model itself, you need to call it through OpenAI's API. That might be something you don't want to do, but it’s high quality, so that's definitely an advantage of the model. Right now we also have open source alternatives.

Things took off around December 2022. There has been a wave, a very strong wave, of competitors to Chat-GPT that are indeed open source. There has been a lot of work on building alternative language models. For example, one is Llama, the language model that was trained by Meta a couple of months ago. I don't remember exactly when it came out. But people have started building Chat-GPT-like models on top of Llama, and right now we have, I think, chat models that are based on that. We have GPT-J, which is a bit older, I think a couple of years older, and was built by EleutherAI… and again, it's open source. And the good thing about this model is that it also has a commercial license.

So yes, it can be used by people. Dolly, which is a very recent model, was built on GPT-J and fine-tuned on instructions. What does that mean? It means that when you train a large language model on just the next-word prediction task, you end up with something that is not as useful as it could be. What you want is a model that behaves well when you give it an instruction. So a model like Dolly, or any instruction-tuned model, is fine-tuned using instructions that look like questions and answers. For example, a training sample for this model might be "summarize this text" followed by the text, and the output is the summary. Given some examples of this kind of task, the model learns to do the task much better. So when you actually deploy it, it works very well.
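The shape of an instruction-tuning sample can be sketched as follows. The `### Instruction` / `### Input` / `### Response` template and the `format_example` helper are assumptions for illustration, not the format Dolly or any specific model actually uses. The sketch only shows how an instruction, an optional text, and the desired output are turned into a prompt and a completion, which the model then learns through ordinary next-word prediction.

```python
def format_example(instruction, response, context=""):
    """Turn one instruction/response pair into a prompt and a target completion.

    The section markers below are a hypothetical template chosen for this sketch.
    """
    prompt = f"### Instruction:\n{instruction}\n"
    if context:
        prompt += f"\n### Input:\n{context}\n"
    prompt += "\n### Response:\n"
    return {"prompt": prompt, "completion": response}

# A made-up summarization sample: instruction, text to summarize, expected summary.
sample = format_example(
    instruction="Summarize this text.",
    context="The transformer was introduced in 2017 for machine translation "
            "and later became the basis of large language models.",
    response="Transformers, introduced in 2017, underpin today's language models.",
)

# During fine-tuning the model sees prompt + completion as one sequence and is
# trained to predict the completion words that follow the prompt.
training_text = sample["prompt"] + sample["completion"]
print(training_text)
```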

How to Get Started with LLMs Today

Hugo: So I presume people know how to go to OpenAI and start interacting with Chat-GPT, right? But I'm wondering, for programmers and scientists and data scientists and ML people who maybe haven't had a lot of experience with it, how can they get started with playing around with large language models today?

Federico: There are all kinds of different levels of abstraction, in terms of how much code you're willing to write and how much time you have to put into understanding what's going on behind the scenes, and depending on that there are different methods to interact with a language model. There are also different kinds of pipelines to train a language model if you have data. I think the main resource, if you want to understand a bit of what's going on without going too deep into all the difficulties, is Hugging Face, where you can go through fine-tuning a language model on a supervised task, for example classification. That's going to give you a very good intro on how to make small language models do things on your own laptop.

And that could already be good enough for many different tasks. It's very controllable: you can do whatever you want with it. And most of the time it can run on consumer hardware. It might be slow, but at least it's enough to start playing with it.

Right now, with all these huge developments, there are a lot of packages coming out that wrap APIs for you. For example, there is LangChain, which is a kind of abstraction that lets you seamlessly use multiple language models without having to delve into different kinds of setups and systems. So LangChain is another thing that I think is very interesting to look at. It's a good engineering effort to make language models accessible to a broader audience. And finally, if you want to write some code that mostly calls an API, I think that using Cohere’s or OpenAI’s APIs is still an effective way to work with language models.

You can prompt the model with questions and ask for clarification. There is an entire field called prompt engineering, which is the art of asking questions to language models, and it is becoming more and more relevant. There are courses coming up that have been advertised everywhere on most social media platforms. And this is the direction of talking directly with the model using, for example, APIs.

Productionizing LLMs and incorporating them into pre-existing software stacks

Hugo: How do you actually productionize LLMs? And how do you incorporate the wisdom of LLMs into existing software stacks?

Federico: Yeah, I think currently the best way to think about this is not to use LLMs as core components of deployments, as the field is moving so fast. We just cannot always trust these pipelines, and even as researchers we are struggling to understand how they work. What I think is really useful is to use them as analytics tools for your data, or as classification tools. For example, if you have a conversational agent, you could use one of these large language models for intent classification. You can do predictive analytics, you can do many things. I think you can replace many steps of a pipeline with a language model… then you can use them for control flow, you can use them to do many things.
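One way to use an LLM as a bounded classification step rather than a core component is sketched below. Everything here is an assumption for illustration: the intent labels, the prompt wording, and `fake_model`, which stands in for whatever API or local model you would actually call. The idea is that the rest of the pipeline only ever sees one of a fixed set of labels, with a fallback when the model's reply does not match any of them.

```python
ALLOWED_INTENTS = ("refund", "shipping", "other")

def build_prompt(message):
    """Ask for exactly one label from a fixed list (prompt wording is illustrative)."""
    labels = ", ".join(ALLOWED_INTENTS)
    return (
        f"Classify the customer message into one of: {labels}.\n"
        f"Message: {message}\n"
        "Intent:"
    )

def classify_intent(message, call_model):
    """Call the model, then accept its reply only if it is a known label."""
    raw_reply = call_model(build_prompt(message))
    label = raw_reply.strip().lower().rstrip(".")
    return label if label in ALLOWED_INTENTS else "other"

def fake_model(prompt):
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    if "money back" in prompt:
        return " Refund."
    return "I am not sure what the customer wants."

print(classify_intent("I want my money back", fake_model))
print(classify_intent("Hello there", fake_model))
```

Because `classify_intent` takes the model call as a parameter, the same guard can wrap a hosted API or a small local model, and the control flow downstream can branch on the returned label.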

One thing that we are probably going to talk about next, but that I think is very important and want to mention now, is that we should treat LLMs as we treat software, in the sense that we have QA to test, or stress test, software, and we should do the same with language models. So there should be someone on the team who tries to break the part of the pipeline we built with LLMs, because we know that there might be an issue. And while software is often rule-based, so we can come up with reasonable flows or things that can happen, this is not true of language models, which are by nature stochastic, so they can come up with different replies.
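A minimal sketch of what stress testing a stochastic component could look like: send the same prompt many times and measure how often the replies agree. The `flaky_model` function is a made-up stand-in that simulates a model giving differently worded replies; in practice you would replace it with your real model call. The label set and the agreement metric are also assumptions chosen for the example.

```python
import random
from collections import Counter

ALLOWED_LABELS = {"refund", "shipping", "other"}

def flaky_model(prompt, rng):
    """Hypothetical stand-in for a stochastic LLM: same prompt, varying replies."""
    return rng.choice([
        "refund",
        "Refund.",
        " refund\n",
        "Sure! I think this is a refund request.",
    ])

def normalize(raw_reply):
    """Map a raw reply onto a known label, falling back to 'other'."""
    label = raw_reply.strip().lower().rstrip(".")
    return label if label in ALLOWED_LABELS else "other"

def stress_test(prompt, runs=200, seed=0):
    """Repeat the same request and report how consistent the labels are."""
    rng = random.Random(seed)  # seeded so the test run itself is reproducible
    labels = Counter(normalize(flaky_model(prompt, rng)) for _ in range(runs))
    majority_label, majority_count = labels.most_common(1)[0]
    return {
        "labels": dict(labels),
        "majority": majority_label,
        "agreement": majority_count / runs,
    }

report = stress_test("Classify: 'I want my money back'")
print(report)
```

A team could track the agreement figure over time and set a threshold below which the LLM-backed step is considered broken, in the same way a failing unit test blocks a release.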

And this kind of product evolves over time. The original Chat-GPT is not the Chat-GPT we have today: if you use the original Chat-GPT API and you update it, you might not get the same results. So that could be an app-breaking update in your current stack.

Generating Business Value with LLMs

Hugo: I am interested in thinking through… at least initial thoughts on how LLMs can go beyond proof of concept and be used to create sustainable business value. Do you have any thoughts on that?

Federico: I think we need to wait a bit to see what’s going to come up in the next couple of months. What I see are cool applications, obviously, but we don't know how well they are going to generalize. I still think that some of these things are already useful, like a large language model for classification. It's already very possible that they can improve a lot of aspects of your conversational pipeline, for example. That's definitely something. Where I'm not sure what's going to come next is more on the front end: a conversational AI product by itself, backed by Chat-GPT. That is something where I still have some reservations about saying it's going to be the next application, because there is no regulation… and there are a lot of issues with respect to safety.

We are also building smaller-scale models that everybody can now use and implement. The problem a couple of months ago was that the only thing out there was OpenAI. It was big, you had to pay for it, and you could not use it on your own platform. What we are seeing in the last couple of weeks and months is that a lot of models can run on consumer hardware and be loaded into your own software… There is obviously an analysis to do of the costs of running this kind of platform. But still, I think we are also moving in a direction in which you can build your own small LLM-based business platform that you can use to generate value.

Join us for more chats!

After our fireside chats, we have async AMAs with our guests on our community slack. A lot tends to happen there so join us if you’re interested in such conversations!

You can also view our other fireside chats here.

