We recently had a fireside chat with Shreya Shankar (UC Berkeley) about Operationalizing Machine Learning: Patterns and Pain Points from MLOps Practitioners. The conversation revolved around Shreya’s team’s recent paper Operationalizing Machine Learning: An Interview Study, and what they discovered about the common practices & challenges across organizations & applications in ML engineering.
We covered a lot of ground so we wanted to share some highlights here. The following are rough transcripts edited for legibility. We’ve included clips of the relevant parts of the conversation and, if you’re interested, you can watch the entire conversation here:
The 4 main tasks in the production ML lifecycle
Hugo: What are the main tasks you discovered that people do in the production machine learning lifecycle?
Shreya: I think a good answer to this is what the textbook says, and then what we found that differs from the textbook. What the textbook will tell you for machine learning is: first you collect data (step one), then you train a model (step two). Step three is to validate that model on a holdout dataset and ensure there’s no overfitting. And then step four is you deploy. What we found is that we can still categorize it into four steps, and maybe the data collection part is similar, except that it’s more of a loop: every week or so, say, we want to collect new data.
But the last three steps are totally different. The second step, what I called model training before, is actually experimentation. In general, whether it’s training new models, sourcing new data, or adding new features, there are a lot of ways you can think about improving a model. And a lot of the participants actually preferred looking for new data that gave a new signal, or making features fresher rather than relying on the stale features they had before. So that’s step two. Stage three in the process we call evaluation and deployment. Evaluation is not a one-and-done thing. What happens is that evaluation is done on a holdout dataset first, and then the model is deployed to a small fraction of users. When the model shows a little promise, it’s deployed to more and more users. As we learn what it can do, what it can’t do, what failure modes exist, and how to catch problems, the rollout continues until we’ve gotten to the full population.
So the key takeaway is that evaluation is not a one-time thing; it is a loop of evaluation and multistage deployment. And then the last step, step four, which we found was this overall monitoring and response stage: once you do have these models in production, what is their live performance? If you see the performance dropping, what are the bugs, and where are they? How do we respond to them quickly, whether that means actually doing root cause analysis or simply retraining the model? There is a whole stage around making sure there’s little downtime for these services. We show those four stages in the first figure of the paper, and it was interesting to several of us authors that they don’t match the textbook. I think that’s the main thing we want people to take away.
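To make that evaluation-and-deployment loop concrete, here is a minimal sketch of what a staged rollout with monitoring might look like. Everything in it, the rollout fractions, thresholds, and stand-in scoring functions, is hypothetical and only illustrates the loop Shreya describes, not anything from the paper or a specific tool:

```python
import random

# A minimal, hypothetical sketch of the evaluate -> deploy -> monitor loop.
# The rollout fractions, thresholds, and simulated scores are placeholders;
# a real system would call its own evaluation, deployment, and monitoring code.

ROLLOUT_FRACTIONS = [0.01, 0.05, 0.25, 1.0]  # progressively larger slices of users
OFFLINE_THRESHOLD = 0.80                     # minimum holdout score to proceed
LIVE_THRESHOLD = 0.75                        # minimum live score to keep expanding


def evaluate_offline(model) -> float:
    """Stand-in for evaluation on a holdout dataset; returns a score in [0, 1]."""
    return random.uniform(0.7, 0.95)


def live_score(model, fraction: float) -> float:
    """Stand-in for live monitoring of the user slice currently seeing the model."""
    return random.uniform(0.7, 0.95)


def staged_rollout(model) -> str:
    # Stage 3a: offline evaluation on a holdout set is only the first gate.
    if evaluate_offline(model) < OFFLINE_THRESHOLD:
        return "rejected offline"

    # Stage 3b: multistage deployment; expand only while live metrics hold up.
    for fraction in ROLLOUT_FRACTIONS:
        score = live_score(model, fraction)
        if score < LIVE_THRESHOLD:
            # Stage 4: monitoring catches the drop; respond (roll back, debug, retrain).
            return f"rolled back at {fraction:.0%} of users (score={score:.2f})"
    return "fully deployed"


print(staged_rollout("my-model-v2"))
```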
The 3 factors that determine the success of ML projects
Hugo: So something I’m really excited about is that you’ve identified three key properties of ML workloads and infrastructure that dictate how successful deployments are. What are they, and why do they matter?
Shreya: We have these three V’s, and they’re not related to the big data three V’s; we didn’t even have those in mind. Our three V’s are velocity, validation, and versioning. Why did we come up with them? We wanted some way to explain best practices and pain points as we looked for patterns in what our interviewees said. We asked very open-ended questions (they’re in the appendix), like “tell me about a bug you had last week.” Things that open-ended are really hard to extract patterns from. So it helped us to come up with these variables. Take velocity: people kept mentioning that they needed to iterate quickly on experiments, because they had a large frontier of ideas to try and they wanted to find something that would pay off in production.
At the end of the day, we realized: when people are doing experimentation, they care about velocity. And it really resonated with us when we started thinking about MLOps tools. What makes an MLOps tool successful? Well, experiment tracking is a nice space because it really 10xes your experimentation velocity. Now I don’t have to go copy and paste into Google Sheets. Maybe that works if I’m the only person working on my model, but the moment multiple people are working on an ML pipeline or model system, it’s super nice to centralize all of the experimentation we do so we can share the knowledge we’ve gained. So we had velocity for that. For validating early, a lot of people complained that in their organization either too many bad models made it to production, which meant they were validating too late, or models were validated so aggressively and so early that they couldn’t get anything to production. For one example, at an autonomous vehicle company, the cost of deploying a bad model is so high that they incorporated all these checks; they made evaluation take much longer, they decreased the velocity, and engineers were grumpy.
But at the end of the day, there’s a quote in the paper that says, you know, we’d much rather give up the velocity if it means we don’t get failures on the road. So again, different tasks have different priorities, and I think that’s also why people keep saying that machine learning isn’t generalizable, that it’s so different for different tasks. When you think about it through the lens of these V’s, it makes total sense: different tasks just have different priorities. Some people prioritize velocity over validation when the stakes of a failure aren’t so bad, for sure. In that sense, we really liked this framework for evaluating tools and what people care about, as people who like to build tools ourselves. There are some cool ideas I’ve had where I can now confidently say: this is really not a 10x improvement in people’s workflows, it doesn’t help their velocity, it doesn’t help them validate better, and it doesn’t help them manage any more versions, so why bother? I really liked that way of thinking about it.
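As a rough illustration of the velocity point, here is a minimal sketch of the “centralize your runs instead of copy-pasting into Google Sheets” idea. The file path, parameters, and metrics below are invented for the sketch; real experiment trackers do far more, but the core velocity win is that every run lands in one shared, queryable place:

```python
import csv
import datetime
from pathlib import Path

# A minimal, hypothetical sketch of centralizing experiment runs instead of
# copy-pasting results into a spreadsheet. The file path, parameters, and
# metrics are invented; real trackers add artifacts, lineage, and a UI.

LOG_PATH = Path("experiments.csv")  # in practice: a shared store, not a local file


def log_run(params: dict, metrics: dict) -> None:
    """Append one run (config + results) to the shared log."""
    row = {
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        **params,
        **metrics,
    }
    write_header = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Two runs, now comparable by anyone on the team rather than living in one
# person's private spreadsheet.
log_run({"model": "gbt", "learning_rate": 0.1}, {"val_auc": 0.81})
log_run({"model": "gbt", "learning_rate": 0.05}, {"val_auc": 0.83})
```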
Why data scientists love and hate notebooks: velocity and validation
Hugo: This is incredibly useful, I mean, this framework. So what I’m hearing is that, in this framework of velocity, validation, and versioning, we can look at people who prefer Jupyter notebooks and say they’re essentially prioritizing velocity. Whereas people who are strongly opinionated against Jupyter notebooks are prioritizing validation and versioning, or mostly validation?
Shreya: I think of it more as validation, because it’s about how you make sure that development and production environments are as similar as possible. You can remove the need to validate a lot when promoting from dev to prod if there is no real environment change between dev and prod. One great example: sometimes people will iterate locally and then deploy to the prod service on the cloud. That is a huge environment mismatch! So you need to do some sort of big validation. I don’t think people have really solved this problem of making sure there aren’t bugs introduced by this mismatch of environments.
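One very small example of what that “big validation” might include is checking that production inputs still match the schema the model saw during local development. The schema, field names, and sample row below are made up for illustration; this is a sketch of the idea, not a specific tool’s API:

```python
# A minimal, hypothetical check for one kind of dev/prod mismatch: do production
# inputs still match the schema the model saw during local development?
# The schema, field names, and sample row are invented for illustration.

DEV_SCHEMA = {"amount": float, "merchant_id": str, "country": str}


def validate_prod_batch(rows: list[dict]) -> list[str]:
    """Return descriptions of mismatches; an empty list means the batch looks safe."""
    problems = []
    for i, row in enumerate(rows):
        missing = DEV_SCHEMA.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        for col, expected_type in DEV_SCHEMA.items():
            if col in row and not isinstance(row[col], expected_type):
                problems.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, expected {expected_type.__name__}"
                )
    return problems


# Example: the cloud service sends amounts as strings, which local dev never did.
prod_sample = [{"amount": "49.99", "merchant_id": "m_123", "country": "US"}]
print(validate_prod_batch(prod_sample))  # ['row 0: amount is str, expected float']
```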
Is the premise of data-centric AI flawed?
Hugo: Could you speak to what you’ve seen with respect to the differences between the data-centric machine learning paradigm, and the model-centric, and how this comes back to your point about what’s taught in textbooks and courses?
Shreya: Courses are now talking about data-centric AI, but the premise is flawed. It’s the same premise as model-centric AI: by hook or by crook, we will edit model hyperparameters until we get something that works on a small validation set. People bring the same ethos to data-centric AI: by hook or by crook, we will add three or four examples, or remove six of these labels, or clean twelve of these labels, and we will get 1% or 5% better performance on the validation set. It’s the same ethos; maybe it’s just easier to act on in the data-centric setting than in the model-centric one. But we found in the interview study that this is not at all the way to go: you want a win that lasts beyond the initial validation. We talk about this in the section on experimentation. You want to find ideas that lead to huge gains in the first offline validation, because there are diminishing returns in successive stages of deployment: a change that looks like a 15% boost in offline validation might be more like 5% by the third stage of deployment, and only half a percent later on. Once you account for those diminishing returns, it changes the way you think about your experiments: what can I do to bring long-term gains? It’s not about editing the exact view of the data I’m training my model on. I want to add a new signal; I want to go find a new dataset that adds a new signal to the model. I want to fix engineering problems. I want to add data validation so I don’t train or retrain on corrupted data. These are the big wins that give you the long-term boost, which is why I hesitate to preach about data-centric AI in the way that it’s taught.
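To give a flavor of the “data validation so I don’t retrain on corrupted data” idea, here is a minimal, hypothetical gate you might run before a scheduled retrain. The specific checks, thresholds, and field names are invented for the sketch; a real pipeline would use a proper data validation library and checks tuned to its data:

```python
# A minimal, hypothetical retraining gate: block the retrain if the fresh
# training data looks corrupted. All thresholds and field names are invented.

EXPECTATIONS = {
    "max_null_rate": 0.05,              # no feature should be more than 5% null
    "label_rate_range": (0.01, 0.20),   # expected share of positive labels
}


def ok_to_retrain(features: list[dict], labels: list[int]) -> bool:
    n = len(labels)
    if n == 0:
        return False  # nothing to train on

    # Check null rates per feature column (e.g. a broken upstream join).
    for col in features[0].keys():
        null_rate = sum(1 for row in features if row.get(col) is None) / n
        if null_rate > EXPECTATIONS["max_null_rate"]:
            print(f"blocking retrain: {col} is {null_rate:.0%} null")
            return False

    # Check the label distribution hasn't drifted wildly.
    positive_rate = sum(labels) / n
    low, high = EXPECTATIONS["label_rate_range"]
    if not low <= positive_rate <= high:
        print(f"blocking retrain: positive rate {positive_rate:.1%} is outside [{low:.0%}, {high:.0%}]")
        return False

    return True


# Example: half the labels are positive, which is far outside expectations.
print(ok_to_retrain([{"amount": 10.0}, {"amount": 12.0}], [1, 0]))
```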
ML engineering vs traditional software engineering: similarities and differences
Hugo: Something we’ve been dancing around is software engineering, classical, traditional software engineering, and machine learning engineering. Are they the same or different?
Shreya: Okay, so I guess this is a nice preview for my normconf talk: are all of my machine learning problems really data management problems? I think so, and maybe that’s how I feel about this kind of engineering in general. I don’t think it’s software engineering per se that is the skill set an ML engineer needs to have if they want to be 10x, to use the stupid term. I think it’s an understanding of how data works. What is a data pipeline? What is a table? What is a relation? And what is a view? That’s a nice one; a lot of people don’t even know about views. A view is: I run some query on a dataset and store it as a view, which is either materialized before I query it or materialized as I query it. So there’s a question of when to materialize the view. This is the same problem in machine learning: if you think of a machine learning model as a view over the underlying training data, then when I train the model, that’s when the view is materialized. So all of the problems around view staleness carry over: there’s model staleness. We don’t want to compute the view on wrong data; we don’t want to train the model on incorrect data. These are problems we’ve talked about time and time again in databases, and they’re showing up in the ML world. So in that sense, I think ML engineering really is just recast data management problems.
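Here’s one way to picture the model-as-view analogy in code. The tiny fraud table, the aggregate “view”, and the stand-in “model” below are all made up for illustration (the “model” is just a global rate, not a real trained model); the point is only that both artifacts are computed from a snapshot of the base relation and go stale when it changes:

```python
import pandas as pd

# A tiny, hypothetical illustration of "a model is a view over its training data".
# Both the aggregate table (a materialized view) and the fitted quantity (standing
# in for a trained model) are computed from a snapshot of the base relation; when
# the base relation changes, both are stale until re-materialized / retrained.

transactions = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2"],
    "amount": [10.0, 20.0, 5.0],
    "is_fraud": [0, 1, 0],
})

# A materialized view: precomputed query results over the base relation.
fraud_rate_by_merchant = transactions.groupby("merchant_id")["is_fraud"].mean()

# A "model" (here just the global fraud rate, standing in for real training):
# also a compressed representation derived from the same snapshot.
global_fraud_rate = transactions["is_fraud"].mean()

# New rows arrive: the base relation changes, so both derived artifacts are now
# stale until we refresh the view and retrain the model.
new_rows = pd.DataFrame({"merchant_id": ["m2"], "amount": [500.0], "is_fraud": [1]})
transactions = pd.concat([transactions, new_rows], ignore_index=True)
print(fraud_rate_by_merchant["m2"], transactions["is_fraud"].mean())  # stale vs. fresh
```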
If you can reason about data, you can reason about machine learning
After our fireside chats, we have async AMAs with our guests on our community Slack. A lot tends to happen there, so join us if you’re interested in such conversations!
This week, for example, Hugo asked:
At this point in our chat, you said something I haven’t stopped thinking about since: “my argument is that you definitely know how to reason about machine learning if you know how to reason about data.” What does reasoning about data and ML actually look like, and what are the top 3-5 things to keep in mind when reasoning about both?
And Shreya responded as follows:
Love this & am still working on refining this argument. Here’s the outline:
- end users interact with production ML systems in the form of queries over some existing relation(s), e.g., is this transaction fraudulent?
- in database speak, a view is a query over existing relation(s) that you want to materialize, so it’s easier to return results to the end user
- so ML models can be thought of as views, accessible and/or compressed representations of a relation for easy querying in the future
- Classic problems in view maintenance involve staleness and correctness. How do I efficiently update my view when the base relation(s) update? How do I choose which views to materialize? How do I rewrite queries to use the views? All these problems apply to production ML.
- The most immediate analogy between data engineering and ML engineering is around correctness. We have many checks in place & SLAs to make sure the results of recurring data pipelines are fresh and correct. On-call rotations ensure timely response to failures. I’d argue that we should treat the maintenance of prod ML models similarly.
- Top things to keep in mind: establish SLAs on model staleness and correctness & human-centric processes to ensure them. A bug can usually be traced back to incorrectness or staleness, where the fixes are ensuring well-formed features or retraining.
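As a rough illustration of what SLAs on staleness and correctness might look like in code, here is a minimal, hypothetical check. The thresholds, the metric, and how violations would reach an on-call rotation are all assumptions made for the sketch:

```python
import datetime

# A minimal, hypothetical sketch of SLAs on model staleness and correctness.
# The thresholds, the metric, and what happens on a violation are all invented;
# in practice an on-call rotation would be paged on any violation.

STALENESS_SLA = datetime.timedelta(days=7)  # e.g. retrain at least weekly
CORRECTNESS_SLA = 0.75                      # minimum acceptable live metric


def check_slas(trained_at: datetime.datetime, live_metric: float) -> list[str]:
    """Return a list of SLA violations for one production model."""
    violations = []
    age = datetime.datetime.now() - trained_at
    if age > STALENESS_SLA:
        violations.append(f"staleness: model is {age.days} days old (SLA: {STALENESS_SLA.days} days)")
    if live_metric < CORRECTNESS_SLA:
        violations.append(f"correctness: live metric {live_metric:.2f} is below {CORRECTNESS_SLA}")
    return violations


# Example: a ten-day-old model whose live metric has slipped below the SLA.
trained_at = datetime.datetime.now() - datetime.timedelta(days=10)
print(check_slas(trained_at, live_metric=0.71))
```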
Join us on slack for more such conversations and also join us for our next fireside chat: How to Build an Enterprise Machine Learning Platform from Scratch with Russell Brooks (Realtor.com). We’ll be discussing what building an enterprise ML platform from scratch looks like in practice, including the journeys experienced at both OpCity and Realtor.com, where he took both organizations from a bus factor of 1 to reproducible and automated ML-powered software. You can sign up here!
Start building today
Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions - nothing is off limits.