Full Stack Machine Learning, ML Engineering, and SWE Skills for Data Science


We recently had a fireside chat with Ethan Rosenthal (Square) about Navigating the Full Stack of Machine Learning. The conversation revolved around the wild west of full stack machine learning: how to make sense of all the feature stores, metric layers, model monitoring, and more, with a view to deciphering which mental models, tools, and abstraction layers are most helpful in delivering actual ROI using ML.

We covered a lot of ground so we wanted to share some highlights here. The following are rough transcripts edited for legibility. We’ve included clips of the relevant parts of the conversation and, if you’re interested, you can watch the entire conversation here:

What Does a Machine Learning Engineer Do?

Hugo: I want to learn more about your journey to becoming an AI engineering manager. But first, since we've been using the term ML engineering, I want to talk about what it actually means: what does a machine learning engineer do in your experience?

Ethan: Good question. I would say one definition is somebody who builds machine learning models. That could be, for example, a statistical model, and the boundaries between statistics and machine learning are pretty fuzzy. But they build the model, and it gets used in some sort of automated fashion. That's probably the best definition I have.

So you might have a statistician who builds a model and then runs inference on it, interrogating the model to understand some behavior that exists. But that is more of an offline process. A machine learning engineer builds the model, and then the model gets used automatically. Maybe it serves predictions behind an API, or maybe it runs once a day and generates predictions that get written to a database, and then somebody else ends up using those predictions. The key is that the model gets used in some sort of automated fashion.
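
To make the batch case concrete, here is a minimal sketch of a model that "runs once a day" and writes its predictions to a database for someone else to use. It assumes a hypothetical toy linear model (`WEIGHTS`, `BIAS`) and an in-memory SQLite table named `predictions`. A real job would load a trained model, write to a shared database, and be triggered daily by a scheduler.

```python
import sqlite3
from datetime import date

# Hypothetical toy "model": fixed linear weights standing in for a trained model.
WEIGHTS = {"num_orders": 0.3, "days_since_last_order": -0.05}
BIAS = 0.1


def predict(features: dict) -> float:
    """Return a linear score for one customer's features."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())


def run_daily_batch(conn: sqlite3.Connection, rows: list, run_date: str) -> int:
    """Score every (customer_id, features) row and store the predictions for downstream consumers."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "customer_id TEXT, run_date TEXT, score REAL, "
        "PRIMARY KEY (customer_id, run_date))"
    )
    records = [(cid, run_date, predict(feats)) for cid, feats in rows]
    # INSERT OR REPLACE makes re-running the same day's batch idempotent.
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [
        ("c1", {"num_orders": 4, "days_since_last_order": 2}),
        ("c2", {"num_orders": 1, "days_since_last_order": 30}),
    ]
    run_daily_batch(conn, batch, date.today().isoformat())
    # Somebody else (a dashboard, another service) later reads the stored predictions.
    for row in conn.execute("SELECT customer_id, score FROM predictions ORDER BY customer_id"):
        print(row)
```

The composite primary key on customer and date means a failed job can simply be re-run without duplicating rows.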

Hugo: So, in some ways, it's a data scientist who specifically builds models that are deployed to production, or something along those lines. But they're not necessarily engineers per se, because the term engineer is so overloaded: we have ML engineers, data engineers, and platform engineers. This role is more on the scientific side, focusing on the top levels of the stack that we'll get to.

Ethan: Yes, I think so. At Square, we have the title machine learning engineer, and I think what I've said largely matches what those people do. At a bigger company like Square, there is also a platform team, which is responsible for things like building out a platform that allows models to be served with low latency and high concurrency, and everything else you might care about at scale.

But the machine learning engineers are the ones building the actual model, and they're probably the ones responsible for how the model impacts the ecosystem it operates in. Your model generates predictions, and maybe somebody is making decisions based on those predictions. So how do you track that process? That's probably the job of the person who built the model.

Software Engineering Skills for Early Career Data Scientists

Hugo: In your first job as a data scientist, what was the focus? And what did other people around you do, maybe with more platform engineering skills and that type of stuff?

Ethan: Deciding where to go for that first job ended up being easy, because I only got one job offer. So I took it. But it was high on my list originally.

When I was doing the bootcamp, a bunch of companies came in and told us what data science meant. Even back then there was a split between people doing more analytics and people doing more of what we would now call machine learning engineering, though I don't think that was a very popular term back in 2015.

Back then I found myself gravitating towards putting models into production. I was interested in that, and this role seemed to have some of it. At my first job, working on recommendation systems, there were some things I could do. I was definitely allowed to train models, and I had free rein over a big cluster in a data center for training models and things like that. But the actual model deployment process was sometimes difficult and definitely outside of my wheelhouse.

I would also say that, back then, the production requirements placed on somebody pushing their code out depended on what they were working on. For something like recommendation systems at a reasonable scale, it's not the end of the world if somebody gets served a bad recommendation. If Amazon puts out a bad algorithm, then because they operate at such a scale, that might even cut into their margins. But at our scale, it was lower stakes.

So I was allowed to play in that world a bit, even though I did not have a background in software engineering. Whenever you look at code you wrote in the past, you should shudder, because ideally you've learned something since then. If I look back at the code I wrote then, it was probably not production-quality code: I didn't write tests or follow any of the other hallmarks of modern software development. But that was okay at the time, for two reasons. One: the stakes I was operating with were low. Two: I think nobody really expected production software engineering from machine learning people at that point in time. It was hard enough to find anybody who could program up some of these algorithms to begin with, so production-level software engineering would have just been a nice-to-have.

Essential ML Libraries for Data Scientists and MLEs

Hugo: What are the essential ML libraries that we need to know and use to help land a full-time data science job?

Ethan: I think they're the same ones that have existed for a while. If you're on the Python stack, scikit-learn is still very, very popular. We ask interview questions here where you're allowed to use scikit-learn, and if you know how to use it, that's a lot better than having to program logistic regression from scratch. So I feel like the PyData stack (scikit-learn, pandas, NumPy) continues to be the workhorse for a lot of this. If you need to know deep learning, then TensorFlow or PyTorch are probably fine. I think those are it. Spark is starting to get popular, but I still don't know Spark. I had to use it a tiny bit at my first job and it was really painful, and I've somehow managed to avoid it this entire time. I hope that some of you can as well.
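
As a rough illustration of not having to write logistic regression from scratch, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data from `make_classification`. The dataset and parameter values are arbitrary choices for the example and don't come from any actual interview question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for whatever dataset an interview question provides.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A few lines replace a hand-written gradient-descent implementation.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
print("P(y=1) for first test row:", model.predict_proba(X_test[:1])[0, 1])
```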

Hugo: Yeah, I don't know about landing jobs per se, but you'll want something that lets you do gradient-boosted trees. XGBoost is one example, and there are others as well. I always say that if you can build random forests and boosted trees and do a bit of deep learning, that will help you handle most machine learning questions, right? So if you know some scikit-learn, some deep learning of some sort, and some boosted trees, you're doing pretty well.

Ethan: 100%. We're big fans of XGBoost at Square.

Hugo: Awesome. And the stack, I mean NumPy, pandas, matplotlib, and a few of the other foundational PyData and SciPy packages, is incredibly useful.

Ethan: And I think just being comfortable using those libraries matters. I know pandas can be complicated; I still get confused by the API. But the more comfortable you are and the quicker you can slice and dice your data, the more it will pay dividends, and it will definitely help you move through interviews faster.
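
Much of the quick slicing and dicing Ethan describes comes down to boolean filtering and group-by aggregation. The sketch below uses a small, made-up payments table (the column names are hypothetical) and pandas' named aggregation.

```python
import pandas as pd

# Hypothetical payments table, used only for illustration.
payments = pd.DataFrame({
    "seller_id": ["a", "a", "b", "b", "b", "c"],
    "amount": [20.0, 35.0, 5.0, 12.5, 80.0, 40.0],
    "is_fraud": [False, False, False, True, False, False],
})

# Filter rows with a boolean mask.
large = payments[payments["amount"] > 15]

# Group and compute several statistics at once with named aggregation.
summary = payments.groupby("seller_id").agg(
    n_payments=("amount", "size"),
    total_amount=("amount", "sum"),
    fraud_rate=("is_fraud", "mean"),  # mean of a boolean column is the share of True values
)

print(large)
print(summary.sort_values("total_amount", ascending=False))
```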

Hugo: Absolutely. And relatedly, knowing your way around the Jupyter ecosystem (notebooks and JupyterLab) can be incredibly useful, along with a bunch of basic Bash and terminal stuff.

Ethan: Yeah. My team are big fans of Bash and Makefiles. I'm not going to tell anybody to go out and start building their own Makefiles, but you will come across a lot of terminal stuff, and it can only help to get fast and nimble with it.

Hugo: Absolutely. We're going to get to this later, but there are also some basic software engineering skills. I would say the top two or three are maybe version control, refactoring...

Ethan: Writing tests.

Hugo: Yeah! Testing code. Data testing can also be incredibly useful. Using something like pytest can be really cool.

Ethan: Our testing world is much harder than the software engineers' testing world.

Hugo: Because it involves a lot of the real world.

Ethan: Yeah... data changes.
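
One way to deal with the fact that data changes is to test the data itself, not just the code. The sketch below pairs a hypothetical `validate_payments` check with pytest-style test functions. The allowed currencies and the rules themselves are assumptions made for illustration. Because the check returns a list of problems instead of raising, the same function could also be run on each new batch of data in production, where a test suite that passed in CI says nothing about tomorrow's inputs.

```python
import math

ALLOWED_CURRENCIES = {"USD", "CAD", "GBP"}  # hypothetical rule for illustration


def validate_payments(rows: list) -> list:
    """Return human-readable problems found in a batch of payment records."""
    problems = []
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None or (isinstance(amount, float) and math.isnan(amount)):
            problems.append(f"row {i}: missing amount")
        elif amount < 0:
            problems.append(f"row {i}: negative amount {amount}")
        if row.get("currency") not in ALLOWED_CURRENCIES:
            problems.append(f"row {i}: unexpected currency {row.get('currency')!r}")
    return problems


# pytest collects functions named test_* from the test files it discovers;
# they also work as plain functions.
def test_clean_batch_passes():
    rows = [{"amount": 10.0, "currency": "USD"}, {"amount": 0.0, "currency": "CAD"}]
    assert validate_payments(rows) == []


def test_bad_rows_are_reported():
    rows = [{"amount": -5.0, "currency": "USD"}, {"amount": float("nan"), "currency": "EUR"}]
    problems = validate_payments(rows)
    assert "row 0: negative amount -5.0" in problems
    assert "row 1: missing amount" in problems
    assert "row 1: unexpected currency 'EUR'" in problems


if __name__ == "__main__":
    test_clean_batch_passes()
    test_bad_rows_are_reported()
    print("all data tests passed")
```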

Measuring the Success of Machine Learning Projects

Hugo: How do you think about measuring the success of machine learning projects?

Ethan: I wanted to join the risk team when I came here because I liked that their definition of success was very clear. I mentioned that I used to work in recommendation systems. You can run A/B tests with recommendation systems to convince yourself that you've been valuable to the company, that you've made the company money, and things like that. But sometimes, especially at a reasonable scale, you might need very long time periods to convince yourself that the improvements you've made to an algorithm have had a large impact on the company. So I was very interested in the risk team here, because it operates at a large scale and we're dealing with money, and the best way to measure success is usually with money.

On the risk team, we could run A/B tests where some payments are sent to one model and other payments go to another model, and then measure how much loss we incurred due to fraud under each model. That ends up being a fairly straightforward approach. It's not perfectly straightforward, because with fraud your losses are technically unbounded: if a bad actor figures out that they can steal your money, they're going to steal as much as possible. So actually doing it ends up being a bit of a causal inference problem. But it's a very nice way to measure success. On the chatbot side of things, the chatbot can back out and ask the business owner for help if it's not able to solve a problem for somebody, so we can measure how often it completes its goals. That ends up being a very clear measure of at least that model's success.
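
Here is a minimal sketch of the two measurements described above, under stated assumptions. Payments are split between two fraud models by hashing a payment ID with a hypothetical `assign_arm` helper and an arbitrary salt. Fraud losses are compared per dollar of processed volume. Chatbot success is the share of conversations that were not handed off to the business owner. Because fraud losses are unbounded, a naive comparison like this can be dominated by a single bad actor. That is why Ethan frames the real analysis as a causal inference problem, and this code does not attempt to address it.

```python
import hashlib
from collections import defaultdict


def assign_arm(payment_id: str, salt: str = "fraud-model-test") -> str:
    """Deterministically split payments roughly 50/50 between two models by hashing the ID."""
    digest = hashlib.sha256(f"{salt}:{payment_id}".encode()).hexdigest()
    return "model_a" if int(digest, 16) % 2 == 0 else "model_b"


def loss_rate_by_arm(payments: list) -> dict:
    """Fraud losses per dollar of processed volume, for each arm."""
    volume = defaultdict(float)
    losses = defaultdict(float)
    for p in payments:
        arm = assign_arm(p["id"])
        volume[arm] += p["amount"]
        losses[arm] += p["fraud_loss"]
    return {arm: losses[arm] / volume[arm] for arm in volume}


def completion_rate(conversations: list) -> float:
    """Share of chatbot conversations resolved without handing off to the business owner."""
    completed = sum(1 for c in conversations if not c["handed_off"])
    return completed / len(conversations)
```

Hashing instead of random sampling means the assignment for any payment can be recomputed later, when its fraud losses finally become known.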

What Exactly is the Full Stack of Machine Learning?

Hugo: Data scientists really want to be thinking a lot more about the top layers of this stack, but have easy access to the bottom layers. That's how we tend to think about it. I'm wondering how this resonates with you, what you'd add to it, and how you think about the full stack more generally.

Ethan: Yeah, I think data scientists here are largely like that as well. You want to focus on the modeling, but you want access to all of the superpowers that the cloud gives you. I don't want the scale of my data to matter. I don't want to have to wait for what I'm doing, so I want things to be fast, which might mean I want things to run in parallel. And I want to be able to store as much data as I want.

You want to focus on the modeling. But unfortunately, right now (and to be clear, I haven't used Metaflow, so maybe it solves everything), the more you try to avoid the lower parts of that stack, the more you try to avoid touching the cloud and compute, the harder a time you're going to have, because our abstraction layers are not very good right now. My team is on AWS right now, and when I was on the fraud team here, they were on Google Cloud, so I've worked across both clouds. We do have some platform-level tools that start to do a good job of abstracting away the compute layers and things like that. But inevitably you run into nasty edges. When you start to think about things like permissions, especially at a big company where security and networking are very important, you bump into permissions, which ends up pulling you down into the lower layers of the stack.

In terms of how I think about it, the orchestration layer you had is one highly important part. A big part of machine learning work is that you want to write your code locally, but you can't really do your work unless you're in the cloud. Software engineers can often write their application code locally, spin it up locally, run all of their tests locally, and then deploy it to the cloud. In our world, training your model on test data locally lets you find some bugs and write some tests, but you often can't actually train your model unless you're doing it somewhere up in the cloud. Working in the cloud is difficult, so you want to work locally, but then you need to deploy everything up into the cloud. I think minimizing the cost of switching from local to the cloud is important, and it is very difficult right now.

That orchestration layer really is it. I want to do all these things, but I want to do them up in the cloud in some agnostic way, where maybe this code runs on this computer and that code runs on another computer. That ends up being a huge part of it.
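
The idea of declaring work once and choosing where it runs can be sketched in a few lines. Everything here is hypothetical (`Step`, `LocalRunner`, `run_pipeline`, the resource hints) and does not mirror any particular orchestration tool's API. Each step carries resource hints that a remote backend could use. A local runner ignores those hints, so the same pipeline can run on a tiny sample during development.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    fn: Callable[[dict], dict]
    # Resource hints a remote backend could use; the local runner ignores them.
    resources: dict = field(default_factory=dict)


class LocalRunner:
    """Runs every step in-process, e.g. on a small sample of data while developing."""

    def run(self, step: Step, state: dict) -> dict:
        return step.fn(state)


def run_pipeline(steps: list, runner, state: dict) -> dict:
    for step in steps:
        # Each step reads the state so far and returns new keys to merge in.
        state = {**state, **runner.run(step, state)}
    return state


def load(state: dict) -> dict:
    return {"rows": list(range(state["n_rows"]))}


def train(state: dict) -> dict:
    # Stand-in for model training: the "model" is just the mean of the data.
    rows = state["rows"]
    return {"model": sum(rows) / len(rows)}


PIPELINE = [
    Step("load", load),
    Step("train", train, resources={"gpu": 1, "memory_gb": 64}),
]

if __name__ == "__main__":
    # Locally: a tiny sample. A hypothetical CloudRunner exposing the same run() method
    # could send the "train" step to a larger machine with the full dataset.
    result = run_pipeline(PIPELINE, LocalRunner(), {"n_rows": 10})
    print(result["model"])
```

Because the runner is the only thing that changes between local and remote execution, this design keeps the cost of switching low. Workflow tools implement the same idea far more completely, including packaging code, moving data, and handling permissions.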

Summary

After our fireside chats, we hold async AMAs with our guests on our community Slack. A lot tends to happen there, so join us if you're interested in such conversations! This week, for example, Michael Ward asked:

On the topic of tests – often in traditional software dev, the test suite can be more important for ensuring consistent behavior of a system for future refactorings and feature additions.

Is it the same with ML codebases? Or perhaps, because system behavior can’t be as easily encoded into the type system, does the test suite serve more as an exploration of model edge cases?

And Ethan replied as follows:

I think you still need all of the testing of traditional software dev. This talk from Google looks at failure modes for one of their ML systems, and non-ML failure modes were still the most common. That said, it's also helpful to test models beyond traditional software testing. For me, this is often simple behavioral testing, such as “the model's loss should decrease on each epoch with a simple set of training data”.
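
Ethan's example of a behavioral test can be written as an ordinary unit test. The sketch below uses a tiny one-feature logistic regression trained with full-batch gradient descent in plain Python, standing in for a real model. It asserts that the training loss strictly decreases every epoch on a small, separable dataset. With full-batch gradient descent, a modest learning rate, and a convex loss like this one, the property should hold. A failure therefore points to a bug, such as a sign error in the gradient.

```python
import math


def train_logistic_regression(xs, ys, lr=0.1, epochs=20):
    """Full-batch gradient descent on 1-D logistic regression; returns the mean loss per epoch."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        grad_w = grad_b = loss = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)  # cross-entropy
            grad_w += (p - y) * x
            grad_b += p - y
        losses.append(loss / n)
        # Update after recording, so each entry is the loss at the start of that epoch.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return losses


def test_loss_decreases_each_epoch():
    # A simple, separable dataset: negative x -> class 0, positive x -> class 1.
    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    ys = [0, 0, 0, 1, 1, 1]
    losses = train_logistic_regression(xs, ys)
    assert all(later < earlier for earlier, later in zip(losses, losses[1:]))


if __name__ == "__main__":
    test_loss_decreases_each_epoch()
    print("loss decreased on every epoch")
```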

Beyond all this, what really matters at the end of the day is how the models perform in the wild, and for that, we need good monitoring and analytics.
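
Monitoring can start simply. Ethan doesn't name a specific method, so the population stability index (PSI) is used here as one common drift metric. It compares the distribution of live model scores against a reference sample, such as scores at training time. The sketch below assumes equal-width bins over the reference range and adds a small epsilon to avoid taking the log of zero.

```python
import math


def population_stability_index(expected: list, actual: list, bins: int = 10) -> float:
    """PSI between a reference sample (e.g. training-time scores) and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    eps = 1e-6
    e, a = proportions(expected), proportions(actual)
    # Each term is non-negative: (a - e) and log(a / e) always share a sign.
    return sum((ai - ei) * math.log((ai + eps) / (ei + eps)) for ei, ai in zip(e, a))
```

A PSI of zero means the binned distributions match. Larger values mean the live scores have drifted away from the reference, which is a cue to look more closely at the model's performance in the wild.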

You can join us on Slack for more such conversations and can view the other fireside chats here.