The Moving Parts of the Full Machine Learning Stack and Building ML Platforms

We recently had a fireside chat with Russell Brooks about How to Build an Enterprise Machine Learning Platform from Scratch. The conversation revolved around what building an enterprise ML platform from scratch looks like in practice, including the journeys he experienced at both OpCity and, where he took both organizations from a bus factor of 1 to reproducible and automated ML-powered software.

We covered a lot of ground so we wanted to share some highlights here. The following are rough transcripts edited for legibility. We’ve included clips of the relevant parts of the conversation and, if you’re interested, you can watch the entire conversation here:

The Value that Machine Learning Creates for

Hugo: Why does do machine learning and data science? What type of value do these functions create for

Russell: Yeah, ultimately, that’s the number one question to answer, right? I mean, if ML is doing nothing, and data science is doing nothing, our company is just throwing money down the drain. A lot of money! I think the answer is sometimes yes, actually, for quite a few companies. But in aggregate, long story short, it’s definitely providing value for a lot of people and at, it’s a real estate platform that just does business in the United States. It competes with companies like Zillow, where it’s an aggregator of all the MLS data, all the real estate, homes for sale or rent, apartments for rent across the entire country, to grab that content, aggregate it, make it nice, presentable, clean, and, you know, help consumers find a place to live.

That’s ultimately the goal of the company, and from a machine learning standpoint, I mean, there’s a whole suite of stuff: there’s personalization on the consumer-facing website, there’s search, there are recommendation engines, there’s, of course, image data and NLP data that we get, and we have deep learning models for that. We have models to help people find real estate professionals, like real estate agents, that are well suited to their interests: their price range, where they live, all sorts of different factors that are specific to them. We have probabilistic models for that, that are Bayesian, we have tree-based models, we have the whole gamut. We even have audio data that we have some other deep learning models running over.

Hugo: Yeah, great. And something else I’m hearing in there is that I mean, a lot of things, including real estate, can boil down to matching problems in a lot of ways. So finding the right set of properties for the right user. So it’s really, I mean, machine learning seems not only interesting but almost fundamental in the computation age.

Russell: Sure. I mean, there’s not only a huge variability in housing, but there’s also a huge fluctuation in housing inventory: what is available, when, where, what’s relevant, how can you notify me so I can get it. Because I think that when people are in a home-buying process, or even an apartment process, they have a goal. It’s a very real-world thing that they need to solve, which is like, I need shelter. It’s not just like, I’ve gotta go watch some video on YouTube to entertain myself, it’s usually like, how can I tangibly solve a problem that I’m trying to solve in my life?

How to Demonstrate the Value of Data Science and ML in Your Organization

Hugo: I’m interested in how you think about demonstrating the impact of data and machine learning functions in organizations such as Opcity, and, essentially getting everyone on board for the project.

Russell: There are definitely a lot of ways in a pure business sense, right? There’s of course monetization. So if you have the ability to measure your impact on revenue, whether that’s directly on revenue or via proxy metrics, you know, things like your normal website traffic metrics, where it’s like, what’s the click-through rate, dwell time, user engagement and churn rates, pretty standard website metrics nowadays, you have those.

But there are even a lot of interesting specifics to real estate. It’s very slow moving, it’s the real world: buying a house takes months, even if you’re moving quickly. So measuring impact is not always an immediate response, where you can very quickly say, Oh, I’m going to build a real-time model feedback loop, and it’s going to retrain every day, and this and that. Yes, I can for certain parts of it, but you also have to acknowledge that there are other parts where the ground truth data, the actuals, take months to arrive.

So in terms of adding impact, one way to add impact is even just to let people move more quickly. We have a whole suite of models that are basically time-to-event modeling, where we can say, okay, rather than having to wait six months to observe if something’s gonna happen, maybe we can build predictive models that, with a high degree of confidence, can say this will or will not happen, but instead of six months, maybe it’s one month. In turn, we can then experiment and iterate much more quickly.
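To make the time-to-event idea a little more concrete, here is a minimal sketch of its simplest building block, a Kaplan-Meier estimator, in plain Python. It is not the modeling approach Russell’s team used (the conversation does not specify one), and the lead data below is entirely hypothetical. What it illustrates is why you do not have to wait for every outcome: records that are still open ("censored") contribute information up to the last day they were observed.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Estimate the probability that the event has NOT happened by each event time.

    durations: days each record was followed.
    observed:  True if the event happened at that duration, False if the
               record is censored (still open when we stopped watching).
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")

    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # events and censored records both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk records that survived time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: days until a lead closed (True) or was last seen open (False).
durations = [30, 45, 45, 60, 90, 120, 150, 180]
observed = [True, True, False, True, False, True, False, False]

for day, prob in kaplan_meier(durations, observed):
    print(f"day {day}: P(not yet closed) = {prob:.3f}")
```

A predictive model of the kind described above would go further, conditioning on features of each record, but the same treatment of censored observations is what lets it learn from outcomes that have not fully played out yet.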

How to Build an Enterprise ML Platform from Scratch

Hugo: You built out the entire machine learning platform at, taking both organizations from a bus factor of one to reproducible automated machine learning-powered software. Tell us about this journey.

Russell: Yeah, I mean, I think an interesting component of how you build these things is that the industry has been actively coalescing on its direction over time as well. If you think back just 5 or 10 years, machine learning was starting to get a lot of traction, and a lot of companies wanted to have it but didn’t really have it. I would even argue that to this day, most companies, especially non-tech companies, are still figuring it out. They’re still trying to decide: how many people do we need for this? Is it part of analytics? Is it part of engineering? Is it its own thing?

So I think there’s very much still a lot of uncertainty, even just in the organizational sense of how these things work, much less how that organization, in turn, operates on infrastructure. Is it a shared AWS account? Is it a separate AWS account? Do you have engineering people that put you in a little sandbox and say, Here’s your sandbox, have fun, don’t create production? There are so many ways you could do this, depending on the specifics of your company. Take the opposite example of being a startup. It’s kind of nice, right? Because you have a clean slate.

So it’s like, okay, there’s no tech debt. That’s cool. You don’t have to worry about like, Oh, am I going to piss off someone when I say let’s go rewrite this thing from scratch because it sucks. Instead, it’s like, okay, I can just use industry best practices straight from the get-go. That’s cool. But the flip side of that is, when you have a clean slate, it’s kind of overwhelming. You log into AWS, and there are like 1500 services, or whatever they have now. Like, where do you start? Right?

For us, it was basically like, okay, you don’t have many people. So how it started was basically just solving for yourself at that point. And it’s like, Alright, let’s get some EC2 instances going, let’s just do what we were doing on our laptop over here on this AWS server. It’s the laptop-in-the-cloud mentality that I know you guys have in your documentation as well, which is, I think, really empowering. Cloud-first development as a paradigm, I think, is here to stay. It is only going to become more prevalent, more common, just because it makes sense.

Imagine you’re trying to query some database, grab some data. Why download that dataset over the public internet, to your home Wi-Fi, onto your laptop, just to do something with it? Just keep it in AWS. It’s safer. It’s more secure. It’s higher bandwidth. It’s just better all around, really. So basically, you start with that and then gradually add complexity. You start with that and you’re like, okay, we need data, how do we get it? Do we attach to our production database? Let’s just make a copy: we’ll spin up our own separate replica database, and we’ll hit that.

And then gradually, it’s like, okay, we need automation. Let’s start using AWS Batch and Step Functions. And this is actually pre-Metaflow at this point, just for context. When Metaflow really started fitting in was pretty much right after the open source release. So in 2019, I think, around there, there was an AWS re:Invent presentation that I was watching. And as I was watching it, I was like, oh my god, these are all the same services we were already using; they’re just building a really nice Pythonic abstraction for all the same crap that I was having to do manually by hand. Like writing Amazon States Language definitions for Step Functions. If anyone’s ever had to do that by hand, I feel sorry for you. It sucks.
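For readers who have never written a Step Functions definition by hand, the sketch below gives a rough sense of what that manual work looks like. It builds a small Amazon States Language document that runs three AWS Batch jobs in sequence. The helper `linear_state_machine`, the step names, and the `ml-queue` / `ml-job:1` values are hypothetical placeholders, not anything from Russell’s setup; the state fields and the Batch integration resource string follow the Amazon States Language conventions.

```python
import json


def linear_state_machine(step_names, job_queue, job_definition):
    """Build an Amazon States Language definition that runs Batch jobs in order.

    Hypothetical helper for illustration: every step submits the same job
    definition to the same queue and waits for it to finish.
    """
    if not step_names:
        raise ValueError("at least one step is required")

    states = {}
    for i, name in enumerate(step_names):
        state = {
            "Type": "Task",
            # Step Functions service integration that submits a Batch job
            # and waits for completion.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": name,
                "JobQueue": job_queue,
                "JobDefinition": job_definition,
            },
        }
        if i + 1 < len(step_names):
            state["Next"] = step_names[i + 1]  # wire each step to the following one
        else:
            state["End"] = True  # the last state terminates the execution
        states[name] = state

    return {"StartAt": step_names[0], "States": states}


definition = linear_state_machine(
    ["extract", "train", "score"],  # hypothetical pipeline steps
    job_queue="ml-queue",           # placeholder queue name
    job_definition="ml-job:1",      # placeholder job definition
)
print(json.dumps(definition, indent=2))
```

Even this toy pipeline expands into dozens of lines of JSON, and a real one also needs retries, error handling, and parameter passing between states, which is the boilerplate a Python-level abstraction can generate for you.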

Hugo: Some of our audience may not know what Metaflow is. Would you give me the elevator pitch on Metaflow?

Russell: I mean, I think the high level is just that it’s a clean abstraction on top of the data science and machine learning lifecycle. That might be too high level, right? Like, what does that even mean, a machine learning lifecycle? There are infrastructure components to that. There are the organizational aspects of that, like, who owns this code? What happens when it breaks? Is someone on call? How do I check what it was when it failed? So there are workflow components, organizational components, and infrastructure-as-code kind of components to it.

But broadly, I would just say, even back then, when I saw the presentation, I was like, okay, presentation, cool. I’m gonna just walk through this and see if it works. So within a day or two, I basically just downloaded the code, installed it, and made a little POC on top of some projects that I already had deployed. And, to my surprise, I was like, this is actually very nice. Because there were even other things that were trying to do the same abstractions back then for AWS Batch and Step Functions, like the AWS CDK. It was trying to be an infrastructure-as-code abstraction on top, but it still sucked. It wasn’t good.

Hugo: And one of the things that really resonates with me, and one way I think about Metaflow for our audience, is that it’s a framework or library that allows you to write idiomatic Python, and write all your science stuff for modeling, features, that kind of stuff, while allowing you to access all the infrastructure resources, such as Batch or Step Functions, or Kubernetes, or whatever it is, right? So that you can write the code that you want to write without messing around in configuration files all the time. It gives you access to version control for your models and data, and all of these types of things as well. But you raise an interesting point: one of the value props, or something I’ve heard a lot of people value in Metaflow, is that it is idiomatic Python. You don’t have to learn any special SDK or anything along those lines, right? You can write the code you know and love as quickly as possible.

Russell: I think one thing that I would like to summarize is that there are far too many tools in the ML space right now. I feel like we’re at an inflection point where we got a lot of hype, there’s a lot of VC money coming in building a lot of tools, and tools are definitely needed. But I think we also kind of need to let the tools fight it out, prune it down, and distill it down into some really nice, clean interfaces. That will be, I think, very nice for years to come. But yeah, with respect to that, I think Metaflow is one of those things where, like, many tools are functionally correct, but not very many things are tastefully correct. And the interface of Metaflow, in my mind, feels correct. That’s very, very nice.

Software and Platform Engineering Skills for Data Scientists

After our fireside chats, we have async AMAs with our guests on our community Slack. A lot tends to happen there, so join us if you’re interested in such conversations! This week, for example, Hugo asked:

It was really fun chatting about SWE skills for data scientists. I'm wondering what advice you would give data scientists who are interested in becoming platform engineers and vice versa?

And Russell replied as follows:

right back at ya Hugo!

my advice would be that it’ll certainly pay dividends regardless of which end of that continuum you’re coming from – if it’s something that interests you, dive right in!

there’s tons of value in having a “T-shaped” skillset, with a bit of exposure across the stack outside of your norm, even if it’s just helping you more effectively communicate across teams. I’m a fan of the philosophy outlined here.

more concretely for each, I’d say:

DS ➡️ Platform Eng

Get exposure to the infra that you’re already working with and using day-to-day (and is likely abstracted away a fair bit) – things like containerization, distributed processing, job schedulers/orchestration, monitoring/alerting, software-defined networking, and security. It can be easy to take for granted the complexity of infra, especially when it’s working well! Pick something you use, and try to dig down into its constituent parts – e.g. if you’re using a cloud-based VSCode devcontainer, how is that provisioned onto hardware? How are you connecting to that hardware? Does it use a public IP or a service like AWS SSM for secure communication? Does it go through your company Okta SSO or another means of authentication with TTL? Peel back the abstraction layers to try to understand how your tools fit together into a platform.

Platform Eng ➡️ DS

Investigate pain points you hear talking with your DS/ML teams – it’s nice to have collaboration, they’ll likely be super grateful for the help, and along the way you can gain exposure to the types of problems the business is trying to solve. From those problems, find one that sounds interesting to you and use it as a creative outlet to apply what you’re learning and try to replicate some modeling techniques. Think if there are any unique new features you could add, try them out, and realize that you might have an entirely different perspective or novel insight into the business process that the DS might not have had! Share your findings and ask for help critiquing the model/features!

Join us on Slack for more such conversations and also join us for our next fireside chat: Navigating the Full Stack of Machine Learning with Ethan Rosenthal (Square). We’ll be discussing the wild west of full stack machine learning and how to make sense of all the feature stores, metric layers, model monitoring, and more, with a view to deciphering what mental models, tools, and abstraction layers are most helpful in delivering actual ROI using ML. You can sign up here!
