Applying the Practical Machine Learning Infrastructure Stack

Ville Tuulos
Hugo Bowne-Anderson

We recently wrote an essay for O’Reilly Radar about the machine learning deployment stack and why data makes software different. The questions we sought to answer were:

  1. Why does ML need special treatment in the first place? Can’t we just fold it into existing DevOps best practices?
  2. What does a modern technology stack for streamlined ML processes look like?
  3. How can you start applying the stack in practice today?

Our real goal was to begin a conversation about how to build the foundation of a canonical infrastructure stack, along with best practices, for developing and deploying data-intensive software applications. As this needs to be an ongoing conversation, we decided to write more on each of the points above. In two previous posts, we answered the first two questions:

  1. Why does ML need special treatment in the first place?
  2. What does a modern technology stack for streamlined ML processes look like?

In doing so, it became clear why we can’t just fold machine learning into existing DevOps best practices. It also became apparent that patterns are emerging in terms of the major infrastructural layers across a large number of projects:

Adapted from the book Effective Data Science Infrastructure

In this post, we'll dive right into the third question:

How can you start applying the stack in practice today?

How: Wrapping the Stack for Maximum Usability

Imagine choosing a production-grade solution for each layer of the stack: for instance, Snowflake for data, Kubernetes for compute (container orchestration), and Argo for workflow orchestration. While each system does a good job in its own domain, it is not trivial to build a data-intensive application that has cross-cutting concerns touching all the foundational layers. In addition, you have to layer the higher-level concerns from versioning to model development on top of the already complex stack. It is not realistic to ask a data scientist to prototype quickly and deploy to production with confidence using such a contraption. Adding more YAML to cover cracks in the stack is not an adequate solution.

Many data-centric environments of the previous generation, such as Excel and RStudio, really shine at maximizing usability and developer productivity. Ideally, we would wrap the production-grade infrastructure stack inside a developer-oriented user interface. Such an interface should allow the data scientist to focus on the concerns most relevant to them, namely the topmost layers of the stack, while abstracting away the foundational layers.

The combination of a production-grade core and a user-friendly shell ensures that ML applications can be prototyped rapidly, deployed to production, and brought back to the prototyping environment for continuous improvement. Iteration cycles should be measured in hours or days, not months.

Metaflow is an open-source framework, originally developed at Netflix and now supported here at Outerbounds, specifically designed to address this concern: how can we wrap robust production infrastructure in a single coherent, easy-to-use interface for data scientists? Under the hood, Metaflow integrates with best-of-breed production infrastructure, such as Kubernetes and AWS Step Functions, while providing a development experience inspired by data-centric programming: it treats local prototyping as a first-class citizen.

When evaluating solutions, consider focusing on the three key dimensions covered in this series of posts:

  1. Does the solution provide a delightful user experience for data scientists and ML engineers? There is no fundamental reason why data scientists should accept a worse level of productivity than is achievable with existing data-centric tools.
  2. Does the solution provide first-class support for rapid iterative development and frictionless A/B testing? It should be easy to take projects quickly from prototype to production and back, so production issues can be reproduced and debugged locally.
  3. Does the solution integrate with your existing infrastructure, in particular with the foundational data, compute, and orchestration layers? It is not productive to operate ML as an island. When running ML in production, it is beneficial to leverage existing production tooling, for example for observability and deployments, as much as possible.

It is safe to say that all existing solutions still have room for improvement. Yet it seems inevitable that over the next five years the whole stack will mature, and the user experience will converge toward, and eventually surpass, the best data-centric IDEs. Businesses will learn to create value with ML much as they do with traditional software engineering, and empirical, data-driven development will take its place among other ubiquitous software development paradigms.

If you’d like to get in touch or discuss such issues, come say hi on our community Slack! 👋
