We recently wrote an essay for O’Reilly Radar about the machine learning deployment stack and why data makes software different. The questions we sought to answer were:
- Why does ML need special treatment in the first place? Can’t we just fold it into existing DevOps best practices?
- What does a modern technology stack for streamlined ML processes look like?
- How can you start applying the stack in practice today?
Our real goal was to begin a conversation around how to build a foundation of a canonical infrastructure stack, along with best practices, for developing and deploying data-intensive software applications. As this needs to be an ongoing conversation, we thought to write more on each of the points above. In a previous post, we answered the first question:
Why does ML need special treatment in the first place?
In doing so, it became clear why we can’t just fold machine learning into existing DevOps best practices.
In this post, we’ll dive right into the second question:
What does a modern technology stack for streamlined ML processes look like?
What kind of foundation would the modern ML application require?
It should combine the best parts of modern production infrastructure to ensure robust deployments, as well as draw inspiration from data-centric programming to maximize productivity.
While implementation details vary, the major infrastructural layers we’ve seen emerge are relatively uniform across a large number of projects. Let’s now take a tour of the various layers, to begin to map the territory. Along the way, we’ll provide illustrative examples. The intention behind the examples is not to be comprehensive (perhaps a fool’s errand, anyway!), but to reference concrete tooling used today in order to ground what could otherwise be a somewhat abstract exercise.
Adapted from the book Effective Data Science Infrastructure
Foundational Infrastructure Layers
Data is at the core of any ML project, so data infrastructure is a foundational concern. ML use cases rarely dictate the master data management solution, so the ML stack needs to integrate with existing data warehouses. Cloud-based data warehouses, such as Snowflake, AWS’ portfolio of databases like RDS, Redshift or Aurora, or an S3-based data lake, are a great match to ML use cases since they tend to be much more scalable than traditional databases, both in terms of the data set sizes as well as query patterns.
To make data useful, we must be able to conduct large-scale compute easily. Since the needs of data-intensive applications are diverse, it is useful to have a general-purpose compute layer that can handle different types of tasks from IO-heavy data processing to training large models on GPUs. Besides variety, the number of tasks can be high too: imagine a single workflow that trains a separate model for 200 countries in the world, running a hyperparameter search over 100 parameters for each model – the workflow yields 20,000 parallel tasks.
Prior to the cloud, setting up and operating a cluster that can handle workloads like this would have been a major technical challenge. Today, a number of cloud-based, auto-scaling systems are easily available, such as AWS Batch. Kubernetes, a popular choice for general-purpose container orchestration, can be configured to work as a scalable batch compute layer, although the downside of its flexibility is increased complexity. Note that container orchestration for the compute layer is not to be confused with the workflow orchestration layer, which we will cover next.
The nature of computation is structured: we must be able to manage the complexity of applications by structuring them, for example, as a graph or a workflow that is orchestrated.
The workflow orchestrator needs to perform a seemingly simple task: given a workflow or DAG definition, execute the tasks defined by the graph in order using the compute layer. There are countless systems that can perform this task for small DAGs on a single server. However, as the workflow orchestrator plays a key role in ensuring that production workflows execute reliably, it makes sense to use a system that is both scalable and highly available, which leaves us with a few battle-hardened options, for instance: Airflow, a popular open-source workflow orchestrator; Argo, a newer orchestrator that runs natively on Kubernetes, and managed solutions such as Google Cloud Composer and AWS Step Functions.
Software Development Layers
While these three foundational layers, data, compute, and orchestration, are technically all we need to execute ML applications at arbitrary scale, building and operating ML applications directly on top of these components would be like hacking software in assembly language: technically possible but inconvenient and unproductive. To make people productive, we need higher levels of abstraction. Enter the software development layers.
ML app and software artifacts exist and evolve in a dynamic environment. To manage the dynamism, we can resort to taking snapshots that represent immutable points in time: of models, of data, of code, and of internal state. For this reason, we require a strong versioning layer.
While Git, GitHub, and other similar tools for software version control work well for code and the usual workflows of software development, they are a bit clunky for tracking all experiments, models, and data. To plug this gap, frameworks like Metaflow or MLFlow provide a custom solution for versioning.
Next, we need to consider who builds these applications and how. They are often built by data scientists who are not software engineers or computer science majors by training. Arguably, high-level programming languages like Python are the most expressive and efficient ways that humankind has conceived to formally define complex processes. It is hard to imagine a better way to express non-trivial business logic and convert mathematical concepts into an executable form.
However, not all Python code is equal. Python written in Jupyter notebooks following the tradition of data-centric programming is very different from Python used to implement a scalable web server. To make the data scientists maximally productive, we want to provide supporting software architecture in terms of APIs and libraries that allow them to focus on data, not on the machines.
Data Science Layers
With these five layers, we can present a highly productive, data-centric software interface that enables iterative development of large-scale data-intensive applications. However, none of these layers help with modeling and optimization. We cannot expect data scientists to write modeling frameworks like PyTorch or optimizers like Adam from scratch! Furthermore, there are steps that are needed to go from raw data to features required by models.
When it comes to data science and modeling, we separate three concerns, starting from the most practical progressing towards the most theoretical. Assuming you have a model, how can you use it effectively? Perhaps you want to produce predictions in real-time or as a batch process. No matter what you do, you should monitor the quality of the results. Altogether, we can group these practical concerns in the model operations layer. There are many new tools in this space helping with various aspects of operations, including Seldon for model deployments, Weights and Biases for model monitoring, and TruEra for model explainability.
Before you have a model, you have to decide how to feed it with labelled data. Managing the process of converting raw facts to features is a deep topic of its own, potentially involving feature encoders, feature stores, and so on. Producing labels is another, equally deep topic. You want to carefully manage consistency of data between training and predictions, as well as make sure that there’s no leakage of information when models are being trained and tested with historical data. We bucket these questions in the feature engineering layer. There’s an emerging space of ML-focused feature stores such as Tecton or labeling solutions like Scale and Snorkel. Feature stores aim to solve the challenge that many data scientists in an organization require similar data transformations and features for their work and labeling solutions deal with the very real challenges associated with hand labeling datasets.
Finally, at the very top of the stack we get to the question of mathematical modeling: What kind of modeling technique to use? What model architecture is most suitable for the task? How to parameterize the model? Fortunately, excellent off-the-shelf libraries like scikit-learn and PyTorch are available to help with model development.
An overarching concern: Correctness and Testing
Regardless of the systems we use at each layer of the stack, we want to guarantee the correctness of results. In traditional software engineering we can do this by writing tests: for instance, a unit test can be used to check the behavior of a function with predetermined inputs. Since we know exactly how the function is implemented, we can convince ourselves through inductive reasoning that the function should work correctly, based on the correctness of a unit test.
This process doesn’t work when the function, such as a model, is opaque to us. We must resort to black box testing – testing the behavior of the function with a wide range of inputs. Even worse, sophisticated ML applications can take a huge number of contextual data points as inputs, like the time of day, user’s past behavior, or device type into account, so an accurate test set up may need to become a full-fledged simulator.
Since building an accurate simulator is a highly non-trivial challenge in itself, often it is easier to use a slice of the real-world as a simulator and A/B test the application in production against a known baseline. To make A/B testing possible, all layers of the stack should be be able to run many versions of the application concurrently, so an arbitrary number of production-like deployments can be run simultaneously. This poses a challenge to many infrastructure tools of today, which have been designed for more rigid traditional software in mind. Besides infrastructure, effective A/B testing requires a control plane, a modern experimentation platform, such as StatSig.
Thanks for reading and keep your eyes open for our post on how you can start applying the stack in practice today! If you’d like to get in touch or discuss such issues, come say hi on our community Slack! 👋