Engineers from Chainguard and Outerbounds recently got together to discuss our common views around security and machine learning workflows. You can find the original version of this post on Chainguard's blog.
What do AI/ML infrastructure and secure container operating systems have in common? A whole lot, it turns out. Engineers from Chainguard and Outerbounds got to chatting recently and realized that we share a common philosophy, though we work on quite different problems. In this post, we’ll cover:
- What that philosophy is, and how it applies to both software supply chain security and data/ML infrastructure.
- How to assess the maturity of a machine learning workflow (and why the right time to improve it is now).
- The OSS tools (and products) that both Chainguard and Outerbounds offer data scientists, machine learning engineers (MLEs), and platform engineers, and how they can help.
Robustness is not the enemy of velocity
We’re driven by the belief that infrastructure must make developers and data scientists happier and more productive, and it can do so while improving reliability, debuggability, reproducibility, and security.
How can this be possible? The answer: thoughtful design, executed well. For instance, software must do the right thing by default. At Chainguard, that means secure by default: security measures must be integrated into every phase of the software development lifecycle, from dependencies to development to production. At Outerbounds, it means that the simplest way to write machine learning workflows is also one that seamlessly scales from your laptop to large fleets of GPUs and back, allowing rapid iteration between prototype and production. In both cases, data scientists, software engineers, and MLEs can focus on the top layers of the stack while having easy and robust access to the infrastructural layers:
Because we share this philosophy, together we have started to explore what it would mean to secure the machine learning supply chain without sacrificing velocity: operations teams and scientists can get along if they have the right tools.
What are we seeing today?
The correctness and security of a machine learning model depends on the training data itself, secure delivery of the training data, software dependencies (e.g. TensorFlow), the model code, the training environment, build steps to package up the trained model, and the deployment of that model to production. Together, these form the machine learning supply chain.
Chainguard has previously argued that there’s no difference in principle between a software supply chain and a machine learning supply chain when it comes to security—and the difference in practice is that the ML world is much less mature. Remember how software development looked 20 years ago, with no automated testing, code review, or even version control at many shops? Data science workflows are at a similar evolutionary phase.
This isn’t the fault of data scientists, who need to use cutting-edge techniques (and consequently, bleeding-edge, barely-tested libraries, which may have malware) and for whom rapid experimentation is table stakes. Rather, it reflects the early stages of tooling and infrastructure supporting these workflows.
Below, we describe a few desirable properties for ML workflows, and contrast standard practice with a gold standard workflow, with stories of the consequences.
Reproducibility
By reproducibility, we mean that we should be able to take a trained model and recreate it, ideally with one button push, years in the future. Reproducibility has a number of benefits:
- Reproducibility increases confidence in an analysis, from a scientific perspective.
- When collaborating, a shared environment avoids inconsistencies and hours trying to make development environments match up.
- Results in production will exactly mirror results from development or training.
- Reproducibility enables auditing for security: if one training environment is compromised, we can detect that by comparing results from a different environment.
- From sales funnels to cohort analysis to churn prediction, reproducibility of analysis is critical for the stability of business revenue.

Typical. A data scientist installs some dependencies, downloads data locally, and spins up a Jupyter notebook. After executing the notebook’s cells in an unpredictable order (some, several times in a row) while continuously modifying them, they export their model and call it a day.
Months later, the team learns that the data had errors. They’d like to update the model to reflect the corrected data, but they’re not even sure what dataset was used. They make a guess, but after spending a couple of days trying to figure out what dependencies the notebook has, the predictions are wildly off. This is due to a change in the behavior of some optimization function between versions of PyTorch, and the fact that the first training run had a lucky random seed that allowed their training process to converge—but they’ll never know that.
Better. While our data scientist uses Jupyter for local exploration, when preparing the model for production they commit a script to source control. They pin the model’s requirements at exact versions and manually set a random seed.
This process works well for a few months until the data scientist realizes that they’ll get promoted faster if they ship more models to production, and they reclaim the extra time they had been spending diligently documenting dependencies and use it to run new experiments. When the team needs to retrain a year-old model one day, it doesn’t converge due to an implicit operating system dependency that was never captured in the requirements file.
Best. The scientist uses tools that (1) automatically capture and make easily accessible ALL information key for reproducibility and (2) guarantee rerunnability: we’re not only talking code here, but data, models, packages, and environments, to name a few! Using the right framework, such as Metaflow, this isn’t any more work for a data scientist than using a Jupyter notebook.
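As a rough illustration of what “capturing ALL information” means, the sketch below builds a run manifest by hand: hashes of the code and data, pinned package versions, and the platform. This is not Metaflow’s actual API; a framework like Metaflow records this kind of metadata automatically with every run.

```python
import hashlib
import json
import platform
import sys

def run_manifest(code: str, data: bytes, packages: dict) -> dict:
    """Record everything needed to recreate a training run.

    A real workflow framework captures this automatically; we build
    the manifest by hand here just to show what belongs in it.
    """
    return {
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "packages": packages,              # pinned name -> version
        "python": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),   # OS / architecture
    }

manifest = run_manifest(
    code="model.fit(X, y)",                      # illustrative training code
    data=b"age,income\n42,50000\n",              # illustrative dataset
    packages={"torch": "2.1.0", "numpy": "1.26.4"},
)
print(json.dumps(manifest, indent=2))
```

With a manifest like this stored next to the model artifact, “rerun the year-old model” becomes a lookup rather than an archaeology project.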
Data sources and formats
The data used as input to an ML workflow can introduce vulnerabilities and run attacker code just like software dependencies (especially in the insecure pickle format). Unlike attacker code in a dependency, it tends not to have obvious giveaways, like eval() functions run on long base64-encoded blobs.
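To see why loading a pickle from an untrusted source is so dangerous, here is a minimal, deliberately benign payload. Unpickling it executes a callable chosen by whoever produced the file, via the standard `__reduce__` protocol:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # Tells pickle to call eval("6 * 7") at load time.
        # A real attacker would invoke os.system or similar instead.
        return (eval, ("6 * 7",))

payload = pickle.dumps(Malicious())

# Merely *loading* the bytes runs the attacker-chosen code;
# no method on the object ever needs to be called.
result = pickle.loads(payload)
print(result)  # 42
```

There is nothing to scan for in the training script itself: the dangerous behavior lives entirely inside the data file.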
Typical. A data scientist uses curl to fetch a dataset from some strange corner of the web; they record where in a comment. The training code loads data from /Users/alice/data-v2-final-actually-final.pickle. They save their trained model (again in .pickle format) and attach it to their repository using Git LFS.
When the dataset gets taken down, nobody is able to change the model ever again. And when the domain hosting the dataset expires, an attacker registers it and serves a .pickle file that looks for credit card numbers on the machine and sends them to a hostile server.
Better. Our scientist retrieves data from known sources (which have granted access) and copies it to their organization’s servers before using it. The training code fetches the data by a canonical, organization-wide name. The input data is a CSV, but the resulting model is pickled.
Others are able to work on the model, but weird inconsistencies arise when it turns out that there are two different versions of housing-data-2017.csv. When an advanced attacker compromises the training infrastructure, they add a backdoor into the resulting .pickle model and start exfiltrating data from production.
Best. Data comes from known sources with accompanying data bills of materials, indicating the source (and a description) of the data, along with a cryptographic hash of the data itself and a signature from the party that produced the dataset. Nothing is ever pickled; the data lives in formats like Avro and Parquet.
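A sketch of the verification step (the field names here are illustrative, not a standard bill-of-materials schema): before training, check the dataset against the hash its producer recorded, and refuse to proceed on a mismatch. A signature check over the bill of materials itself, e.g. with Sigstore, would sit alongside this.

```python
import hashlib

def verify_dataset(data: bytes, dbom: dict) -> None:
    """Refuse to train on data that doesn't match its bill of materials."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != dbom["sha256"]:
        raise ValueError(
            f"dataset hash mismatch: expected {dbom['sha256']}, got {actual}"
        )

data = b"price,sqft\n350000,1200\n"           # illustrative CSV contents
dbom = {
    "source": "https://example.org/housing-data-2017.csv",  # hypothetical URL
    "description": "2017 housing sales",
    "sha256": hashlib.sha256(data).hexdigest(),  # recorded by the producer
}

verify_dataset(data, dbom)              # untampered data passes silently
try:
    verify_dataset(data + b"x", dbom)   # tampered data is rejected
except ValueError as e:
    print("rejected:", e)
```

Pinning data by hash also resolves the two-versions-of-the-same-CSV confusion from the previous scenario: either the bytes match the bill of materials or they don’t.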
Data and machine learning pipeline security
Training a model is a direct analog to building a software artifact. Efforts like SLSA stress the importance of moving builds off of developer laptops to locked-down build machines which are secured just as well as production systems. This is doubly important for ML systems, where the “source” (data) might contain sensitive customer information that could cause catastrophe if leaked.
Typical. Data scientists copy a production database containing personally identifiable information (PII) to their laptop in order to train their models.
One employee uses this as an opportunity to stalk their ex. After clicking a link in an email, one data scientist accidentally installs malware that exfiltrates all of these users’ credit card numbers to a cybercrime gang’s servers. When another data scientist travels through a foreign airport, a security agent confiscates their laptop and clones the hard drive; the foreign government now knows the health status of millions of people.
Better. Models are trained on a secure training platform. Data scientists don’t have direct access to any of this data: the credentials live on this platform.
User data is much safer: nobody can just flip through it. A newly deployed model never regresses because somebody trained it with an old version of PyTorch.
The scientists grumble because their job just got much more tedious. Without any access to the training process, they accidentally deploy a model with terrible performance because it was trained against the wrong dataset. When a state-sponsored attacker compromises the Jenkins box, which hasn’t received a security update in 5 years, they’re able to backdoor the output model in a SolarWinds-style attack; the attack persists for months.
Best. Training happens in a hosted, production-caliber environment like Outerbounds Platform. Only reviewed workloads can see the data.
What should we do about it?
While the picture painted above looks a little bleak, our prognosis is positive. Just as the maturity of software development skyrocketed over the past 20 years, the maturity of developing and deploying ML systems will soon take off. Efforts from governments, open source communities, and for-profit companies are currently articulating risks and developing best practices and tooling to support those practices. For instance:
- NIST’s AI Risk Management Framework (RMF) provides organization-level guidance on how to invest to mitigate these risks.
- The OpenSSF’s proposed AI/ML working group aims to develop standards and best practices, with a focus on ML supply chains that include open source (pretty much all of them).
- The Sigstore project enables signing both traditional software and ML data, source, and models, and Chainguard Enforce can guarantee that production workloads have all required signatures.
- Outerbounds developed an open-source tool, Metaflow, that’s easier than doing things the “typical” way but supports the best practices highlighted above.
- Using Chainguard’s Wolfi as a base operating system for running notebooks, training workflows, or deployed models minimizes the attack surface.
If you’d like to learn more, attend the upcoming Supply Chain Security in Machine Learning Fireside Chat on Thursday, August 24 at 12pm EST, featuring Chainguard’s Research Scientist Zack Newman and Outerbounds’ Head of Developer Relations Hugo Bowne-Anderson.