Transforming Proteomics: How MSAID Leveraged Outerbounds to Revolutionize Drug Discovery
FTE reduction: Freed up to half of one engineer's time, previously spent managing infrastructure.
Faster onboarding: Reduced pipeline onboarding from as long as two weeks to one or two days.
Faster data pipelines: Reduced processing time from 10 days to 1 day.
MSAID, a proteomics company based in Germany, uses machine learning to advance the analysis of mass spectrometry data for protein identification and quantification. The company’s mission is to simplify and accelerate protein analysis, which has significant implications for cancer research and biological discoveries. Led by Siegfried Gessulat, the company’s co-founder and Head of Machine Learning, MSAID combines expertise in computational biology and machine learning to build a data analysis pipeline. This pipeline integrates machine learning models at various stages to improve accuracy and efficiency, helping identify proteins more precisely from biological samples.
As MSAID's datasets and computational requirements expanded, the team faced increasing challenges with managing their data processing pipelines. They realized the need for a scalable solution to handle the complexity and growing size of the data while maintaining a high level of accuracy.
Scaling Complex Proteomics Workflows
Before adopting Outerbounds, MSAID managed their proteomics workflows using custom scripts, which quickly became unwieldy. The datasets they worked with, while plentiful, came from various labs that used different methods for collecting mass spectrometry data. Harmonizing these datasets was crucial but difficult because each lab collected and processed its data differently.
Additionally, the pipelines themselves were becoming increasingly complex. Running data processing jobs on a single machine often took 10 days, delaying MSAID’s ability to integrate new datasets as quickly as they were published by the research community.
One of the biggest operational pain points was the amount of time and resources spent on infrastructure management. Gessulat explained that one engineer had to spend up to 50% of their time maintaining a Kubernetes cluster that kept the pipelines running. That time could have been better spent improving the data processing and machine learning models.
“We had a lot of custom scripts and needed to track jobs manually. Processing a dataset could take 10 days on a single machine. We knew we had to find a more efficient solution,” Gessulat said, describing the challenges they faced before finding a scalable platform.
Why Outerbounds Was the Right Fit
Initially, MSAID began using Metaflow, Netflix’s open-source framework, to bring structure to their workflows. Metaflow helped by making jobs more observable and easier to manage, while also improving scheduling. However, as MSAID’s needs grew, they turned to Outerbounds for its managed platform, which further streamlined their operations by handling infrastructure management and deployment directly within their AWS account.
“Outerbounds allowed us to focus on what we’re good at—cleaning data and training models—without spending time managing infrastructure. One of the key benefits is that we no longer need to worry about maintaining our Kubernetes cluster,” Gessulat shared.
A key feature of Outerbounds that stood out for MSAID was its seamless integration with Metaflow, allowing the team to continue working in Python while gaining access to a more robust infrastructure solution. Another significant benefit was the interactivity provided by Outerbounds. The team could run pipelines, troubleshoot errors in real time by stepping into Jupyter notebooks, inspect data mid-process, and rerun jobs from where they had failed without starting from scratch.
“If a pipeline fails, I can jump into a Jupyter notebook, explore the data right at the point of failure, and then fix it. That interactivity is what sets Outerbounds apart from other platforms,” Gessulat explained.
This kind of flexibility and interactivity in managing complex workflows was a game-changer for MSAID, allowing them to scale their machine learning efforts without being burdened by infrastructure challenges.
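To make the "rerun from where it failed" idea concrete, here is a minimal plain-Python sketch. It is not MSAID's pipeline and not how Metaflow or Outerbounds implement the feature (Metaflow exposes this behavior through its `resume` command). The step names, the JSON checkpoint files, and the simulated failure are all hypothetical, chosen only to illustrate the principle: persist each step's output so that a rerun skips the steps that already succeeded.

```python
import json
import tempfile
from pathlib import Path

def run_pipeline(steps, state_dir, data=None):
    """Run steps in order, checkpointing each output; skip steps already done."""
    state_dir = Path(state_dir)
    state_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, func in steps:
        checkpoint = state_dir / f"{name}.json"
        if checkpoint.exists():
            # Step succeeded in an earlier run: reuse its stored output.
            data = json.loads(checkpoint.read_text())
            continue
        data = func(data)
        checkpoint.write_text(json.dumps(data))
        executed.append(name)
    return data, executed

# Hypothetical steps; "normalize" fails once to simulate a mid-pipeline error.
fail_once = {"normalize": True}

def load_intensities(_):
    return [2.0, 4.0, 6.0]

def normalize(values):
    if fail_once["normalize"]:
        fail_once["normalize"] = False
        raise RuntimeError("simulated failure in normalize")
    top = max(values)
    return [v / top for v in values]

def summarize(values):
    return {"n": len(values), "mean": sum(values) / len(values)}

STEPS = [
    ("load", load_intensities),
    ("normalize", normalize),
    ("summarize", summarize),
]

def demo():
    """First run fails at 'normalize'; the second run resumes from that step."""
    fail_once["normalize"] = True
    with tempfile.TemporaryDirectory() as state_dir:
        try:
            run_pipeline(STEPS, state_dir)
        except RuntimeError:
            pass  # in practice, this is where one would inspect the stored data
        return run_pipeline(STEPS, state_dir)

if __name__ == "__main__":
    result, executed = demo()
    print("steps executed on resume:", executed)
    print("result:", result)
```

On the second run, only the failed step and the steps after it execute, while the first step's output is read back from its checkpoint. That stored intermediate state is also what makes it possible to inspect the data at the point of failure before fixing the problem.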
A 10x Improvement in Processing Speed and 50% Reduction in Overhead
The impact of adopting Outerbounds was immediate and measurable. One of the most significant improvements was a 10x reduction in the time it took to process datasets. Where jobs had previously taken 10 days to complete, they now finished in just one day. This allowed MSAID to process new datasets faster than they were being published, keeping the company ahead of the rapidly growing volume of data in their field.
“We now process data faster than it's being published, which is a huge achievement. Our pipelines run 10 times faster, reducing processing time from weeks to a day,” Gessulat noted.
At the same time, Outerbounds eliminated the need for MSAID’s engineers to spend time managing the underlying infrastructure. The engineer who had previously spent up to 50% of their time maintaining the Kubernetes cluster was relieved of that work entirely, freeing up resources for more valuable tasks like data preparation and model optimization.
“Our infrastructure engineer now spends zero time managing the platform. That time has been fully repurposed for more valuable tasks,” Gessulat emphasized.
Another key improvement was in onboarding new team members. With the old system of custom scripts, it took one to two weeks for a new hire to get familiar with the pipelines and start contributing. Now, with the structured workflows provided by Metaflow and Outerbounds, new team members can be productive in one or two days.
“Onboarding is now reduced to just one or two days, allowing new team members to be productive almost immediately,” Gessulat added.
Improved Versioning and Tracking
One of the most valuable features of Outerbounds for MSAID has been its versioning and tracking capabilities. In the field of proteomics, being able to audit and trace data back to specific pipeline versions is crucial. With Outerbounds, MSAID now has full control over their pipelines, with the ability to track exactly which version of the pipeline produced a specific result.
“We have full control and audit capabilities now. We can trace back to any version of our pipelines and see exactly what data was used, what models were run, and what the results were,” Gessulat shared.
This level of transparency and control has increased MSAID’s confidence in their results, enabling them to deliver more accurate and reliable machine learning models for identifying proteins in biological samples.
Outerbounds has been transformative for MSAID, delivering measurable improvements across key areas of their operations. With a 10x increase in pipeline processing speed, the company is now able to stay ahead of the influx of new data from the research community. They have also eliminated the need for infrastructure management, freeing up valuable engineering resources. Onboarding times have been reduced from weeks to days, and the overall productivity of the team has significantly improved.
“We’ve gained speed, scalability, and efficiency with Outerbounds, allowing us to focus on what really matters—advancing our research and improving our models,” Gessulat concluded.
Looking ahead, MSAID plans to further leverage the automation features available in Outerbounds. Specifically, the team is exploring ways to chain multiple workflows together to reduce manual intervention even more. With Outerbounds handling their infrastructure needs, MSAID is well-positioned to continue scaling their machine learning models and improving their contributions to cancer research and other critical areas of proteomics.