Enhancing Cancer Detection with Metaflow ML Pipelines
Inference time reduced from 6–8 weeks to 1 hour
Expanded GPU resources from 1 local machine to 50 GPUs in the cloud
Inference job failures eliminated: jobs run reliably without manual restarts
Valar Labs, a small but impactful startup in the medical field, is dedicated to improving cancer pathology by leveraging machine learning to assist doctors and pathologists. Mike Bentley Mills, a machine learning engineer at Valar Labs, was tasked with transforming the company's infrastructure. He has a decade of experience in the machine learning space and a career-long passion for building efficient pipelines. His career, which spans work at Apple as well as biotech and drone companies, reflects a consistent mission: replace ad-hoc, unscalable solutions with structured and automated machine learning pipelines. Now, Mike applies his expertise to revolutionize how cancer detection models run at Valar Labs.
The Challenges Faced: Manual ML Workflows and Operational Bottlenecks
Before Mike joined Valar Labs, the machine learning infrastructure was typical of early-stage startups lacking refined processes. Engineers manually transferred code and data between laptops and local servers. "We had two servers, different Ubuntu versions, and no version control for environments. People just logged in with the same credentials, and no one knew which of the 14 conda environments were used for production," Mike explains.
The data was stored in shared folders with inconsistent permissions. “They were running models on laptops and syncing the code with rsync,” Mike recalls. “It was a ticking time bomb.”
On top of these operational inefficiencies, running inference on medical images was painfully slow. "It was taking them 6–8 weeks to process 1,000 slides of patient data on local machines," Mike adds. This bottleneck meant delayed diagnosis, which directly impacted patient care.
Finding the Solution: Metaflow for Structured, Scalable ML Pipelines
When Mike joined, he immediately recognized the need for a scalable solution. Drawing from his experience at previous companies, he introduced Metaflow, Netflix’s open-source framework for managing machine learning workflows.
The decision to adopt Metaflow came after Mike heard the team's pain points: lack of data management, environment inconsistencies, and time-consuming, error-prone manual processes. "I introduced Metaflow to the CTO by explaining that it could organize their pipelines, automate environment setup, version data, and offer cloud scalability," Mike explains.
A key part of Metaflow’s appeal was its versioning capabilities. “Version control, both for code and the environment, is a game-changer. If a model behaves differently, you can track exactly what changed in the code and the environment,” Mike says. This mattered at Valar Labs, where environment inconsistencies were producing different results across servers.
Metaflow’s ease of scaling from local to cloud was a major selling point. “With Metaflow, we could easily run tests locally, and then switch to large-scale cloud instances to process thousands of slides using 50 GPUs. It eliminated the need for manual cloud setup, which used to be a huge pain.”
Results: A Drastic Improvement in Workflow Efficiency
Once Metaflow was integrated into the Valar Labs infrastructure, the results were transformative. Using Metaflow’s orchestration and scaling capabilities, the team cut inference time by more than 99%. “We went from six to eight weeks down to an hour,” Mike says. "Metaflow’s ability to turn cloud instances on and off as needed was a key factor in this improvement."
This meant that what once took weeks of GPU compute time could now be completed in just one hour, without the need for manual intervention. Developers could focus on iterating and improving models, instead of worrying about operational overhead. "Our developers can now take a model, run it on large datasets in the cloud, and get results back in an hour—this dramatically accelerates our development cycle," Mike explains.
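As a rough check on the size of that improvement, the quoted figures can be worked through directly. The sketch below assumes the original 6–8 week runs were continuous wall-clock time (24 hours a day, 7 days a week), which the source does not state; the function names are illustrative only.

```python
# Back-of-the-envelope check of the quoted speedup.
# Assumption (not from the source): the 6-8 week jobs ran around the clock.
HOURS_PER_WEEK = 7 * 24


def reduction_percent(before_hours, after_hours):
    """Percentage of wall-clock time removed by going from before to after."""
    return 100.0 * (1.0 - after_hours / before_hours)


def speedup(before_hours, after_hours):
    """How many times faster the new run is."""
    return before_hours / after_hours


if __name__ == "__main__":
    for weeks in (6, 8):
        before = weeks * HOURS_PER_WEEK
        print(
            f"{weeks} weeks = {before} h -> 1 h: "
            f"{reduction_percent(before, 1):.2f}% less time, "
            f"{speedup(before, 1):.0f}x faster"
        )
    # 6 weeks = 1008 h -> 1 h: 99.90% less time, 1008x faster
    # 8 weeks = 1344 h -> 1 h: 99.93% less time, 1344x faster
```

Under that assumption, the change is a speedup of roughly three orders of magnitude, so a reduction of "99%" understates it.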
In addition to the time savings, the automation of environment management also brought immense value. "Metaflow's handling of environments was crucial. The ability to encapsulate the exact version of every package meant that inconsistent results across machines became a thing of the past."
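To illustrate why exact version pins make results comparable across machines, here is a minimal sketch that hashes a set of package pins into a short fingerprint. It is plain standard-library Python, not Metaflow's own mechanism (Metaflow exposes dependency pinning through decorators such as `@conda` and `@pypi`); the function name, package names, and versions are hypothetical.

```python
import hashlib


def environment_fingerprint(pins):
    """Return a short, order-independent hash of exact package pins.

    pins: dict mapping package name -> exact version string.
    """
    lines = sorted(f"{name.lower()}=={version}" for name, version in pins.items())
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical environments on three machines (versions are illustrative).
server_a = {"numpy": "1.26.4", "torch": "2.2.0"}
server_b = {"torch": "2.2.0", "numpy": "1.26.4"}  # same pins, different order
server_c = {"numpy": "1.24.0", "torch": "2.2.0"}  # one package drifted

if __name__ == "__main__":
    for name, env in [("a", server_a), ("b", server_b), ("c", server_c)]:
        print(name, environment_fingerprint(env))
```

Two machines with identical pins produce the same fingerprint, while a single drifted package changes it, which is the kind of mismatch that was hard to spot across the 14 unversioned conda environments.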
Mike also implemented flow-based pipelines for processing medical images. “The first flow I built was for converting large medical slides into a common format, all processed in the cloud. This freed up local resources for more critical tasks,” Mike elaborates. The second major implementation was an inference system built for massively parallel processing, allowing Valar Labs to analyze thousands of slides in parallel across cloud GPUs, further reducing time to results.
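The parallel inference pattern described here is a fan-out followed by a join: split the slide list into shards, process each shard on its own worker, then merge the results. The sketch below shows that shape in plain standard-library Python, with a thread pool standing in for cloud GPU workers. It is not Valar Labs' code and does not use Metaflow's API; in Metaflow the same shape is typically expressed as a `foreach` branch followed by a join step. The `run_inference` function and slide IDs are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, n_workers):
    """Split items round-robin into at most n_workers non-empty shards."""
    shards = [items[i::n_workers] for i in range(n_workers)]
    return [s for s in shards if s]


def run_inference(slide_id):
    """Hypothetical stand-in for the real model; returns a dummy record."""
    return {"slide": slide_id, "score": len(slide_id) / 100.0}


def process_shard(slide_ids):
    """Work done by one worker: run inference on every slide in its shard."""
    return [run_inference(s) for s in slide_ids]


def fan_out(slide_ids, n_workers=50):
    """Fan out shards to workers, then join the per-shard results."""
    shards = shard(slide_ids, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_shard = list(pool.map(process_shard, shards))
    # Join step: flatten the per-shard lists into one result list.
    return [record for results in per_shard for record in results]


if __name__ == "__main__":
    slides = [f"slide_{i:04d}" for i in range(1000)]
    results = fan_out(slides)
    print(len(results), "slides processed")
```

With 1,000 slides and 50 workers, each shard holds 20 slides, so the wall-clock time is set by the slowest shard and not by the total slide count.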
By introducing Metaflow, Mike Bentley Mills not only fixed operational inefficiencies at Valar Labs but also laid the groundwork for future scalability. The team went from a processing time of 6–8 weeks to 1 hour, automated environment handling, and gained the ability to run on 50 GPUs simultaneously.
"We’ve already seen massive improvements, and this is just the beginning. As we continue to build out more flows and refine our process, I’m confident that we’ll be able to scale even further," Mike says. The adoption of Metaflow has been an unqualified success, and Mike is eager to keep pushing the boundaries of what's possible in cancer pathology.