
Streamlining Data Science Operations with Metaflow: How DTN Transformed Productivity

1,011%

Increase in workflow scale: from 1,800 to 20,000 steps per month.

4x

Faster model deployment: from over a month to less than a week.

50%

Reduction in operational bottlenecks: the team manages double the number of models since adopting Metaflow.

Name
DTN
Founded
1984
Location
Bloomington, Minnesota
Industry
Software and Technology
Focus on Machine Learning
Predicting critical events like fuel shortages, crop yields, and storm risks to optimize resource planning and minimize disruptions.

DTN is a leading provider of data-driven insights across several industries, including agriculture, weather, and fuel logistics. With a focus on delivering timely and accurate data, DTN helps businesses make critical decisions by predicting events like fuel shortages, crop yields, and storm risks. These forecasts allow their customers to plan for and mitigate potential disruptions, ensuring operational continuity.

Tyler Potts, DTN's Data Science Platform Lead, joined the company with the mission to overhaul its data science infrastructure. With a background in neuroscience and data science, Tyler's journey into machine learning began at a bootcamp, where he discovered a passion for data-driven scientific computing; he then spent several years as a consultant in that space. After working in government contracting, where he helped build the first data science platform for the Air Force's F-16 project, Tyler took on the challenge of transforming DTN's machine learning operations.

When Tyler arrived, DTN’s primary machine learning goal was to harness their vast stores of data to generate actionable insights. Their customers rely on these insights to make informed business decisions, whether predicting weather-induced power outages or optimizing fuel deliveries. With 20 data scientists on his team, Tyler’s role was to ensure that the infrastructure could support their projects at scale and efficiently integrate machine learning into the organization’s daily operations.

The Challenge: Stable Chaos in Data Science Operations

Upon joining DTN, Tyler was greeted by an operational environment that he described as "stable chaos." The team had a functioning system, but it was far from optimal. The data scientists were using a homegrown solution built around Docker containers to package Python code and environments, which were then run on AWS Elastic Container Service (ECS). While this setup allowed them to perform machine learning tasks at scale, it came with several critical drawbacks.

One of the biggest issues was time inefficiency. Every time a data scientist made a change to their code or environment, it would take a minimum of 15 minutes to rebuild the Docker container and push it to AWS. This delay added unnecessary friction to the iterative process of machine learning experimentation. As Tyler explained, “Data scientists don’t want to be building Docker containers—they want to be building models.”

Additionally, debugging in this environment was a cumbersome and frustrating process. There was no intuitive UI to help track down errors, and the logs were hidden behind cryptic AWS UIDs, making it difficult to identify which containers had failed and why. This lack of transparency slowed down the team’s ability to troubleshoot issues and resulted in delays.

The team was also dealing with hand-rolled solutions that had been developed internally. While these solutions worked, they came with significant maintenance overhead. Tyler recognized the inefficiency of maintaining custom infrastructure when more mature and well-supported tools existed in the wider data science community. “Hand-rolled solutions require a ton of maintenance, and you don’t leverage the lessons and cuts and bruises that the wider community has already experienced,” Tyler noted.

Finally, with minimal operational support, the team was unable to scale its operations effectively. Before Tyler joined, DTN had only three operational staff supporting 20 data scientists, and this imbalance left a vacuum in infrastructure support.

The Solution: Introducing Metaflow

To solve these challenges, Tyler turned to Metaflow, an open-source machine learning framework developed by Netflix. Tyler had previously encountered Metaflow at a SciPy conference and had been impressed with how it simplified machine learning workflows. When he joined DTN, he quickly realized that Metaflow could replace their inefficient, homegrown solutions and provide a much more robust and scalable infrastructure for the data science team.

Tyler’s decision-making process was straightforward: “I had seen Metaflow in action, and it was clear that it would be better than what we were using. My goal was to implement the tool I was confident in, get it into the hands of data scientists as quickly as possible, and iterate from there.”

The key checklist of requirements for the new system included:

  • Environment Management: The new platform had to provide automated Conda environment management without requiring data scientists to manually package code.
  • Scalability: The solution had to run on AWS with the ability to scale up workloads and leverage different types of compute resources.
  • Debugging and Tracking: It had to offer an easy-to-use UI that would allow data scientists to track their experiments, monitor logs, and quickly troubleshoot issues.
  • Open-source and Community-driven: Tyler wanted to avoid proprietary systems and tap into the support and rapid innovation of open-source communities.

Metaflow checked all of these boxes and offered additional features that greatly appealed to the team. The automatic tracking of artifacts, the user-friendly UI, and seamless integration with Kubernetes made it the perfect fit for DTN’s evolving needs.

From Stable Chaos to Scalable Success

The results of adopting Metaflow at DTN were nothing short of transformative. The team quickly began to see the benefits of the platform as they transitioned away from their hand-rolled infrastructure.

  1. Dramatically Increased Scale: Before Metaflow, DTN was processing around 1,800 steps per month in their machine learning workflows. After implementation, this number skyrocketed to 20,000 steps per month—an increase of over 1,000% in workflow capacity. Tyler highlighted that they recently crossed the milestone of 2 million steps run on Metaflow, underscoring the scale at which the platform now operates.
  2. Reduced Deployment Time: One of the most significant improvements was the reduction in model deployment time. Before Metaflow, it could take over a month to build and deploy a model to production. With Metaflow’s streamlined workflow, this time has been cut to less than a week. In some cases, models can be deployed in just a few days. This faster deployment time has allowed DTN to respond more quickly to business needs and customer demands.
  3. Lower Maintenance Overhead: Metaflow’s robust infrastructure significantly reduced the need for manual maintenance. Tyler emphasized how easy it is to manage once it’s up and running: “It just works. I don’t have to worry about it too much, and it makes our users so happy.” This has freed up operational resources to focus on more strategic tasks rather than firefighting infrastructure issues.
  4. Improved User Experience: The data science team at DTN embraced Metaflow with enthusiasm. The platform’s intuitive UI made it easy for them to track and monitor experiments, while the ability to scale compute resources without needing to involve the operations team was a huge win. According to Tyler, “Once a few of our early adopters started using it, everyone just naturally stopped using the old system.”

With the successful integration of Metaflow, DTN was able to build an end-to-end machine learning pipeline that took models from experimentation all the way to production without bottlenecks or delays.

DTN’s adoption of Metaflow has fundamentally improved the efficiency and scalability of its machine learning operations. By moving away from inefficient, homegrown solutions, DTN was able to increase the scale of its workflows by over 1,000%, reduce deployment times by over 75%, and double the number of models it manages in production. Metaflow’s intuitive interface, combined with its ability to seamlessly scale on AWS, has empowered DTN’s team of data scientists to focus on delivering impactful insights rather than wrestling with infrastructure challenges.

Metaflow’s flexibility, ease of use, and strong community support have made it an invaluable tool for DTN’s machine learning operations, enabling them to meet the evolving demands of their business and deliver critical data-driven insights to their customers across multiple industries.