Recently, we have seen a lot of talks and posts from major organizations across many verticals building machine learning and AI platforms with Metaflow as their backbone.
We have always believed that there can't be a one-size-fits-all approach to ML and AI, just as there are so many ways to use software, more generally. To this end, we’ve found it super useful to see how sophisticated organizations build real-life ML/AI systems that help them meet their business needs and match their products, team structures, and technical requirements.
We were excited to hear recent presentations by Amazon Prime Video and DTN, along with an interview with Realtor.com, so we’ve summarized all three below. Watch the videos for the full story!
Behind the Screen: How Amazon Prime Video ships RecSys models 4x faster
Amazon Prime Video’s Nissan Pow recently spoke about how they use Metaflow in their ML platform to speed up experimentation when developing and deploying machine learning models. Prime Video perhaps needs no introduction: it is Amazon's subscription video-on-demand streaming and rental service, offered as a standalone service or as part of Amazon's Prime subscription. Prime Video has over 200 million users worldwide.
Content Ranking at Prime Video
Nissan’s team is responsible for managing content ranking on the Prime Video homepage. This ranking plays a crucial role in the content recommendations seen by users when they log in.
They've been using Metaflow in their production environment for over a year to improve the speed and efficiency of their machine learning model development process. The primary goal of optimizing experimentation velocity is to allow for faster iteration through different ideas and algorithms in machine learning systems.
The Challenges: Too Many APIs, Research vs Prod, Repetitive Work
The challenges they encountered fell into three main areas:
- Lack of Established APIs: Each machine learning model had its own API, making it challenging to swap models. Adapting models to the existing interfaces in their pipelines was difficult.
- Disparity Between Research and Production Environments: Research used Python code, while production used Scala code and different libraries, leading to potential discrepancies in model performance between the two environments.
- Repetitive Manual Effort: Due to the complexity of the systems they’re building, scientists and engineers had to put in a lot of manual effort in tasks like code reviews, setting up pipelines, and monitoring, which resulted in slower progress and onboarding for new team members.
Configurable Modular Flows with Metaflow
To address these challenges, they introduced a solution called "configurable modular flows" with the following components:
- They use Metaflow for their flows, and all flows share a single set of parameters that can be set through a variety of config files. Configuration files dictate everything in the machine learning pipeline, from data loading and feature processing to model choice and hyperparameters. Standardized, common flows are offered out of the box, so users can simply select a flow and configure it through their config files instead of writing extensive code.
- They developed a CLI utility, "ether," which enables users to select a flow and run it using the provided configurations. This simplifies the process further.
- Configurations are handled using OmegaConf, a hierarchical YAML-based configuration library. It can merge and override configurations, providing a flexible way to manage settings.
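To make the "merge and override" idea concrete, here is a minimal sketch of hierarchical config merging. It uses only the Python standard library as a stand-in, so it is not Prime Video's code and not OmegaConf's implementation (OmegaConf offers `OmegaConf.merge` for this purpose). The function `merge_configs`, the config keys, and all values are hypothetical.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested dictionaries are merged key by key; for any other value,
    the entry from `override` replaces the one from `base`.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults that might ship with a standard flow.
base_config = {
    "flow": "ranking_train",
    "data": {"source": "s3://example-bucket/features", "sample_rate": 1.0},
    "model": {"name": "gbdt", "params": {"max_depth": 6, "learning_rate": 0.1}},
}

# Hypothetical per-experiment override: only the changed keys are listed.
experiment_config = {
    "data": {"sample_rate": 0.1},
    "model": {"params": {"learning_rate": 0.05}},
}

config = merge_configs(base_config, experiment_config)
```

The experiment file stays small because it lists only what differs from the defaults; everything else, such as the data source and `max_depth`, is inherited from the base config.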
The Modularity of Metaflow at Prime Video
The approach is modular, and they have defined APIs for common steps in the machine learning pipeline, making models easily swappable. Evaluation and inference are also standardized for ease of use. Users can customize steps within Metaflow tasks, allowing flexibility in case they need to make specific adjustments. They also addressed issues with running tasks on the same machine and transitioned to using a separate batch queue for shorter tasks, improving resource utilization.
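One way to picture "defined APIs that make models swappable" is a shared interface plus a registry keyed by a name taken from the config. The sketch below is an illustration under that assumption, not the team's actual design; `RankingModel`, `MODEL_REGISTRY`, `build_model`, and both toy models are hypothetical names.

```python
from typing import Callable, Dict, List, Protocol, Sequence


class RankingModel(Protocol):
    """The common interface every swappable model is expected to satisfy."""

    def fit(self, features: Sequence[Sequence[float]], labels: Sequence[float]) -> None: ...

    def predict(self, features: Sequence[Sequence[float]]) -> List[float]: ...


class MeanBaseline:
    """Toy model: predicts the mean training label for every row."""

    def __init__(self) -> None:
        self.mean = 0.0

    def fit(self, features, labels) -> None:
        self.mean = sum(labels) / len(labels) if labels else 0.0

    def predict(self, features) -> List[float]:
        return [self.mean for _ in features]


class FirstFeatureScorer:
    """Toy model: scores each row by its first feature times a scale factor."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale

    def fit(self, features, labels) -> None:
        pass  # Stateless: nothing to learn.

    def predict(self, features) -> List[float]:
        return [self.scale * row[0] for row in features]


# Maps the model name used in a config file to a constructor.
MODEL_REGISTRY: Dict[str, Callable[..., RankingModel]] = {
    "mean_baseline": MeanBaseline,
    "first_feature": FirstFeatureScorer,
}


def build_model(model_config: dict) -> RankingModel:
    """Instantiate the model named in the config with its hyperparameters."""
    factory = MODEL_REGISTRY[model_config["name"]]
    return factory(**model_config.get("params", {}))


def train_and_score(model_config: dict, train_x, train_y, test_x) -> List[float]:
    """Shared pipeline code: it never needs to know which model it is running."""
    model = build_model(model_config)
    model.fit(train_x, train_y)
    return model.predict(test_x)
```

Because `train_and_score` depends only on the interface, swapping models becomes a one-line config change, for example from `{"name": "mean_baseline"}` to `{"name": "first_feature", "params": {"scale": 2.0}}`.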
The deployment process involved using an internal code repository, which required some adjustments when deploying from different environments. They also plan to simplify onboarding and validation for configurations, and to keep improving the system by adding features such as SageMaker support, enabling distributed training, and making onboarding more efficient. This is part of their effort to streamline the transition from research to production, allowing more teams to leverage the solution they have developed.
Fuelling Decisions: How DTN Powers Gas Pricing and Data Science Collaboration
Data Science and Machine Learning at DTN
DTN recently spoke about how they utilize Metaflow for data science collaboration and decision-making. DTN specializes in subscription-based services for the analysis and delivery of real-time weather, agricultural, energy, and commodity market information. They primarily focus on three key areas: weather, fuel, and agriculture. They use data science and ML to work on various projects within these domains:
- For instance, one project is "Storm Impact Analytics," where DTN predicts potential weather-related impacts on infrastructure, enabling customers to allocate resources more efficiently.
- Another project centers on "Crop Yield Modeling," which involves ingesting satellite imagery and past harvest data to create models for estimating crop yields. This data aids in agricultural decision-making.
- DTN is also involved in "Fuel Demand Modeling," focusing on the last leg of fuel transport, such as from tankers to gas stations. Using data, they can predict demand fluctuations over time, as exemplified by changes in demand during the COVID-19 pandemic.
How Metaflow is Used at DTN
Metaflow is a central component of DTN's data science work and is accessed through JupyterHub, which offers a preconfigured computational environment:
- Users can gain access by enabling their email and using Single Sign-On (SSO) credentials. This environment provides shared resources and access to substantial computing power.
- To ensure a consistent setup, Metaflow is configured via environment variables. This configuration simplifies the process for data scientists and keeps the setup uniform across different users.
- These environment variables are preset to match DTN's specific requirements. The setup also handles Kubernetes integration, so users can interact seamlessly with Kubernetes clusters upon login.
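As a rough illustration of configuring Metaflow through environment variables, the sketch below builds a set of defaults that a notebook environment could apply at login. Metaflow reads settings from `METAFLOW_`-prefixed variables; the variable names here are ones we believe Metaflow recognizes, but they should be checked against the installed version. All values are hypothetical, as are the helper names `metaflow_environment` and `apply_defaults`. None of this is DTN's actual setup.

```python
import os
from typing import Dict, MutableMapping


def metaflow_environment(namespace: str) -> Dict[str, str]:
    """Return hypothetical Metaflow defaults for a shared notebook environment."""
    return {
        # Store artifacts in S3 (hypothetical bucket).
        "METAFLOW_DEFAULT_DATASTORE": "s3",
        "METAFLOW_DATASTORE_SYSROOT_S3": "s3://example-bucket/metaflow",
        # Track runs through a shared metadata service (hypothetical URL).
        "METAFLOW_DEFAULT_METADATA": "service",
        "METAFLOW_SERVICE_URL": "https://metaflow.example.internal",
        # Kubernetes namespace in which remote steps would run.
        "METAFLOW_KUBERNETES_NAMESPACE": namespace,
    }


def apply_defaults(
    env: MutableMapping[str, str], defaults: Dict[str, str]
) -> MutableMapping[str, str]:
    """Fill in missing variables without overwriting values already set."""
    for key, value in defaults.items():
        env.setdefault(key, value)
    return env


# Work on a copy so this sketch does not change the real process environment.
session_env = apply_defaults(dict(os.environ), metaflow_environment("data-science"))
```

Using `setdefault` keeps the setup uniform for most users while still letting an individual override a value, for instance to point at a test datastore.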
The presentation also showcased the GitLab CI (Continuous Integration) pipeline, which deploys Metaflow projects to Argo Workflows for production. The pipeline was designed in collaboration with data scientists to meet their needs.
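For a sense of what such a CI job might run, the sketch below assembles the Metaflow command that deploys a flow to Argo Workflows (`python <flow file> argo-workflows create`). It is an assumption about the general shape of the deployment step, not DTN's pipeline. The helper `argo_deploy_command` and the file name `forecast_flow.py` are hypothetical, and the `--production` option is assumed to apply only to flows that use Metaflow's `@project` decorator.

```python
import sys
from typing import List


def argo_deploy_command(flow_file: str, production: bool = False) -> List[str]:
    """Build the command a CI job could run to deploy a flow to Argo Workflows."""
    command = [sys.executable, flow_file]
    if production:
        # Assumes the flow uses @project, which adds this top-level option.
        command.append("--production")
    command += ["argo-workflows", "create"]
    return command


# Hypothetical flow file; a CI job could pass the result to
# subprocess.run(command, check=True) so a failed deploy fails the job.
command = argo_deploy_command("forecast_flow.py", production=True)
```

Keeping the command in one helper means the same code path can deploy a user branch during review and the production deployment after merge.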
Kubecost is used to monitor costs. It tracks resource consumption for each project and maps tags to Kubernetes objects, facilitating cost tracking. This allows data scientists to view resource costs, CPU and RAM usage, and more.
Data Scientist Adoption of the Stack
The adoption of these tools varies among data scientists. Some are early adopters and embrace the tools enthusiastically, while others might find them more challenging to work with. A collaborative approach has resulted in the development of a streamlined workflow that accommodates both usage patterns and ensures consistency and efficiency in deploying Metaflow projects for data science at DTN.
Case Study: How Realtor.com uses Metaflow
As we’ve just seen how Amazon Prime Video uses Metaflow and AWS, we thought we’d highlight another large-scale example running on AWS, which we published last year: Realtor.com, a long-time Metaflow user! Russell Brooks, Principal Machine Learning Engineer at Realtor.com, shared insights about their usage of Metaflow and its impact on machine learning at Realtor.com.
Types of ML Questions and Business Questions
Realtor.com tackles a wide range of machine learning (ML) use cases, covering various modeling techniques. These use cases include:
- Consumer-facing website optimizations like recommendation systems, search ranking, consumer segmentation, and personalization to help users find relevant homes.
- Image and natural language processing (NLP) models that enhance property content.
- Forecasting housing trends, strategic planning, sales/pricing optimization models, surrogate experimentation metrics for A/B testing, and dashboards.
- Another aspect is creating models to match consumers with the best-suited real estate professionals based on their needs.
The ML models produce billions of predictions daily, impacting various aspects of the business, from empowering consumers in their home-buying journey to aiding real estate professionals.
ML Stack Before Metaflow
Before adopting Metaflow, Realtor.com had a mixed set of AWS services manually connected:
- This setup led to challenges, including the lack of robust infrastructure-as-code solutions.
- Creating state machines was laborious, and there wasn't built-in support for Batch jobs as standalone steps.
- Managing state transfer and data sharing between steps needed improvement.
Some early tech stack decisions have stood the test of time. They include:
- A cloud-first approach, creating a dedicated AWS account for the ML/Data Science team, and setting clear rules of engagement for batch/real-time model deployments.
- The emphasis on simplicity and vertical scaling over distributed approaches proved to be effective.
In terms of model deployments:
- Batch deployments cover the majority of ML use cases, with ML/DS teams taking end-to-end ownership.
- Real-time model deployments are typically handed off to the engineering teams responsible for the specific services.
Adopting New ML Technologies
When considering new ML technologies, various factors come into play, such as developer experience, community support, self-service capabilities, and cost-effectiveness.
Open source options are often considered, and the presence of a free tier for experimentation can be crucial.
Why Metaflow was Chosen
The introduction of Metaflow piqued interest due to its Pythonic abstractions and clean interfaces for wiring up AWS services.
Key benefits included:
- metadata tracking,
- streamlined state transfer, and
- a reduced gap between development and production.
Metaflow proved to be a great fit for existing ML infrastructure and offered a user-friendly experience.
Russell also talked about what his team discovered about Metaflow when they started using it:
- Metaflow's AWS abstractions and workflow orchestration components significantly improved productivity.
- Metadata tracking, debugging workflows, and namespace isolation became essential for smoother development.
- The Metaflow UI, cards, and metadata tracking have proven valuable for monitoring and reporting.
Impact of Metaflow at Realtor.com
Metaflow streamlined the development experience and reduced experimentation friction, leading to faster pipeline development.
Collaboration across teams improved, and Metaflow became a backbone for ML projects, enabling rapid prototyping and consistent coordination of models.
Realtor.com benefits from quicker iterations and value delivery through Metaflow's user-friendly and efficient features.
Want to share your ML platform story?
We host casual, biweekly Metaflow office hours on Tuesdays at 9am PT. Join us to share your story, hear from other practitioners, and learn with the community.
Also be sure to join the Metaflow community Slack where many thousands of ML/AI developers discuss topics like the above daily!