Recently, we have seen a flurry of talks and posts from major organizations building machine learning platforms with Metaflow as their backbone.
We have always believed that there can't be a one-size-fits-all approach to ML and AI, just as there isn't one right way to build software. Rather than being prescriptive, it is useful to learn how sophisticated organizations build real-life ML/AI systems that match their business, products, team structures, and technical requirements.
We summarize three recent presentations by Delivery Hero, Ramp, and Dell below. Watch the videos for the full story!
Delivery Hero: Quick Delivery, Quicker ML
Delivery Hero recently spoke about how they leverage Metaflow for their ML platform. A major player in the food delivery industry, Delivery Hero stands out for its multinational reach, operating in over 70 countries across four continents.
The company centralizes data processing to support its various consumer platforms, handling tasks such as order management, driver location tracking, and data warehousing.
Challenges with Airflow
Delivery Hero faced challenges with Apache Airflow, a widely used orchestration tool. While it was effective for handling routine data processing tasks, it presented limitations for data scientists.
Airflow's rigid structure made it difficult for data scientists to experiment with new ideas quickly. To test a concept, they had to create new repositories, configure continuous integration and continuous delivery (CI/CD) pipelines, and adhere to specific workflow structures.
Adoption of Metaflow
To address these challenges, Delivery Hero adopted Metaflow. They described it as a tool that streamlines the path from idea to production, letting data scientists experiment both locally and in the cloud. That flexibility is essential for a company operating in dozens of countries with widely varying data scales.
Migration from Airflow
As Delivery Hero transitioned to Metaflow, the approach to existing projects varied. Newer projects embraced Metaflow for both experimentation and production. For older projects, due to resource limitations and the complexity of migrating existing workflows, Metaflow was initially used for experimentation while Airflow continued to handle production, although there are plans to explore full migration.
Metaflow introduced workflow simplifications, including the shift to individual repositories for each project. This eliminated the need for a monolithic repository that caused bottlenecks and slowed down development.
Metaflow also simplified authentication, namespace management, and overall workflow efficiency. Unlike Airflow's static DAGs, Metaflow supports dynamic DAGs whose shape is determined at runtime, improving flexibility and resource utilization.
In the end, Delivery Hero addressed its data processing challenges and adopted Metaflow for a more efficient and flexible machine learning workflow.
Ramp: Accelerating ML Development to Simplify Finance
Ramp's mission is to simplify finance for businesses, including corporate cards, expense management, bill payments, and accounting integrations.
Peyton McCullough at Ramp recently wrote about how they’re leveraging Machine Learning and Metaflow to simplify finance and help thousands of businesses control spend, save time, and automate busy work.
Machine learning at Ramp
Ramp considers machine learning a core competency. They apply ML in several critical domains:
- Credit Risk: Predicting the likelihood of Ramp customers becoming delinquent, which is essential for their risk management.
- Fraud Detection: Identifying potentially fraudulent card transactions to enhance security.
- Growth: Predicting the likelihood of converting potential leads into customers, contributing to business expansion.
- Product Enhancement: Suggesting appropriate accounting codes for transactions, and streamlining financial processes.
- Ramp Intelligence: Introducing AI products designed to further enhance their financial services.
Challenges in Model Deployment
Ramp's initial experience with ML model deployment had several limitations, which hindered their development process:
- Manual Deployment: They had to manually push pipeline code to a vendor's platform before running it, causing delays.
- Slow Job Execution: Even small datasets took over an hour to process, slowing down development cycles.
- Flakiness: Job failures with unclear causes were common, leading to frustration among data scientists.
- Docker Issues: The platform didn't work well with Docker containers, resulting in unintelligible errors and limited features for containerized workloads.
Adopting Metaflow for improved efficiency
To address these challenges and improve development velocity, Ramp adopted Metaflow. The results were impressive: Ramp deployed eight additional models within just ten months, a significant improvement over their previous development timelines.
Key benefits of using Metaflow
Metaflow brought several advantages to Ramp's ML deployment process:
- Simplified Deployment: Flows were automatically deployed, reducing manual intervention.
- Dependency Management: Dependencies were Dockerized and standardized, enhancing reliability.
- Resource Configuration: Resources could be configured per step, allowing data scientists to allocate resources as needed.
- Enhanced Debugging: Metaflow made it easier to retry individual steps and surface logs in the user interface.
- Sharing Results: Data scientists could link to individual runs and share results using Metaflow cards, improving collaboration and visibility.
Integration with AWS and simplified workflow
Ramp's integration of Metaflow with AWS-managed services facilitated its use:
- AWS Batch: Used for scheduling and running jobs on AWS-managed ECS clusters.
- Step Functions: Employed for flow execution management, with a MetaflowOperator created to simplify triggering flows from Airflow and enhance log visibility.
- Metaflow's simplicity allowed data scientists to self-service their ML workflows, leading to increased productivity and a streamlined development process.
Dell: Enterprise-Grade Full Stack ML Platform
Dell needs no introduction. For the past four decades, Dell has been providing computers and related services to consumers and enterprises.
Thiago Ramakrishnan from Dell and Savin Goyal, CTO of Outerbounds, recently gave a talk at PyData Seattle on Enterprise-grade Full Stack ML Platforms.
They opened by discussing how software needs to both run reliably and produce correct results, even when ML and data are involved. This is easier said than done, as there is a tradeoff between doing rapid, iterative science, and building robust software. They discussed how Metaflow solves the scientist side of things and allows them to cross the “productivity gap”.
The infrastructure side needs to be solved by a platform engineer or a managed platform. Thiago asks, “do we build or buy?”, as advanced infrastructure for ML needs to be built and maintained. In a large enterprise like Dell, containers, compliance, and security are key: “how do you make sure that everything happens in a secure and safe way such that all the compliance concerns are well taken care of?”
ML outside academia
Machine learning has dramatically transformed industries. Clearly, it is no longer confined to academic research but has pervasive applications in real-world scenarios.
Thiago and Savin discussed concrete examples to illustrate the extent of this transformation, such as the use of machine learning for real-time pricing strategy adjustments and the prediction of supply chain disruptions. They also discussed the shift from traditional academic research projects to deploying machine learning models that offer immediate business value.
Needs of a large enterprise
The presentation highlighted key ML infrastructure concerns of a large enterprise:
- Data exploration: a foundational step in the machine learning process. Data scientists need tools like JupyterHub, RStudio, or VS Code to explore and experiment with diverse data sources.
- Data consolidation: key in large enterprises where data is distributed across many sources.
- Iterative development: developing machine learning models is an ongoing, evolving process. Iteration is necessary to continuously improve the accuracy and performance of ML models, making them suitable for real-time decision-making.
- Stability: platform engineers are needed to build and maintain the underlying infrastructure, guaranteeing stable operations.
- Security and compliance: enterprises need proactive vulnerability detection, as the Log4j vulnerability illustrated. Compliance with regulations and data governance are critical, extending beyond model explainability to the traceability and accountability of the data used to train models.
Want to share your ML platform story?
We host casual, biweekly Metaflow office hours on Tuesdays at 9am PT. Join us to share your story, hear from other practitioners, and learn with the community.
Also be sure to join the Metaflow community Slack, where thousands of ML/AI developers discuss topics like these daily!