Metaflow is designed to support a diverse spectrum of data-intensive use cases, from operations research and classical statistics to computer vision and large language models. We don't claim to have a one-size-fits-all solution to all these use cases, but Metaflow provides a solid foundation covering projects' fundamental needs: data, compute, orchestration, and versioning.
While the foundation is common, every company has its own team structures, infrastructure and compliance requirements, and libraries that encapsulate their unique domain knowledge. It is natural and recommended to build such company-specific customizations on top of (and around) Metaflow, so data scientists and other users can be readily productive in their home environment.
Previously, we heard about Netflix's media-related extensions for Metaflow. This time we hear from Yudhiesh Ravindranath, an MLOps Engineer at MoneyLion, who recently sat down with us to discuss how his team uses Metaflow, among many other great tools, and how it has impacted machine learning more generally at MoneyLion.
Yudhiesh and his team have built custom components on top of Metaflow, which help them operate it effectively in their environment. These ideas should be readily applicable to other users of AWS Batch and Step Functions, so keep on reading!
How is Metaflow deployed at MoneyLion?
At MoneyLion our AI team operates within two AWS accounts: Staging and Production. An important aspect of the design not included in the diagram above is that across both environments we allow users read access to data in Production (such as access to a Production Snowflake Warehouse).
With the entire design we wanted to accomplish the following:
- Have the main data store be shared across both environments:
- AWS S3 and PostgreSQL are the main parts that are deployed within the Production account.
- Due to each account restricting access via a VPC, we had to deploy an AWS RDS Proxy in the Staging account (more information can be found here).
- We keep separate AWS DynamoDB tables, as DynamoDB is only used for saving state within an AWS Step Functions run, and enabling cross-account access to a DynamoDB table was causing some issues for our Metaflow IAM group.
- This setup has the added benefit of deduplicating data across both environments within a flow. If we had a separate data store for each environment, data used within both a Staging run and a Production run would not be deduplicated, increasing storage costs (more information about deduplication within the data store can be found here).
- Enforce strict control within flows that are run in Production while giving data scientists complete freedom to prototype and run flows within Staging.
- Staging acts as a playground environment in which data scientists can prototype and run their flows ad hoc, but the moment they need to officially deploy something, they have to deploy it to the Production account.
- Flows in Staging are automatically tagged with a branch of the same name, whereas flows in Production are deployed with the --production flag (see the sketch after this list).
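To make the Staging/Production split concrete, here is a minimal sketch of how Metaflow's @project decorator supports branch-specific and production deployments. The project and flow names are placeholders, not MoneyLion's actual conventions.

```python
from metaflow import FlowSpec, project, step


# Placeholder project name; @project namespaces deployments by project and branch.
@project(name="example_project")
class ExampleFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    # Staging:    python example_flow.py --branch <branch-name> step-functions create
    # Production: python example_flow.py --production step-functions create
    ExampleFlow()
```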
What tools have you built on top of Metaflow and what do they do?
We’ve built 3 main tools on top of Metaflow:
- Watcher, a monitoring, observability, and alerting platform;
- Soteria, a CI/CD tool; and
- Metaflow Cookiecutter Template, a templated code generator.
Tell us about Watcher (your Monitoring, Observability & Alerting Platform)
Metaflow Watcher consists of the following microservices:
- Entropy Kaos System,
- Overseer, and
- Argus.
Entropy Kaos System
While working on a POC for Metaflow, I ran into an issue with AWS Batch jobs getting stuck in the RUNNABLE state due to incorrect resources being requested within a flow.
In the image above, I have two flows running on a schedule. In PreprocessingExtractDataFlow, I purposely tried provisioning resources that did not exist within the AWS Batch Compute Environment. This resulted in jobs being added to the AWS Batch Job Queue that would never start. After about a day or two, the queue filled up with these jobs and blocked jobs from other flows from running, since the queue was still waiting to run the corrupted jobs. This issue can arise for other reasons as well; AWS has a guide on debugging it here.
The solution was actually pretty simple: periodically check whether a job has been in RUNNABLE for longer than a fixed threshold and, if so, cancel it. The image below shows an example of it in action, canceling a job that was stuck for too long. I am still tuning the threshold, but so far two hours works well for my team. I will go into the system design in the section on the Overseer.
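As a rough illustration of the approach (a simplified sketch, not the exact implementation described here), a watchdog along these lines could be written with boto3; the job queue name and the threshold below are placeholders:

```python
import time

import boto3

batch = boto3.client("batch")

JOB_QUEUE = "metaflow-job-queue"   # placeholder queue name
THRESHOLD_SECONDS = 2 * 60 * 60    # cancel jobs stuck in RUNNABLE for over 2 hours


def cancel_stuck_jobs():
    now_ms = int(time.time() * 1000)
    paginator = batch.get_paginator("list_jobs")
    for page in paginator.paginate(jobQueue=JOB_QUEUE, jobStatus="RUNNABLE"):
        for job in page["jobSummaryList"]:
            # createdAt is a Unix timestamp in milliseconds
            stuck_for = (now_ms - job["createdAt"]) / 1000
            if stuck_for > THRESHOLD_SECONDS:
                batch.cancel_job(
                    jobId=job["jobId"],
                    reason=f"Stuck in RUNNABLE for {stuck_for / 3600:.1f} hours",
                )


if __name__ == "__main__":
    cancel_stuck_jobs()
```

Run on a schedule (for example from a scheduled Lambda or a cron job), a check like this keeps the queue from silently filling up with jobs that will never start.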
Overseer
When running scheduled pipelines, it is important to be alerted if any pipeline fails. This is a feature that Metaflow lacks, so I created the Metaflow Overseer to solve it. Below is an example of an alert sent to a Slack channel, tagging the flow owner and including important information about the failure (Log URL is a link to the failure in the Metaflow UI).
Since the Entropy Kaos System and the Overseer are connected, the design below shows how they work together: an AWS Batch job stuck in a RUNNABLE state for too long triggers its cancellation and an alert to the Slack channel.
One flaw with this design, which I found out later, is that if you have a large foreach with many jobs running and some of them fail, it sends an alert for each failed job, causing alert fatigue. I initially chose to trigger the alerting on the Batch Job State Change event instead of the Step Functions Execution Status Change event because the Step Functions event lacks sufficient information to create the message body. A fix I am working on is to trigger on the Step Functions Execution Status Change and then drill down into all the AWS Batch jobs within the flow, building the message body from their information, but I have not completed this yet.
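For readers wiring up something similar, a minimal sketch of the alerting path might look like the following: a Lambda handler triggered by an EventBridge rule on Batch Job State Change events that posts to a Slack incoming webhook. The webhook URL and message format are illustrative, not the Overseer's actual implementation.

```python
import json
import os
import urllib.request

# Placeholder: a Slack incoming-webhook URL stored as a Lambda environment variable
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def lambda_handler(event, context):
    """Triggered by an EventBridge rule on AWS Batch 'Job State Change' events."""
    detail = event["detail"]
    if detail["status"] != "FAILED":
        return  # only alert on failures

    message = {
        "text": (
            f"Batch job {detail['jobName']} failed\n"
            f"Job ID: {detail['jobId']}\n"
            f"Reason: {detail.get('statusReason', 'unknown')}"
        )
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```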
Argus
When managing the infrastructure of a machine learning platform, it is essential to have visibility into its overall health by monitoring key system and business metrics. Savin Goyal, the CTO of Outerbounds, shared this repository, which solves the problem by creating AWS CloudWatch dashboards from AWS Batch Job State Change and AWS ECS Instance (De)registration events, which I had to port over to Terraform. Here is an example dashboard that tracks AWS Batch Job State Changes:
The repository provides a detailed overview of the architecture here, but I made a simpler version of it below:
Processing and sending the data from the data sources is accomplished via a Step Functions State Machine; below is an example of a State Machine for one of the dashboards:
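As a simplified illustration of the kind of processing involved (the actual pipeline is the Step Functions state machine shown above), the core step is turning a Batch Job State Change event into a CloudWatch metric data point; the namespace, metric, and dimension names below are made up:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_job_state_change(event):
    """Publish an AWS Batch 'Job State Change' event as a CloudWatch metric.

    The namespace, metric, and dimension names here are illustrative only.
    """
    detail = event["detail"]
    cloudwatch.put_metric_data(
        Namespace="Metaflow/Batch",
        MetricData=[
            {
                "MetricName": "JobStateChange",
                "Dimensions": [
                    {"Name": "Status", "Value": detail["status"]},
                    {"Name": "JobQueue", "Value": detail["jobQueue"]},
                ],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    )
```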
Tell us about Soteria (your CI/CD tool)
At MoneyLion we have a strict separation of Staging and Production AWS accounts, aided by CI/CD systems that automate the process of running and deploying programs to either account. I decided to give our data scientists full control over running workflows within the Staging environment, but to enforce the use of a CI/CD system when pushing workflows to Production. With that in mind, I created Metaflow Soteria, which automates each stage in the diagram below once a pull request is made to push a flow to production. Metaflow Soteria is a CLI application that runs within our CI/CD pipeline.
Demo:
1) When a pull request is made, the first part of the pipeline, the Sanity Checks, is run. Here Metaflow Soteria lints the flow and ensures that the latest Staging run of the flow was successful (a sketch of this check follows these steps).
2) The Latest Staging Run Successful check fails; after fixing the issue and confirming that the latest Staging run was indeed successful, we rerun the sanity checks via the run sanity checks command, which now passes.
3) The flow is now ready to be deployed to AWS Step Functions. Using another command, productionize flow, Soteria creates the State Machine and triggers it to run.
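The latest-Staging-run check mentioned in step 1 can be sketched with Metaflow's client API; the function below is illustrative rather than Soteria's actual code, and it assumes the client is configured against the metadata service and namespace that track Staging runs:

```python
from metaflow import Flow
from metaflow.exception import MetaflowNotFound


def latest_staging_run_successful(flow_name: str) -> bool:
    """Return True only if the most recent run of the given flow finished
    successfully, as reported by Metaflow's client API."""
    try:
        latest = Flow(flow_name).latest_run
    except MetaflowNotFound:
        # The flow has never been registered or is not visible in this namespace
        return False
    return latest is not None and latest.successful


# Example usage with one of the flows mentioned above:
# latest_staging_run_successful("PreprocessingExtractDataFlow")
```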
Tell us about Metaflow Cookiecutter Template (your templated code generator)
The Metaflow Cookiecutter Template was created with the following goals:
- Make it as easy as possible for a data scientist to create a Metaflow project
- Enable consistent and strict naming conventions of flows, projects, and tags
Demo:
1) Initialize the repository with some project values using Cruft (which enables template updates):
a) View the produced templated repository:
b) View the templated flow code (a sketch of what such a template might look like follows these steps):
2) Run a flow via a bash script with an easy interface for running a particular command:
3) View the flow in the Metaflow UI (the project name is easily trackable and the flow is tagged):
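MoneyLion's template itself is not shown here, but a Cookiecutter-style flow template enforcing project and flow naming conventions might look roughly like the sketch below; the cookiecutter variables (project_slug, flow_name, description) are placeholder assumptions that would be filled in during step 1.

```python
# {{ cookiecutter.flow_name }}.py -- rendered by Cookiecutter/Cruft into a runnable flow;
# all cookiecutter variables below are placeholders for the values prompted in step 1.
from metaflow import FlowSpec, project, step


@project(name="{{ cookiecutter.project_slug }}")  # enforces a consistent project name
class {{ cookiecutter.flow_name }}(FlowSpec):
    """{{ cookiecutter.description }}"""

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    {{ cookiecutter.flow_name }}()
```

Because the project name, flow name, and tags all come from the template prompts, every generated flow follows the same conventions and shows up predictably in the Metaflow UI, as in step 3.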
Join our community
If you are interested in customizing Metaflow for your environment, join Yudhiesh and over 2,000 other data scientists and engineers on our community Slack! Also, to hear experiences from other experts in the field, take a look at our Fireside Chats.
If you have already built interesting customizations or infrastructure around Metaflow that you’d like to share, please ping @hugo on our Slack so we can help spread the word!
Start building today
Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions - nothing is off limits.