Recently, we have seen many talks from major organizations across many verticals that are building machine learning and AI platforms with Metaflow as their backbone.
We were excited to hear three recent presentations by Adept.ai, Autodesk, and Epignosis, all of which focused on the need for high-performance computing for AI and ML. We’ve summarized them below but do watch the videos for the full story! This is also timely as we just released new Metaflow features for distributed high-performance computing in general and large-scale training for AI in particular.
Crafting General Intelligence: LLM Fine-tuning with Metaflow at Adept.ai
Adept.ai’s Rahul Parundekar recently spoke about how Adept uses Metaflow for fine-tuning LLMs. Adept is an ML research and product lab building general intelligence by enabling humans and computers to work together creatively.
We’ve recently been doing a lot of work with Metaflow on fine-tuning LLMs and using RAG so this is timely. We also recently taught a workshop at the Generative AI Summit in Austin on “LLMs in Practice: A Guide to Recent Trends and Techniques”, which you can check out here!
LLMs, Metaflow, and Kubernetes at Adept
Rahul spoke about Adept.ai's use of Metaflow with Argo on Kubernetes for fine-tuning LLMs and for monitoring infrastructure. This setup lets them manage machine learning and LLM workflows efficiently.
Adept.ai's primary goal is to develop transformers capable of interacting with various applications on a computer, and they are a leader in multimodal models. Rahul presented an example of this technology through a browser extension that can perform tasks based on user instructions, showcasing the potential of their model.
Initially, Adept.ai used Slurm, a cluster management and job scheduling system, for training and fine-tuning machine learning models. Slurm is efficient and a standard for general HPC workload management, though it was not designed specifically for machine learning and AI tasks; even so, it served Adept well in the early stages.
Why was Metaflow so useful?
A significant challenge arose when Adept.ai aimed to create self-serve fine-tuning pipelines. These pipelines involved multiple configurations, leading to complex code and various execution steps. This complexity made it difficult for new team members to make changes effectively.
To address this challenge, they explored migrating their workloads to Kubernetes, evaluating four workflow solutions, two of which involved Kubernetes with Argo for workflow orchestration. They chose Metaflow with Argo primarily because Argo is a robust, well-tested orchestrator, a choice that aligned well with their desire to run their workloads on Kubernetes.
Challenges with transitioning… and solutions!
Adept.ai encountered several challenges during the transition. Firstly, they needed to simplify their codebase, which was complex and had configuration files scattered across multiple locations. Untangling the code and identifying logical steps to plug into Metaflow workflows took significant effort.
Containerization of their codebase proved challenging due to the large code repository, which had grown in size over time. Storing versions of models and data within the codebase inflated the repository size to almost 1GB, complicating the containerization process.
They developed an internal Adept CLI to streamline job launches. This CLI not only simplified job execution but also provided information about the submitted jobs, such as their status and log locations.
In addition to fine-tuning and evaluation, Adept.ai uses Metaflow and Argo for monitoring their infrastructure. This includes handling housekeeping jobs, maintaining nightlies, and managing infrastructure components like Slurm.
While they have successfully implemented Metaflow and Argo on Kubernetes for self-serve fine-tuning and other workflows, challenges remain: queuing workflows that require more nodes than are currently available, and creating and launching parameterized workflows seamlessly.
In conclusion, Rahul emphasized that Metaflow combined with Argo on Kubernetes is effective for their workflow needs, enabling self-serve fine-tuning pipelines and infrastructure monitoring.
Building a GenAI-Ready ML Platform with Metaflow at Autodesk
Autodesk is a global software provider renowned for its design solutions across various industries including architecture, manufacturing, education, 3D art, entertainment, and more.
Autodesk has been collaborating with us at Outerbounds to develop the next generation of their large-scale ML and AI infrastructure, Autodesk Machine Learning Platform (AMLP). Riley Hun joined Metaflow Office Hours to talk about their recent work.
Choosing Metaflow for Full-Stack Machine Learning and AI
Autodesk's AMLP team evaluated various orchestration tools and chose Metaflow as the platform's primary foundation due to its versatility. Metaflow serves multiple purposes within the platform, including data orchestration, compute management, versioning, and more. They loved that Metaflow is more than a point solution and cuts across the full stack of ML and AI!
One of the primary reasons for choosing Metaflow is its compatibility with AWS services. Autodesk runs on AWS, and Metaflow integrates seamlessly with various AWS-managed services. In particular, it leverages AWS Batch for scaling out workflows, making it easier for users to build and train machine learning models at scale.
Metaflow's ability to create reproducible experiments is a significant advantage for Autodesk. It ensures that all aspects of a workflow, including flow runs, data snapshots, and artifacts, are tracked and maintained. This reproducibility is crucial for data lineage and tracking experiments.
The Supreme Importance of Developer UX
AMLP at Autodesk places a strong emphasis on developer user experience (UX) to encourage the adoption of their tools. They understand that some users already have their bespoke ML platforms, and for them to transition, AMLP needs to offer a seamless and appealing UX. Their platform, accessible through SageMaker Studio, provides a basic UI for user authentication and management, and it enables users to spin up their personal Studio instances. These instances are organized by team and are controlled via SSO.
AMLP has integrated Metaflow with SageMaker Studio to simplify user interactions. Users can create and orchestrate ML pipelines, run reproducible experiments, and monitor workflows via the Studio UI. Additionally, users can import their data, which is accessible by both Studio and Metaflow.
Autodesk has rolled out this training infrastructure to 50 users. A custom, security-hardened image equips each instance with all the tools needed to run Metaflow jobs, and users can run Metaflow directly from a notebook cell.
AMLP's infrastructure runs on AWS services, primarily AWS Batch for handling compute resources. They have ensured the security of each component and incorporated alerting through Slack channels. A custom GPU monitoring GUI is available for tracking GPU and CPU utilization, and it is integrated into SageMaker Studio, offering a unified user experience.
Autodesk’s Ray-Metaflow Integration for Massive Compute Needs and Distributed Training
AMLP at Autodesk has not only used Metaflow for managing typical ML workflows but has also integrated it with Ray, a distributed computing framework. This allows users to create Ray clusters using AWS Batch multi-node parallel jobs. The results of these tests show that training jobs can scale effectively across GPU nodes. What's more, this works with PyTorch, DeepSpeed, Hugging Face, and TensorFlow!
In addition to integrating Ray, Autodesk has addressed performance bottlenecks associated with internode communication in distributed training by leveraging the AWS networking feature called Elastic Fabric Adapter (EFA). Users can attach multiple EFA network devices to EC2 instances with A100 GPUs, significantly improving communication and data transfer efficiency.
Autodesk has also implemented a high-performance parallel file system for simultaneous access from multiple nodes in the HPC cluster. They've integrated Metaflow with FSx for Lustre using Metaflow's host-volume mounting feature, which lets the batch decorator specify host volumes and their target mount paths.
Autodesk's integration of Metaflow, AWS Batch, EFA, and FSx for Lustre is detailed in their tech blog, showcasing their dedication to creating a high-performance, scalable, and secure machine learning platform for their organization.
Media Transcoding for 10 Million Users and Beyond with Metaflow at Epignosis
Epignosis’ Chrysostomos Galatoulas spoke about how they’ve effectively harnessed Metaflow to transcode media files for a global base of over 11 million users. Epignosis is a leading company in the learning technology sector, providing products for corporate training and workforce management.
The Challenges of Large-Scale Media Transcoding
Epignosis deals with a large volume of video and audio content, processing around 5,000 jobs (files uploaded by customers) daily. The company offers unlimited storage to its customers, which can lead to significant storage costs. Their existing solution, developed in 2013, was showing signs of strain as the company grew, and they had to deal with the challenge of maintaining it and the risk associated with external contractors.
Why Metaflow and How Epignosis Uses It
The company needed a new solution for media transcoding that was cost-effective, scalable, and easy to maintain. They chose Metaflow after being inspired by its use in other projects, realizing its potential beyond machine learning tasks.
Their use of Metaflow is relatively straightforward. They re-encode each uploaded file to H.264, ensure it is compatible with various platforms, downscale video if needed, and add watermarks, all achievable with FFmpeg. Their setup relies on AWS services, using CloudFormation to maintain infrastructure.
Lambda functions, invoked through AWS API Gateway, trigger containerized workflows that run FFmpeg. To lower costs, they use spot instances; interruptions are not a problem because tasks are short-running and hence cheap to retry.
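To make the transcoding step concrete, here is a minimal sketch of assembling such an FFmpeg invocation in Python. The file names, watermark placement, and encoder settings are illustrative, not Epignosis' actual pipeline:

```python
import shlex

def build_transcode_cmd(src, dst, max_height=720, watermark=None):
    """Build an FFmpeg command that re-encodes to H.264, downscaling if the
    input is taller than max_height and optionally overlaying a watermark."""
    # Quotes keep the comma inside min() from splitting the filter chain;
    # -2 rounds the width to an even number while preserving aspect ratio.
    scale = f"scale=-2:'min({max_height},ih)'"
    if watermark:
        # Scale first, then overlay the watermark in the bottom-right corner.
        graph = f"[0:v]{scale}[v];[v][1:v]overlay=W-w-10:H-h-10"
        inputs = ["-i", src, "-i", watermark]
        filters = ["-filter_complex", graph]
    else:
        inputs = ["-i", src]
        filters = ["-vf", scale]
    return ["ffmpeg", "-y", *inputs, *filters,
            "-c:v", "libx264", "-preset", "medium",
            "-c:a", "aac", dst]

cmd = build_transcode_cmd("upload.mov", "out.mp4", watermark="logo.png")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True) would execute it (requires ffmpeg on PATH)
```

Because each file is one short-running, self-contained command, a failed spot instance simply means Metaflow retries the step with the same inputs.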
Benefits of Metaflow
Epignosis has experienced several benefits from using Metaflow:
- Their workflows became easy to replicate, making conventions predictable.
- They quickly went from the idea stage to an alpha version of their solution, within just two weeks.
- Onboarding new team members and developers became straightforward.
- They have had no unexpected issues since implementing Metaflow.
- Infrastructure maintenance is no longer a concern, as it works seamlessly with AWS resources.
- Development is fast, and flows can be exported as AWS Step Functions, which aids their transition to cloud resources.
- The ease of development and enthusiasm among developers have been infectious and valuable for the company.
Future Projects with Metaflow
- Epignosis is exploring the extension of their media transcoding capabilities to include redubbing content in other languages for video localization.
- They are using Metaflow for stable diffusion projects (see our previous post about Metaflow with Stable Diffusion).
- The company is considering using Metaflow for large-scale report creation, which would be beneficial for customers with a large number of learners requiring detailed progress reports.
Want to share your ML platform story?
We host casual, biweekly Metaflow office hours on Tuesdays at 9am PT. Join us to share your story, hear from other practitioners, and learn with the community.
Also be sure to join the Metaflow community Slack where many thousands of ML/AI developers discuss topics like the above daily!