
Distributed Computing

Metaflow supports distributed training of large models, such as large language models, spanning multiple instances, as well as other distributed computing algorithms that operate across a cluster of instances.

info

Open-source Metaflow supports simple distributed computing use cases on AWS Batch. If you want to use distributed training in other clouds or you have demanding use cases, Outerbounds Platform supports distributed training with Metaflow out of the box.
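As a quick orientation, a multi-node workload in Metaflow is expressed with the @parallel decorator together with a num_parallel argument on the transition into the step, and each task can discover its rank via current.parallel. The sketch below is a minimal illustration; the flow name, resource sizes, and step bodies are placeholders rather than a complete training example:

```python
from metaflow import FlowSpec, step, batch, parallel, current


class HelloDistributedFlow(FlowSpec):

    @step
    def start(self):
        # num_parallel requests a gang of nodes that all run the next step together.
        self.next(self.train, num_parallel=4)

    # Resource numbers are illustrative; size them for your instance types.
    @batch(gpu=1, cpu=8, memory=32000)
    @parallel
    @step
    def train(self):
        # Each node knows its rank and the size of the gang, which is what
        # frameworks like MPI or torchrun need to wire up the cluster.
        print(f"node {current.parallel.node_index} of {current.parallel.num_nodes}")
        self.next(self.join)

    @step
    def join(self, inputs):
        # Consolidate results from all nodes here.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloDistributedFlow()
```

The rest of this page focuses on preparing an AWS Batch environment that can run such gangs of tasks.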

Configuring distributed training on AWS Batch

This section will guide you through creating a new AWS Batch compute environment for your multi-node jobs. In general, it is a good practice to separate these environments from the environments that run other jobs in your Metaflow deployment, due to the drastically different resource requirements.

The steps are:

  1. (Recommended) Read the AWS Batch Multi-node documentation.
  2. Set up a security group for passwordless SSH.
  3. Create a new Batch compute environment
    • Attach the security group from step 2
    • Select the desired EC2 instance types
    • (Optional) Configure a cluster placement group in one availability zone
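
Once a job queue is attached to the new compute environment, you can route multi-node steps to it explicitly so that they stay off your default queue. In the sketch above, only the decorators on the train step change; the queue name below is a placeholder for whatever you call your multi-node job queue:

```python
# "multinode-queue" is a placeholder: use the name of the AWS Batch job queue
# attached to the compute environment created in the steps above.
@batch(queue="multinode-queue", gpu=1, cpu=8, memory=32000)
@parallel
@step
def train(self):
    ...
```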

Security Group for Passwordless SSH

Multi-node frameworks based on MPI require passwordless SSH between nodes, which requires extra configuration of your AWS Batch compute environment. To enable it, go to the EC2 section of the AWS console, in the same AWS region as your Batch deployment, and create a security group that you can later attach to the Batch compute environment where you plan to run multi-node jobs. Then add the following inbound rules to the security group:

  • Choose Add rule
  • For Type, choose All traffic
  • For Source type, choose Custom and paste the ID of this security group itself (this allows all traffic between members of the group)
  • Choose Add rule
  • For Type, choose SSH
  • For Source type, choose Anywhere-IPv4
  • Choose Save rules
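
If you prefer to script this step, the same two inbound rules can be created with boto3. This is only a sketch; the region, VPC ID, and group name are placeholder assumptions to replace with your own values:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder region

# Create the security group in the same VPC as the Batch compute environment.
sg = ec2.create_security_group(
    GroupName="metaflow-multinode-ssh",               # placeholder name
    Description="Passwordless SSH between multi-node tasks",
    VpcId="vpc-0123456789abcdef0",                    # placeholder VPC ID
)
sg_id = sg["GroupId"]

# Rule 1: allow all traffic between members of this security group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)

# Rule 2: allow SSH from anywhere (IPv4).
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```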

When creating a new Batch compute environment for your multi-node jobs, attach this security group, which will require the compute environment to be in the same VPC as the security group.

EC2 instances and GPU considerations

When creating your AWS Batch compute environment, you will need to select the desired EC2 instance types. Many use cases, such as distributed training, call for GPU instances, so make sure to pick from the GPU-enabled EC2 instance types.

(Optional) Advanced use cases may also call for AWS HPC features, such as attaching Elastic Fabric Adapter (EFA) network interfaces to the instances. This requires special handling at this stage (selecting the right instances and AMI configuration), which is beyond the scope of this document. Please refer to the AWS documentation and reach out to the Outerbounds team if you need help.

Configure a Cluster Placement Group

(Optional, highly recommended) Create a cluster placement group for your Batch compute environment in a single Availability Zone and associate it with your compute resources. See the AWS documentation.

The reason to do this is that latency between nodes is much lower when all worker nodes are in the same AWS Availability Zone, which is not guaranteed without a cluster placement group.
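
Both the placement group and the compute environment can be created programmatically as well. The boto3 sketch below is illustrative only; every name, ARN, subnet, and instance type is a placeholder, and the subnet should live in the single Availability Zone you chose for the placement group:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")    # placeholder region
batch = boto3.client("batch", region_name="us-west-2")

# A "cluster" placement group packs instances close together in one AZ.
ec2.create_placement_group(GroupName="metaflow-multinode", Strategy="cluster")

# Reference the placement group and the SSH security group from earlier when
# creating the managed compute environment. All values are placeholders.
batch.create_compute_environment(
    computeEnvironmentName="metaflow-multinode-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 512,
        "instanceTypes": ["p4d.24xlarge"],
        "subnets": ["subnet-0123456789abcdef0"],       # single-AZ subnet
        "securityGroupIds": ["sg-0123456789abcdef0"],  # SG from the section above
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "placementGroup": "metaflow-multinode",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```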

Inter-node communication with AWS Elastic Fabric Adapter (EFA)

Some AWS EC2 instances have network devices built for high-performance computing use cases. To use them, the AWS Batch compute environment that is connected to the AWS Batch job queue needs to install the necessary EFA drivers via a Launch Template. You can see an example CloudFormation template with a Launch Template that installs the EFA drivers here. Then, Metaflow users can specify @batch(efa=8, ...) to attach 8 network interfaces to each node of a multi-node batch job.
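
Assuming the Launch Template above is in place and the compute environment uses EFA-capable instance types, the change on the Metaflow side is confined to the decorators of the multi-node step. For example (the GPU and memory figures are placeholders sized for an EFA-capable instance):

```python
# efa=8 attaches 8 EFA network interfaces to each node of the multi-node job.
@batch(efa=8, gpu=8, memory=160000)
@parallel
@step
def train(self):
    ...
```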

Limitations

AWS Batch Multi-node is not integrated with AWS Step Functions, so you cannot use the Metaflow Step Functions integration when using the @parallel and @batch decorators together. If you need support for production distributed training running on a schedule, you can use the Outerbounds Platform.