Here are key technical details about the Metaflow deployment on Azure.
Azure Resource List
| Category | Resource | Purpose |
|---|---|---|
| Resource Group | n/a | Contains all resources directly created by the Terraform template. |
| Access Control | Azure Active Directory Application | Represents Metaflow as an "app" that will access various Azure resources. |
| Access Control | Azure Active Directory Service Principal | An identity, linked to the AAD application above, that will be used by the Metaflow application. |
| Access Control | Azure Active Directory Service Principal Password | Used by Metaflow to authenticate as the service principal above. |
| Access Control | Role Assignments | Grant the service principal above sufficient access to Azure Blob Storage and the AKS cluster. For specific details and conditions tied to these role assignments, please refer to the source code. |
| Networking | Virtual network | Top-level private virtual network that houses all Metaflow-related Azure resources. |
| Networking | Subnets | There are two: one for the PostgreSQL DB and one for the AKS cluster. Both subnets live within the single virtual network above. |
| Storage | Azure Storage Account | Dedicated storage account for use with Metaflow. |
| Storage | Azure Blob Storage container | Metaflow artifacts are stored here. The container resides within the storage account above. |
| Kubernetes | AKS cluster | Serves two purposes: Metaflow services run on this cluster, and compute tasks from running flows run as pods in it. |
| Kubernetes | AKS cluster node pool | A dedicated, autoscaling node pool for running services and tasks, distinct from AKS's Kubernetes control plane pods. |
| Database | Azure PostgreSQL Flexible Server | A PostgreSQL DB instance for indexing Metaflow run metadata. |
Required Azure Permissions for Deployment
In Azure Active Directory
The Application Administrator role is required (see the Terraform doc), because the Terraform template creates an Active Directory Application and a related Service Principal in the relevant Active Directory (AKA "tenant").
In Azure IAM
Here is a custom role definition (JSON) containing all required permissions to manage the full lifecycle of a Metaflow-on-Azure stack using the Terraform templates. Note: "delete" type permissions are only needed for tearing down the stack ("terraform destroy").
"roleName": "Metaflow admin"
You can create the custom role as follows. In the Azure portal, go to Subscriptions => select the right subscription => Access Control (IAM), then choose the "Create custom role" panel on the right-hand side. Paste in the role definition JSON.
Required Azure Permissions for Running Flows
In the Azure portal, navigate to the relevant storage account and storage container. For example, in the Terraform output for this deployment:
- stobmetaflowminion is the storage account.
- metaflow-storage-container is the storage container.
From the container page, go to "Access Control (IAM)" to assign the role Storage Blob Data Contributor. Note that, in our experience, this role assignment can take several minutes to propagate.
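The same assignment can also be made from the command line instead of the portal. The following is a minimal sketch, assuming the Azure CLI is installed and logged in with rights to create role assignments. The function names container_scope and grant_blob_contributor are hypothetical helpers, and the subscription ID, resource group, and assignee are placeholders to fill in.

```shell
# Build the ARM resource ID of a blob container, which serves as the
# scope of the role assignment.
# Args: subscription ID, resource group, storage account, container name.
container_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s/blobServices/default/containers/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Assign Storage Blob Data Contributor on the given scope.
# Args: assignee (object ID or sign-in name), scope.
grant_blob_contributor() {
  az role assignment create \
    --assignee "$1" \
    --role "Storage Blob Data Contributor" \
    --scope "$2"
}

# Example (values in angle brackets are placeholders):
#   scope=$(container_scope "<subscription-id>" "<resource-group>" \
#             stobmetaflowminion metaflow-storage-container)
#   grant_blob_contributor "<user-or-principal>" "$scope"
```

Scoping the assignment to the container, rather than the whole storage account, mirrors the portal steps above.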
AKS Cluster Access
In the Azure portal, navigate to the relevant AKS cluster. E.g. from this Terraform output:
az aks get-credentials --resource-group rg-metaflow-minion-westus --name aks-ob-metaflow-minion
aks-ob-metaflow-minion is the AKS cluster name.
Navigate to the management page for this cluster, and go to "Access Control (IAM)" to assign these roles:
- Azure Kubernetes Service Contributor Role
- Azure Kubernetes Service Cluster User Role
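Both roles can also be assigned with the Azure CLI. The following is a minimal sketch, assuming the Azure CLI is installed and logged in with rights to create role assignments. The function names aks_scope and grant_aks_access are hypothetical helpers, and the subscription ID and assignee are placeholders.

```shell
# Build the ARM resource ID of an AKS cluster, used as the assignment scope.
# Args: subscription ID, resource group, cluster name.
aks_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.ContainerService/managedClusters/%s\n' \
    "$1" "$2" "$3"
}

# Assign both AKS roles listed above to one assignee; stop on the first failure.
# Args: assignee (object ID or sign-in name), scope.
grant_aks_access() {
  for role in \
    "Azure Kubernetes Service Contributor Role" \
    "Azure Kubernetes Service Cluster User Role"
  do
    az role assignment create --assignee "$1" --role "$role" --scope "$2" || return 1
  done
}

# Example (values in angle brackets are placeholders):
#   scope=$(aks_scope "<subscription-id>" rg-metaflow-minion-westus aks-ob-metaflow-minion)
#   grant_aks_access "<user-or-principal>" "$scope"
```

Instead of assembling the scope by hand, the cluster's resource ID can be read with az aks show --resource-group rg-metaflow-minion-westus --name aks-ob-metaflow-minion --query id --output tsv.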
Azure Services List
We deploy these services in the AKS cluster:
- Metadata service: supports reading and writing metadata. For example:
  - When a flow is running, it POSTs metadata here.
  - The Metaflow Client library calls this service to read metadata.
- UI static service: serves the web UI frontend bundle.
- UI backend service: supports the UI's data needs.
The quickstart k8s manifest published by Argo Workflows spins up the following services:
kubectl get services -n argo
NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
argo-server                   ClusterIP   10.0.26.126    <none>        2746/TCP            32m
httpbin                       ClusterIP   10.0.66.229    <none>        9100/TCP            32m
minio                         ClusterIP   10.0.173.242   <none>        9000/TCP,9001/TCP   32m
postgres                      ClusterIP   10.0.51.199    <none>        5432/TCP            32m
workflow-controller-metrics   ClusterIP   10.0.139.237   <none>        9090/TCP            32m
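All of the services listed above are of type ClusterIP, so none of them is reachable from outside the cluster directly. One common way to reach the Argo server from a workstation is a port-forward. The following is a minimal sketch, assuming kubectl is already configured for this cluster (for example via the az aks get-credentials command shown earlier). The function name argo_port_forward is hypothetical.

```shell
# Forward a local port to the argo-server service, which listens on
# port 2746 in the listing above. Runs until interrupted with Ctrl-C.
# Arg: local port (optional, defaults to 2746).
argo_port_forward() {
  kubectl -n argo port-forward svc/argo-server "${1:-2746}:2746"
}

# Example:
#   argo_port_forward        # local port 2746
#   argo_port_forward 8080   # local port 8080
```

While the forward is running, the Argo UI should be reachable on localhost at the chosen port. Whether it answers over HTTP or HTTPS depends on the Argo Workflows version and configuration.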