
Kubernetes on Azure - Details

Here are key technical details about the Metaflow deployment on Azure.

Architecture Diagram

Azure Resource List

| Category | Resource | Purpose |
|---|---|---|
| Resource Group | n/a | Contains all resources directly created by the Terraform template. |
| Access Control | Azure Active Directory Application | Represents Metaflow as an "app" that will access various Azure resources. |
| Access Control | Azure Active Directory Service Principal | An identity, linked to the AAD application above, that will be used by the Metaflow application. |
| Access Control | Azure Active Directory Service Principal Password | Used by Metaflow to authenticate as the service principal above. |
| Access Control | Role Assignments | Grant the service principal above sufficient access to Azure Blob Storage and the AKS cluster. For specific details and conditions tied to these role assignments, please refer to the source code. |
| Networking | Virtual network | Top-level private virtual network to house all Metaflow-related Azure resources. |
| Networking | Subnets | There are two of these: one houses the PostgreSQL DB, the other houses the AKS cluster. Both subnets live within the single virtual network above. |
| Storage | Azure Storage Account | Dedicated storage account for use with Metaflow. |
| Storage | Azure Blob Storage container | Metaflow artifacts will be stored here. This resides within the storage account above. |
| Kubernetes | AKS cluster | Serves two purposes: Metaflow services run on this cluster, and compute tasks from running flows run as pods in this cluster. |
| Kubernetes | AKS cluster node pool | A dedicated, autoscaling node pool for running services and tasks, distinct from AKS's Kubernetes control plane pods. |
| Database | Azure PostgreSQL Flexible Server | A PostgreSQL DB instance for indexing Metaflow run metadata. |

Required Azure Permissions for Deployment

In Azure Active Directory

The Application Administrator role is required (see the Terraform doc), because the templates create an Active Directory Application and a related Service Principal in the relevant Active Directory (AKA "tenant").

In Azure IAM

Here is a custom role definition (JSON) containing all required permissions to manage the full lifecycle of a Metaflow-on-Azure stack using the Terraform templates. Note: "delete" type permissions are only needed for tearing down the stack ("terraform destroy").

{
  "id": "<REDACTED>",
  "properties": {
    "assignableScopes": [
      "/subscriptions/<YOUR_SUBSCRIPTION_ID>"
    ],
    "description": "",
    "permissions": [
      {
        "actions": [
          "Microsoft.Resources/subscriptions/resourceGroups/read",
          "Microsoft.Resources/subscriptions/resourceGroups/write",
          "Microsoft.Network/privateDnsZones/read",
          "Microsoft.Network/privateDnsZones/write",
          "Microsoft.Network/privateDnsZones/SOA/read",
          "Microsoft.Storage/storageAccounts/read",
          "Microsoft.Network/virtualNetworks/read",
          "Microsoft.Network/virtualNetworks/write",
          "Microsoft.Storage/storageAccounts/write",
          "Microsoft.Network/virtualNetworks/subnets/read",
          "Microsoft.Network/virtualNetworks/subnets/write",
          "Microsoft.Storage/storageAccounts/listkeys/action",
          "Microsoft.Storage/storageAccounts/blobServices/read",
          "Microsoft.Storage/storageAccounts/blobServices/write",
          "Microsoft.Storage/storageAccounts/fileServices/read",
          "Microsoft.ContainerService/managedClusters/read",
          "Microsoft.ContainerService/managedClusters/write",
          "Microsoft.Network/virtualNetworks/subnets/join/action",
          "Microsoft.ContainerService/managedClusters/accessProfiles/listCredential/action",
          "Microsoft.Network/privateDnsZones/virtualNetworkLinks/read",
          "Microsoft.Authorization/roleAssignments/read",
          "Microsoft.ContainerService/managedClusters/agentPools/read",
          "Microsoft.ContainerService/managedClusters/agentPools/write",
          "Microsoft.Network/privateDnsZones/virtualNetworkLinks/write",
          "Microsoft.Authorization/roleAssignments/write",
          "Microsoft.Network/virtualNetworks/join/action",
          "Microsoft.DBforPostgreSQL/flexibleServers/read",
          "Microsoft.DBforPostgreSQL/flexibleServers/write",
          "Microsoft.DBforPostgreSQL/flexibleServers/databases/read",
          "Microsoft.DBforPostgreSQL/flexibleServers/databases/write",
          "Microsoft.DBforPostgreSQL/flexibleServers/configurations/read",
          "Microsoft.DBforPostgreSQL/flexibleServers/configurations/write",
          "Microsoft.ContainerService/managedClusters/listClusterUserCredential/action",
          "Microsoft.Authorization/roleAssignments/delete",
          "Microsoft.DBforPostgreSQL/flexibleServers/databases/delete",
          "Microsoft.ContainerService/managedClusters/agentPools/delete",
          "Microsoft.Storage/storageAccounts/delete",
          "Microsoft.ContainerService/managedClusters/delete",
          "Microsoft.DBforPostgreSQL/flexibleServers/delete",
          "Microsoft.Network/virtualNetworks/subnets/delete",
          "Microsoft.Network/privateDnsZones/virtualNetworkLinks/delete",
          "Microsoft.Network/virtualNetworks/delete",
          "Microsoft.Network/privateDnsZones/delete",
          "Microsoft.Resources/subscriptions/resourceGroups/delete"
        ],
        "dataActions": [],
        "notActions": [],
        "notDataActions": []
      }
    ],
    "roleName": "Metaflow admin"
  }
}
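If you want to see which permissions are only needed for "terraform destroy", the list can be split on the "/delete" suffix. The following is a minimal sketch, not part of the Terraform templates; the helper name split_actions is hypothetical, and the sample list is a small excerpt of the role definition above.

```python
def split_actions(actions):
    """Split role actions into (deploy, teardown) lists, preserving order.

    Assumption: teardown-only permissions are exactly those ending in "/delete",
    which matches the naming used in the role definition above.
    """
    teardown = [a for a in actions if a.endswith("/delete")]
    deploy = [a for a in actions if not a.endswith("/delete")]
    return deploy, teardown


# A few entries copied from the role definition above.
sample = [
    "Microsoft.Storage/storageAccounts/read",
    "Microsoft.Storage/storageAccounts/write",
    "Microsoft.Storage/storageAccounts/delete",
    "Microsoft.Network/virtualNetworks/delete",
]

deploy, teardown = split_actions(sample)
print("deploy:", deploy)
print("teardown:", teardown)
```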

You can create the custom role as follows. In the Azure portal, go to Subscriptions, select the right subscription, open Access Control (IAM), then choose the "Create custom role" panel on the right-hand side and paste in the role definition JSON.
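If you prefer the Azure CLI over the portal, note that the JSON above is in the portal's nested layout, while "az role definition create" reads a flat layout with keys such as "Name", "Actions" and "AssignableScopes". The sketch below converts one to the other. It is an illustration under that assumption, not something shipped with the templates; the function name to_cli_role_definition is hypothetical, and the sample input is abbreviated to a single action.

```python
import json


def to_cli_role_definition(portal_role):
    """Flatten a portal-style role definition into the flat layout read by the az CLI."""
    props = portal_role["properties"]

    def merged(key):
        # Concatenate one field across every entry of "permissions".
        return [item for perm in props["permissions"] for item in perm.get(key, [])]

    return {
        "Name": props["roleName"],
        "Description": props.get("description", ""),
        "Actions": merged("actions"),
        "NotActions": merged("notActions"),
        "DataActions": merged("dataActions"),
        "NotDataActions": merged("notDataActions"),
        "AssignableScopes": list(props["assignableScopes"]),
    }


# Abbreviated sample input; use the full role definition in practice.
portal_role = {
    "properties": {
        "roleName": "Metaflow admin",
        "description": "",
        "assignableScopes": ["/subscriptions/<YOUR_SUBSCRIPTION_ID>"],
        "permissions": [
            {
                "actions": ["Microsoft.Resources/subscriptions/resourceGroups/read"],
                "dataActions": [],
                "notActions": [],
                "notDataActions": [],
            }
        ],
    }
}

cli_role = to_cli_role_definition(portal_role)
print(json.dumps(cli_role, indent=2))
```

The printed JSON can be saved to a file and passed to "az role definition create --role-definition @<file>". Check the result against the Azure CLI documentation before relying on it.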

Required Azure Permissions for Running Flows

Storage Access

In the Azure portal, navigate to the relevant storage account and storage container. For example, given this Terraform output:

METAFLOW_AZURE_STORAGE_BLOB_SERVICE_ENDPOINT=https://stobmetaflowminion.blob.core.windows.net/
METAFLOW_DATASTORE_SYSROOT_AZURE=metaflow-storage-container/tf-full-stack-sysroot

stobmetaflowminion is the storage account, and metaflow-storage-container is the storage container.
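The same names can be extracted programmatically from the two settings. This is a small sketch using only the Python standard library; the helper name parse_storage_settings is hypothetical, and it assumes the endpoint host starts with the storage account name and the sysroot starts with the container name, as in the example output above.

```python
from urllib.parse import urlparse


def parse_storage_settings(endpoint, sysroot):
    """Return (storage_account, container, prefix) from the two Metaflow settings."""
    # The account is the first label of the endpoint host,
    # e.g. "stobmetaflowminion" in "stobmetaflowminion.blob.core.windows.net".
    host = urlparse(endpoint).hostname
    account = host.split(".")[0]
    # The sysroot has the form "<container>/<path inside the container>".
    container, _, prefix = sysroot.partition("/")
    return account, container, prefix


account, container, prefix = parse_storage_settings(
    "https://stobmetaflowminion.blob.core.windows.net/",
    "metaflow-storage-container/tf-full-stack-sysroot",
)
print(account)    # stobmetaflowminion
print(container)  # metaflow-storage-container
print(prefix)     # tf-full-stack-sysroot
```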

From the container page, go to "Access Control (IAM)" to assign the Storage Blob Data Contributor role. Note that, in our experience, this role assignment can take several minutes to propagate.
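The same assignment can be scripted with "az role assignment create" instead of the portal. The sketch below only builds the command string and does not run it. The subscription ID and assignee are placeholder values, and the storage account is assumed to live in the resource group rg-metaflow-minion-westus that appears elsewhere in the example Terraform output; substitute your own values.

```python
import shlex


def container_scope(subscription_id, resource_group, account, container):
    """Build the Azure resource ID of a blob container, used as the role scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


def role_assignment_command(assignee, role, scope):
    """Return a shell-quoted az command that assigns `role` to `assignee` at `scope`."""
    args = [
        "az", "role", "assignment", "create",
        "--assignee", assignee,
        "--role", role,
        "--scope", scope,
    ]
    return shlex.join(args)


# Placeholder subscription ID and assignee; replace with real values.
scope = container_scope(
    "00000000-0000-0000-0000-000000000000",
    "rg-metaflow-minion-westus",
    "stobmetaflowminion",
    "metaflow-storage-container",
)
command = role_assignment_command(
    "user@example.com", "Storage Blob Data Contributor", scope
)
print(command)
```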

AKS Cluster Access

In the Azure portal, navigate to the relevant AKS cluster. For example, given this Terraform output:

az aks get-credentials --resource-group rg-metaflow-minion-westus --name aks-ob-metaflow-minion

aks-ob-metaflow-minion is the AKS cluster name.

Navigate to the management page for this cluster, and go to "Access Control (IAM)" to assign the following roles:

  • Azure Kubernetes Service Contributor Role
  • Azure Kubernetes Service Cluster User Role
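As a scripted alternative, the resource group and cluster name can be read from the "az aks get-credentials" line in the Terraform output and turned into one "az role assignment create" command per role. This is a sketch that only prints the commands. The function names are hypothetical, and the subscription ID and assignee are placeholders.

```python
import argparse
import shlex

# The two built-in roles listed above.
ROLES = [
    "Azure Kubernetes Service Contributor Role",
    "Azure Kubernetes Service Cluster User Role",
]


def parse_get_credentials(command):
    """Extract (resource_group, cluster_name) from an `az aks get-credentials` line."""
    tokens = shlex.split(command)
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--resource-group", "-g", required=True)
    parser.add_argument("--name", "-n", required=True)
    # Skip the leading "az aks get-credentials" tokens; ignore any other flags.
    args, _unknown = parser.parse_known_args(tokens[3:])
    return args.resource_group, args.name


def cluster_scope(subscription_id, resource_group, cluster_name):
    """Build the Azure resource ID of an AKS cluster, used as the role scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster_name}"
    )


resource_group, cluster_name = parse_get_credentials(
    "az aks get-credentials --resource-group rg-metaflow-minion-westus "
    "--name aks-ob-metaflow-minion"
)
# Placeholder subscription ID and assignee; replace with real values.
scope = cluster_scope(
    "00000000-0000-0000-0000-000000000000", resource_group, cluster_name
)
commands = [
    shlex.join([
        "az", "role", "assignment", "create",
        "--assignee", "user@example.com",
        "--role", role,
        "--scope", scope,
    ])
    for role in ROLES
]
for command in commands:
    print(command)
```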

Azure Services List

We deploy these services in the AKS cluster:

Metaflow

  • Metadata service: supports reading and writing metadata. For example:
    • While a flow is running, it POSTs metadata here.
    • The Metaflow Client library calls this service to read metadata.
  • UI static service: serves the web UI frontend bundle.
  • UI backend: supports the UI's data needs.

Argo Workflows

The quickstart k8s manifest published by Argo Workflows spins up the following services:

kubectl get services -n argo
NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
argo-server                   ClusterIP   10.0.26.126    <none>        2746/TCP            32m
httpbin                       ClusterIP   10.0.66.229    <none>        9100/TCP            32m
minio                         ClusterIP   10.0.173.242   <none>        9000/TCP,9001/TCP   32m
postgres                      ClusterIP   10.0.51.199    <none>        5432/TCP            32m
workflow-controller-metrics   ClusterIP   10.0.139.237   <none>        9090/TCP            32m