
Kubernetes on Google Cloud - Details

Here are key technical details about the Metaflow deployment on Google Cloud.

Architecture Diagram

GCP Resource List

| Category | Resource | Purpose |
| --- | --- | --- |
| Access Control | Service account | An identity that has all the permissions required to run Metaflow workloads, either locally against Google Cloud Storage or entirely within the GKE cluster. |
| Access Control | Service account key | Used by Metaflow to authenticate as the service account above. Note: this is needed for local runs as well as for Metaflow logic that runs before workload tasks are submitted to GKE. It is not required for GCP access from within a GKE pod. |
| Access Control | Role assignments | Grant the service account above sufficient access to Google Cloud Storage, GKE, and Cloud SQL (PostgreSQL). For specific details and conditions tied to these role assignments, please refer to the source code. |
| Networking | Virtual network | Top-level private virtual network that houses all Metaflow-related GCP resources. |
| Networking | Subnet | Houses the PostgreSQL DB. |
| Storage | Google Cloud Storage bucket | Metaflow artifacts are stored here. |
| Kubernetes | GKE cluster | Has built-in compute node autoscaling. It serves two purposes: Metaflow services run on this cluster, and compute tasks from running flows run as pods in it. |
| Database | Cloud SQL instance | A PostgreSQL DB instance for indexing Metaflow run metadata. |

Required GCP Permissions for Deployment

The required permissions are described by the following custom role (output of gcloud iam roles describe):

description: <DESCRIPTION>
includedPermissions:
- cloudsql.backupRuns.create
- cloudsql.backupRuns.delete
- cloudsql.backupRuns.get
- cloudsql.backupRuns.list
- cloudsql.databases.create
- cloudsql.databases.delete
- cloudsql.databases.get
- cloudsql.databases.list
- cloudsql.databases.update
- cloudsql.instances.addServerCa
- cloudsql.instances.clone
- cloudsql.instances.connect
- cloudsql.instances.create
- cloudsql.instances.createTagBinding
- cloudsql.instances.delete
- cloudsql.instances.deleteTagBinding
- cloudsql.instances.demoteMaster
- cloudsql.instances.export
- cloudsql.instances.failover
- cloudsql.instances.get
- cloudsql.instances.import
- cloudsql.instances.list
- cloudsql.instances.listEffectiveTags
- cloudsql.instances.listServerCas
- cloudsql.instances.listTagBindings
- cloudsql.instances.login
- cloudsql.instances.promoteReplica
- cloudsql.instances.resetSslConfig
- cloudsql.instances.restart
- cloudsql.instances.restoreBackup
- cloudsql.instances.rotateServerCa
- cloudsql.instances.startReplica
- cloudsql.instances.stopReplica
- cloudsql.instances.truncateLog
- cloudsql.instances.update
- cloudsql.sslCerts.create
- cloudsql.sslCerts.delete
- cloudsql.sslCerts.get
- cloudsql.sslCerts.list
- cloudsql.users.create
- cloudsql.users.delete
- cloudsql.users.get
- cloudsql.users.list
- cloudsql.users.update
- compute.globalAddresses.createInternal
- compute.globalAddresses.deleteInternal
- compute.globalAddresses.get
- compute.instanceGroupManagers.get
- compute.networks.create
- compute.networks.delete
- compute.networks.get
- compute.networks.removePeering
- compute.networks.updatePolicy
- compute.networks.use
- compute.subnetworks.create
- compute.subnetworks.delete
- compute.subnetworks.get
- container.clusterRoleBindings.create
- container.clusterRoleBindings.delete
- container.clusterRoleBindings.get
- container.clusterRoleBindings.list
- container.clusterRoleBindings.update
- container.clusterRoles.bind
- container.clusterRoles.create
- container.clusterRoles.delete
- container.clusterRoles.escalate
- container.clusterRoles.get
- container.clusterRoles.list
- container.clusterRoles.update
- container.clusters.create
- container.clusters.delete
- container.clusters.get
- container.configMaps.create
- container.configMaps.delete
- container.configMaps.get
- container.configMaps.list
- container.configMaps.update
- container.customResourceDefinitions.create
- container.customResourceDefinitions.delete
- container.customResourceDefinitions.get
- container.customResourceDefinitions.getStatus
- container.customResourceDefinitions.list
- container.customResourceDefinitions.update
- container.customResourceDefinitions.updateStatus
- container.deployments.create
- container.deployments.delete
- container.deployments.get
- container.deployments.getScale
- container.deployments.getStatus
- container.deployments.list
- container.deployments.rollback
- container.deployments.update
- container.deployments.updateScale
- container.deployments.updateStatus
- container.namespaces.create
- container.namespaces.delete
- container.namespaces.finalize
- container.namespaces.get
- container.namespaces.getStatus
- container.namespaces.list
- container.namespaces.update
- container.namespaces.updateStatus
- container.operations.get
- container.priorityClasses.create
- container.priorityClasses.delete
- container.priorityClasses.get
- container.priorityClasses.list
- container.priorityClasses.update
- container.roleBindings.create
- container.roleBindings.delete
- container.roleBindings.get
- container.roleBindings.list
- container.roleBindings.update
- container.roles.bind
- container.roles.create
- container.roles.delete
- container.roles.escalate
- container.roles.get
- container.roles.list
- container.roles.update
- container.secrets.create
- container.secrets.delete
- container.secrets.get
- container.secrets.list
- container.secrets.update
- container.serviceAccounts.create
- container.serviceAccounts.createToken
- container.serviceAccounts.delete
- container.serviceAccounts.get
- container.serviceAccounts.list
- container.serviceAccounts.update
- container.services.create
- container.services.delete
- container.services.get
- container.services.getStatus
- container.services.list
- container.services.proxy
- container.services.update
- container.services.updateStatus
- edgecontainer.clusters.create
- iam.serviceAccountKeys.create
- iam.serviceAccountKeys.get
- iam.serviceAccounts.actAs
- iam.serviceAccounts.create
- iam.serviceAccounts.delete
- iam.serviceAccounts.get
- iam.serviceAccounts.getIamPolicy
- iam.serviceAccounts.list
- iam.serviceAccounts.setIamPolicy
- resourcemanager.projects.get
- resourcemanager.projects.setIamPolicy
- servicenetworking.services.addPeering
- servicenetworking.services.get
- storage.buckets.create
- storage.buckets.delete
- storage.buckets.get
- storage.objects.delete
- storage.objects.list
name: projects/<PROJECT>/roles/metaflow_admin
stage: GA
title: Metaflow admin
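
When reviewing a role this large, it can help to see how the permissions break down by GCP service. The following is a minimal sketch, not part of the Metaflow tooling: the helper name `permissions_by_service` is hypothetical, and it assumes the role has been saved as plain text in the layout shown above (one `- permission` entry per line under `includedPermissions:`). The sample input is a shortened excerpt of the role, not the full list.

```python
from collections import defaultdict
from typing import Dict, List


def permissions_by_service(describe_output: str) -> Dict[str, List[str]]:
    """Group entries under `includedPermissions:` by their service prefix.

    Hypothetical helper: it only understands the simple layout printed by
    `gcloud iam roles describe` above, not arbitrary YAML.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    in_list = False
    for raw in describe_output.splitlines():
        line = raw.strip()
        if line == "includedPermissions:":
            in_list = True
        elif in_list and line.startswith("- "):
            permission = line[2:].strip()
            # The text before the first dot is the service, e.g. "cloudsql".
            service = permission.split(".", 1)[0]
            grouped[service].append(permission)
        elif in_list and line:
            # Any other non-empty line is the next top-level key.
            in_list = False
    return dict(grouped)


# Shortened excerpt of the role above, used only as sample input.
sample = """\
description: <DESCRIPTION>
includedPermissions:
- cloudsql.instances.create
- compute.networks.create
- container.clusters.create
- container.clusters.get
- iam.serviceAccounts.actAs
- storage.buckets.create
name: projects/<PROJECT>/roles/metaflow_admin
stage: GA
title: Metaflow admin
"""

grouped = permissions_by_service(sample)
for service, perms in sorted(grouped.items()):
    print(f"{service}: {len(perms)}")
```

Running the same function over the full role output gives a per-service count that can be compared against the resource list at the top of this page.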

Required GCP Permissions for Running Flows

Kubernetes Engine Developer Role

Note: as of Q3 2022, there is no direct way to scope this role to a specific GKE cluster.

Storage Object Admin Role

This should be granted under this IAM condition:

resource.name.startsWith("projects/_/buckets/<BUCKET_NAME>")

The bucket name can be found in the end-user output of the Terraform run. For example, in the line below the bucket name is ob-metaflow-storage-bucket-ci:


"METAFLOW_DATASTORE_SYSROOT_GS": "gs://ob-metaflow-storage-bucket-ci/tf-full-stack-sysroot",
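
As an illustration of how the pieces above fit together, the sketch below derives the bucket name from the `METAFLOW_DATASTORE_SYSROOT_GS` value and fills in the IAM condition. The helper names, the condition title, and the service account e-mail are hypothetical placeholders; `roles/storage.objectAdmin` is the role ID for Storage Object Admin, and the dictionary follows the general shape of an IAM policy binding with a condition. It only builds the data structure and does not call any GCP API.

```python
from urllib.parse import urlparse


def bucket_from_sysroot(sysroot: str) -> str:
    """Return the bucket name from a gs://<bucket>/<prefix> URL."""
    parsed = urlparse(sysroot)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {sysroot!r}")
    return parsed.netloc


def bucket_condition(bucket: str) -> str:
    """Fill <BUCKET_NAME> into the IAM condition shown above."""
    return f'resource.name.startsWith("projects/_/buckets/{bucket}")'


def storage_binding(sysroot: str, member: str) -> dict:
    """Build a conditional binding for Storage Object Admin (illustrative)."""
    return {
        "role": "roles/storage.objectAdmin",
        "members": [member],
        "condition": {
            "title": "metaflow-bucket-only",  # hypothetical title
            "expression": bucket_condition(bucket_from_sysroot(sysroot)),
        },
    }


binding = storage_binding(
    "gs://ob-metaflow-storage-bucket-ci/tf-full-stack-sysroot",
    # Placeholder service account, not taken from the deployment.
    "serviceAccount:metaflow@example-project.iam.gserviceaccount.com",
)
print(binding["condition"]["expression"])
# prints: resource.name.startsWith("projects/_/buckets/ob-metaflow-storage-bucket-ci")
```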

GKE services list

We deploy these services in the GKE cluster:

Metaflow

  • Metadata service: supports reading and writing metadata. For example:
    • While a flow is running, it POSTs metadata to this service.
    • The Metaflow Client library calls this service to read metadata.
  • UI static service: serves the web UI frontend bundle.
  • UI backend: serves the data that the UI needs.

Argo Workflows

The quickstart Kubernetes manifest published by Argo Workflows spins up the following services:

kubectl get services -n argo
NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
argo-server                   ClusterIP   10.0.26.126    <none>        2746/TCP            32m
httpbin                       ClusterIP   10.0.66.229    <none>        9100/TCP            32m
minio                         ClusterIP   10.0.173.242   <none>        9000/TCP,9001/TCP   32m
postgres                      ClusterIP   10.0.51.199    <none>        5432/TCP            32m
workflow-controller-metrics   ClusterIP   10.0.139.237   <none>        9090/TCP            32m