Kubernetes on Google Cloud - FAQ

What's with all the port forwards?

See here. We also publish a short utility script here that opens the port forwards and keeps them open by preventing inactivity timeouts.

I want to use an existing storage bucket, an existing GKE cluster, or an existing DB. What to do?

The terraform templates provide an all-in experience, spinning up all resources fresh. They need to be adapted to accommodate existing resources. E.g. to bring your own storage bucket, here are two possible approaches:

Full manual reference replacement:

  • Find all references to metaflow_storage_bucket in the templates.

  • Replace those references with known attributes of the bucket you are bringing, e.g. its name.

  • Remove the terraform resources for metaflow_storage_bucket from the templates.

Terraform "data source" approach:

  • Replace the metaflow_storage_bucket resources with "data sources", i.e. google_storage_bucket.

  • Update references to these resources, so that the callers refer to the "data source" object instead, i.e. data.google_storage_bucket.<name>.
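The "data source" approach can be sketched as follows. This is a minimal illustration, not an excerpt from the templates: the data source label `existing` and the bucket name are placeholders.

```hcl
# Look up a pre-existing GCS bucket instead of creating one.
# The label "existing" and the bucket name below are placeholders.
data "google_storage_bucket" "existing" {
  name = "my-existing-metaflow-bucket"
}

# Callers that previously referenced the managed resource, e.g.
#   google_storage_bucket.metaflow_storage_bucket.name
# would instead reference the data source:
#   data.google_storage_bucket.existing.name
```

Because a data source only reads the bucket, terraform will no longer try to create or destroy it; the bucket's lifecycle stays outside the templates.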

How do I use GPU nodes in my GKE cluster?

Our quickstart terraform templates do not support that out of the box. They may be extended in the future; the approach would likely implement this in an automated manner within the templates.
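In the meantime, one manual extension is a dedicated GPU node pool attached to the cluster. This is a hedged sketch, assuming the templates define a `google_container_cluster` resource named `metaflow_gke` (the resource names and accelerator type here are illustrative; accelerator availability varies by zone):

```hcl
# Hypothetical GPU node pool; names are placeholders.
resource "google_container_node_pool" "metaflow_gpu" {
  name       = "metaflow-gpu-pool"
  cluster    = google_container_cluster.metaflow_gke.id
  node_count = 1

  node_config {
    machine_type = "n1-standard-8"

    # Attach one NVIDIA T4 to each node in this pool.
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }
  }
}
```

Note that GKE additionally requires NVIDIA device drivers to be installed on GPU nodes before pods can schedule onto them.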

How do I change the VM instance types for Metaflow task runner nodes?

The quickstart terraform templates use GKE's node auto-provisioning out of the box. Node auto-provisioning is aware of the resource requirements of incoming pods, which means it will spin up appropriately sized instances as needed.
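If you do want to pin a specific instance type rather than rely on node auto-provisioning, one option is a dedicated, autoscaled node pool. This is a sketch under the same assumption as above (a cluster resource named `metaflow_gke`); the pool name and machine type are placeholders:

```hcl
# Hypothetical fixed-machine-type pool; names are placeholders.
resource "google_container_node_pool" "metaflow_tasks" {
  name    = "metaflow-task-pool"
  cluster = google_container_cluster.metaflow_gke.id

  # Scale the pool between 0 and 10 nodes based on demand.
  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }

  node_config {
    machine_type = "n2-standard-16"
  }
}
```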

I want finer-grained auth on running flows. What to do?

See here.

Why is my Metaflow flow stuck in the Kubernetes Pending state forever?

When Metaflow submits tasks to Kubernetes for execution, there are two scenarios:

  • There are sufficient spare resources to immediately run the task.
  • There are NOT sufficient spare resources right now. The GKE autoscaler provisions additional compute nodes to satisfy the requirements of the new task. Once provisioning completes, the task runs.

If the GKE autoscaler (running node auto-provisioning) can never satisfy the new task's requirements, the task will be stuck in Pending forever. E.g. this happens when the aggregate hard CPU or memory limit on the GKE cluster has been reached. To resolve the aggregate limit issue, raise the limit in the templates.
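The aggregate limits live in the cluster's autoscaling block. A hedged sketch of what raising them looks like, assuming a cluster resource named `metaflow_gke` (the actual resource names and values in the templates may differ):

```hcl
# Hypothetical excerpt; names and numbers are placeholders.
resource "google_container_cluster" "metaflow_gke" {
  name     = "metaflow-gke"
  location = "us-central1"

  cluster_autoscaling {
    enabled = true

    # Aggregate, cluster-wide ceilings for node auto-provisioning.
    # Tasks stay in Pending forever once these are exhausted.
    resource_limits {
      resource_type = "cpu"
      minimum       = 1
      maximum       = 128
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 1
      maximum       = 512
    }
  }
}
```

After raising the maximums and applying, the autoscaler can provision further nodes and the pending tasks should schedule.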

Need Help?

The quickest way to get help is our public Slack channel #ask-metaflow. We look forward to your questions and feedback.