Kubernetes on Azure - FAQ

What's with all the port forwards?

See here. We publish a short utility script to open up the port forwards and keep them open by preventing inactivity timeouts here.

I want to use an existing storage account/container, an existing AKS, or DB. What to do?

The Terraform templates provide an all-in-one experience and create every resource from scratch. To use existing resources, the templates need to be adapted. For example, to bring your own storage account and storage container, here are some possible approaches:

Full manual reference replacement

  • Look for all references to the Terraform resources:
    • metaflow_storage_account
    • metaflow_storage_container
  • Replace those references with the known attributes of the account and container you are bringing, e.g. storage_account_name or container_name.
  • Remove the Terraform resources metaflow_storage_account and metaflow_storage_container from the templates.

Terraform "data source" approach

  • Replace the resources metaflow_storage_account and metaflow_storage_container with Terraform "data sources" that look up the existing account and container, i.e. the storage_account and storage_container data sources of the azurerm provider.
  • Update references to these resources so that the callers refer to the "data source" object instead, e.g. data.azurerm_storage_account.<name>.

How do I use GPU nodes in my AKS cluster?

Our quickstart Terraform templates do not support GPU nodes out of the box. They may be extended in the future, most likely by automating the GPU setup within the templates themselves.

How do I change the VM instance types for the AKS control plane as well as for Metaflow task runner nodes?

Change these lines (control plane, tasks) and reapply the Terraform template with "terraform apply -target=module.infra".

I want finer-grained auth on running flows. What to do?

See here.

Why is my Metaflow flow stuck on k8s pending state forever?

When Metaflow submits tasks to Kubernetes for execution, there are two scenarios:

  • There are sufficient spare resources to immediately run the task.
  • There are NOT sufficient spare resources right now. The AKS autoscaler provisions additional compute nodes to satisfy the requirements of the new task. Once provisioning is complete, the task runs.

The AKS autoscaler is configured with a specific instance type (defaults to Standard_D8_v5) to provision when more capacity is needed. If a task's resource requirements exceed what a single VM of that size can provide, adding more nodes can never satisfy the task, and it will be stuck in pending forever.

When this is suspected, double-check your resource requirements vs the VM instance type used in the taskworker node pool.
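As a rough aid for that double-check, here is a minimal Python sketch that compares a task's CPU and memory request against a single node. It is not part of Metaflow or the templates. The NodeSize class, the fits_on_node helper, and the reserved headroom values are all hypothetical and chosen for illustration. Memory is given in MB on the assumption that this matches the units of your Metaflow resource settings. The capacity shown for Standard_D8_v5 (8 vCPUs, 32 GiB) is the nominal VM size, and the amount Kubernetes can actually allocate on a node is lower and should be read from your own cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSize:
    """Nominal capacity of one VM in the task node pool."""
    name: str
    cpu: float       # vCPUs
    memory_mb: int   # memory in MB


# Nominal size of the default instance type: 8 vCPUs, 32 GiB of memory.
STANDARD_D8_V5 = NodeSize(name="Standard_D8_v5", cpu=8, memory_mb=32 * 1024)

# Hypothetical headroom for the kubelet and system pods. The real
# allocatable amounts depend on the cluster; treat these as placeholders.
ASSUMED_RESERVED_CPU = 0.5
ASSUMED_RESERVED_MEMORY_MB = 4096


def fits_on_node(cpu, memory_mb, node=STANDARD_D8_V5,
                 reserved_cpu=ASSUMED_RESERVED_CPU,
                 reserved_memory_mb=ASSUMED_RESERVED_MEMORY_MB):
    """Return True if the request fits on a single, otherwise empty node."""
    usable_cpu = node.cpu - reserved_cpu
    usable_memory_mb = node.memory_mb - reserved_memory_mb
    return cpu <= usable_cpu and memory_mb <= usable_memory_mb


if __name__ == "__main__":
    # Example requests as (cpu, memory in MB) pairs.
    for cpu, memory_mb in [(4, 16000), (8, 16000), (2, 64000)]:
        if fits_on_node(cpu, memory_mb):
            verdict = "fits on one node"
        else:
            verdict = "exceeds one node, would stay pending"
        print(f"cpu={cpu}, memory={memory_mb} MB: {verdict}")
```

A request that fails this check cannot be fixed by the autoscaler adding more nodes of the same size. Either lower the task's resource request or switch the node pool to a larger VM instance type.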

Can I do the deployment quickstart on Windows?

Yes, all the CLI tools involved work on Windows natively.

Need Help?

The quickest way to get help is our public Slack channel #ask-metaflow. We look forward to your questions and feedback.