New in Metaflow: Accessing Secrets Securely

You can now access secrets securely in Metaflow flows using the new @secrets decorator. This video shows how to do it in less than a minute using AWS Secrets Manager (no sound):

Motivation

Consider a Metaflow flow that needs to access an external resource, say, a database that requires authentication with a username and password. Cases like this are common.

Thus far, there have been two main ways to handle this:

  1. Delegating authentication to the execution environment, e.g. to an IAM user executing the code.
  2. Accessing credentials from a file or an environment variable.

The first option can be secure, easy to manage centrally, and hence preferable in many cases. Unfortunately, it is mainly applicable to a handful of services like S3 that work with IAM natively. If you want to connect to third-party services like Snowflake, you need another approach.

The second option works with any service, but storing secrets in local files is considered bad practice for many good reasons. Locally stored secrets are hard to manage (what happens when the database password changes?) and they can leak both easily and inconveniently.

Secret managers, such as AWS Secrets Manager, provide a third option that combines the best of the two approaches. They allow arbitrary secrets to be stored and managed centrally. Accessing secrets is controlled through IAM roles that are available through the execution environment. Additionally, secrets are never stored in any environment outside the manager.

The new @secrets decorator

Earlier, there was a small speed bump if you wanted to use a secrets manager: accessing a secret, e.g. using the boto library, takes 15-20 lines of boilerplate infrastructure code which, as a data scientist, you would rather not worry about.

To make it easier to write production-ready code without cognitive overhead, Metaflow now provides a @secrets decorator that handles this with one line. Besides being a convenient abstraction, @secrets provides a standardized way to access secrets in all projects across environments.
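
For a sense of what that boilerplate looks like, here is a minimal sketch using boto3. The helper name, the assumption that the secret is stored as a JSON string of key-value pairs, and the error handling are illustrative rather than a prescribed pattern:

import json
import os

import boto3
from botocore.exceptions import ClientError

def load_db_credentials(secret_id='db-credentials', region_name=None):
    # Roughly the code @secrets saves you from writing by hand (illustrative sketch).
    client = boto3.client('secretsmanager', region_name=region_name)
    try:
        response = client.get_secret_value(SecretId=secret_id)
    except ClientError as err:
        raise RuntimeError(f'Could not read secret {secret_id}') from err
    # Assumes the secret is a JSON string of key-value pairs.
    credentials = json.loads(response['SecretString'])
    # Mirror what @secrets does: expose the pairs as environment variables.
    os.environ.update(credentials)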

Here is an example that uses the @secrets decorator to access a secret named db-credentials. The secret contains four key-value pairs that specify everything needed to connect to a Postgres database:

from metaflow import FlowSpec, step, secrets
import os
from psycopg import connect

class DBFlow(FlowSpec):

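    # @secrets fetches the named secret before this step starts and exposes its
    # key-value pairs as environment variables (DB_USER, DB_PASSWORD, DB_NAME, DB_HOST).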
    @secrets(sources=['db-credentials'])
    @step
    def start(self):
        with connect(user=os.environ['DB_USER'],
                     password=os.environ['DB_PASSWORD'],
                     dbname=os.environ['DB_NAME'],
                     host=os.environ['DB_HOST']) as conn:

            with conn.cursor() as cur:
                cur.execute("SELECT * FROM data")
                print(cur.fetchall())

        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DBFlow()
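
For reference, db-credentials is simply a secret whose keys match the environment variable names read above. As a rough sketch (with placeholder values), it could be created programmatically with boto3, or equivalently through the AWS console or CLI:

import json
import boto3

# Placeholder values for illustration; use your real connection details.
boto3.client('secretsmanager').create_secret(
    Name='db-credentials',
    SecretString=json.dumps({
        'DB_USER': 'me',
        'DB_PASSWORD': 'placeholder-password',
        'DB_NAME': 'mydb',
        'DB_HOST': 'db.example.com',
    }),
)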

Assuming you have db-credentials stored in AWS Secrets Manager, you can execute the flow on your workstation:

python dbflow.py run

or run it at scale on @batch or @kubernetes as usual:

python dbflow.py run --with kubernetes
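
or, if your compute runs on AWS Batch rather than Kubernetes:

python dbflow.py run --with batch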

Keeping many @secrets

Often, data scientists develop and test their code and models using a non-production dataset. The @secrets decorator supports this scenario smoothly.

Consider the above DBFlow snippet with the @secrets line removed. If you have a test database deployed locally, you can simply set the environment variables manually without having to use a secrets manager. This is fine, as the local database is not accessible to anyone outside your workstation (and it shouldn't contain sensitive data in any case):

export DB_HOST=localhost
export DB_USER=me
export DB_NAME=testdb
export DB_PASSWORD=not_a_secret

python dbflow.py run

Alternatively, your company may have a shared database containing test data. In this case, you can store its credentials in a secret, say, test-db-credentials, and run the flow like this:

python dbflow.py \
    --with 'secrets:sources=["test-db-credentials"]' \
    run

As usual in Metaflow, the --with option attaches the decorator to all steps without having to hardcode it in the code.

To deploy the flow in production, you can have a CI/CD pipeline with a separate IAM role that has exclusive access to production credentials. It can deploy the flow to production like this:

python dbflow.py \
    --with 'secrets:sources=["prod-db-credentials"]' \
    argo-workflows create

In this scenario, the IAM roles assigned to data scientists may disallow access to the prod-db-credentials altogether. The production credentials and the database are only accessible to production tasks running on Argo Workflows.

Crucially, in all these cases you don't have to change anything in the code as you move from prototype to production.

@secrets to success

You can start using the @secrets decorator today! For additional features and setup instructions, read the documentation for @secrets.

If you need help getting started or if you have any other questions, join us and thousands of other data scientists and engineers on the Metaflow community Slack! In particular, we would like to hear from you if you would like to see support for other backends besides AWS Secrets Manager.