You can now access secrets securely in Metaflow flows using the new @secrets decorator. This video shows how to do it in less than a minute using AWS Secrets Manager (no sound):
Motivation
Consider a Metaflow flow that needs to access an external resource, say, a database requiring authentication such as a username and password. Cases like this are common.
Thus far, there have been two main ways to handle this:
Delegating authentication to the execution environment, e.g. to an IAM user executing the code.
Accessing credentials from a file or an environment variable.
The first option can be secure, easy to manage centrally, and hence preferable in many cases. Unfortunately, it is mainly applicable to a handful of services, such as S3, that work with IAM natively. If you want to connect to third-party services like Snowflake, you need another approach.
The second option works with any service, but storing secrets in local files is considered bad practice for many good reasons. Locally stored secrets are hard to manage (what happens if the database password changes?) and they can leak both easily and inconveniently.
Secret managers, such as AWS Secrets Manager, provide a third option that combines the best of the two approaches. They allow arbitrary secrets to be stored and managed centrally. Accessing secrets is controlled through IAM roles that are available through the execution environment. Additionally, secrets are never stored in any environment outside the manager.
The new @secrets decorator
Earlier, there was a small speed bump if you wanted to use a secrets manager: accessing a secret, e.g. using the boto library, takes 15-20 lines of boilerplate infrastructure code which, as a data scientist, you would rather not worry about.
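For a sense of what that boilerplate involves, here is a rough sketch of fetching a secret by hand with boto3. It assumes the secret is stored as a JSON object of key-value pairs; the helper name `load_secret_into_env` is hypothetical and not part of Metaflow, and retries and detailed error handling are left out.

```python
import json
import os


def load_secret_into_env(secret_id, client=None):
    """Fetch a JSON secret and export its key-value pairs as environment variables.

    Hypothetical helper for illustration only; not part of Metaflow.
    """
    if client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3

        client = boto3.client("secretsmanager")

    response = client.get_secret_value(SecretId=secret_id)
    secret_string = response.get("SecretString")
    if secret_string is None:
        raise ValueError(f"Secret {secret_id!r} has no SecretString payload")

    pairs = json.loads(secret_string)
    if not isinstance(pairs, dict):
        raise ValueError(f"Secret {secret_id!r} is not a JSON object")

    # Assumption: every key is a valid environment variable name.
    for key, value in pairs.items():
        os.environ[key] = str(value)

    # Return only the key names so values are never logged by accident.
    return sorted(pairs)
```

Every step that needs credentials would have to call something like this before connecting, which is the repetition the decorator is meant to remove.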
To make it easier to write production-ready code without cognitive overhead, Metaflow now provides a @secrets decorator that handles this with one line. Besides being a convenient abstraction, @secrets provides a standardized way to access secrets in all projects across environments.
Here is an example that uses the @secrets decorator to access a secret named db-credentials. The secret contains four key-value pairs that specify everything needed to connect to a Postgres database:
from metaflow import FlowSpec, step, secrets
import os
from psycopg import connect

class DBFlow(FlowSpec):

    @secrets(sources=['db-credentials'])
    @step
    def start(self):
        with connect(user=os.environ['DB_USER'],
                     password=os.environ['DB_PASSWORD'],
                     dbname=os.environ['DB_NAME'],
                     host=os.environ['DB_HOST']) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT * FROM data")
                print(cur.fetchall())
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DBFlow()
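For the flow above to work, the keys in the secret have to match the environment variable names that the step reads. As a sketch, assuming the secret is stored as a JSON object, it could be created with boto3 along these lines. The helper names and the placeholder values are hypothetical; only `create_secret` with `Name` and `SecretString` is the actual boto3 call.

```python
import json


def build_db_secret(user, password, dbname, host):
    """Serialize the four key-value pairs that DBFlow reads from os.environ."""
    return json.dumps(
        {
            "DB_USER": user,
            "DB_PASSWORD": password,
            "DB_NAME": dbname,
            "DB_HOST": host,
        }
    )


def create_db_secret(client, name, payload):
    """Store the payload in AWS Secrets Manager using a boto3 client."""
    return client.create_secret(Name=name, SecretString=payload)


# Example usage (needs AWS credentials that are allowed to create secrets):
#
#   import boto3
#   payload = build_db_secret("me", "placeholder", "testdb", "db.example.com")
#   create_db_secret(boto3.client("secretsmanager"), "db-credentials", payload)
```

The same secret can of course be created through the AWS console or CLI; what matters is that the key names line up with what the code expects.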
Assuming you have db-credentials stored in AWS Secrets Manager, you can execute the flow on your workstation:
python dbflow.py run
or run it at scale on @batch or @kubernetes as usual:
python dbflow.py run --with kubernetes
Keeping many @secrets
Often, data scientists develop and test their code and models using a non-production dataset. The @secrets decorator supports this scenario smoothly.
Consider the above code snippet featuring DBFlow but with the @secrets line removed. If you have a test database deployed locally, you can simply set the environment variables manually without having to use a secrets manager. This is OK, as the local database is not accessible to anyone outside your workstation (and it shouldn't contain sensitive data, in any case):
export DB_HOST=localhost
export DB_USER=me
export DB_NAME=testdb
export DB_PASSWORD=not_a_secret
python dbflow.py run
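When the variables are set by hand like this, a forgotten export only shows up as a KeyError in the middle of the connect call. One option, sketched here with a hypothetical helper that is not part of Metaflow, is to check for the four variables at the top of the step and report all missing names at once.

```python
import os

# The variable names used by DBFlow in this post.
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_NAME", "DB_PASSWORD")


def missing_db_vars(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def require_db_vars(environ=None):
    """Raise a single readable error listing every missing variable."""
    missing = missing_db_vars(environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

The check works the same way whether the values come from manual exports or from a secret, so it can stay in place as the flow moves between environments.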
Alternatively, your company may have a shared database containing test data. In this case, you can access its credentials in a secret, say, test-db-credentials, and run the flow like this:
python dbflow.py --with 'secrets:sources=["test-db-credentials"]' run
As usual in Metaflow, the --with option attaches the decorator to all steps without having to hardcode it in the code.
To deploy the flow in production, you can have a CI/CD pipeline with a separate IAM role that has exclusive access to production credentials. It can deploy the flow to production like this:
python dbflow.py --with 'secrets:sources=["prod-db-credentials"]' argo-workflows create
In this scenario, the IAM roles assigned to data scientists may disallow access to prod-db-credentials altogether. The production credentials and the database are only accessible to production tasks running on Argo Workflows.
Crucially, in all these cases you don't have to change anything in the code as you move from prototype to production.
@secrets to success
You can start using the @secrets decorator today! For additional features and setup instructions, read the documentation for @secrets.
If you need help getting started or if you have any other questions, join us and thousands of other data scientists and engineers on the Metaflow community Slack! In particular, we would like to hear from you if you would like to see support for other backends besides AWS Secrets Manager.