
Recommender Systems with Metaflow Tutorial

This tutorial is a first adventure in training recommender systems (RecSys) with Metaflow, which provides a scalable workflow you can use for both experimentation and production. The goal is to develop a relatively simple, effective, and general pipeline for sequential recommendations, and to show how you can use popular open-source libraries and tools, including DuckDB, Gensim, Metaflow, and Keras, to build a fully working cloud endpoint that serves predictions in real time, starting from raw data.

If you are new to Metaflow, we recommend starting with the introductory tutorials to get up to speed on the basics before returning to this tutorial.


Our use case is: given a training set made of music playlists (lists of songs hand-curated by users), can we suggest what to listen to next when presented with a new song?
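
To make the shape of the task concrete, here is a minimal sketch of the input (playlists as ordered lists of songs) and the output (a ranked list of suggested next songs). It is a deliberately simple co-occurrence baseline, not the embedding-based approach developed in the tutorial; the song IDs, the toy playlists, and the function names `build_next_song_counts` and `suggest_next` are all hypothetical.

```python
from collections import Counter, defaultdict


def build_next_song_counts(playlists):
    """Count how often each song is immediately followed by another song."""
    counts = defaultdict(Counter)
    for playlist in playlists:
        # Pair every song with the one that comes right after it.
        for current_song, next_song in zip(playlist, playlist[1:]):
            counts[current_song][next_song] += 1
    return counts


def suggest_next(counts, song, k=3):
    """Return up to k songs that most often follow `song` (empty if unseen)."""
    if song not in counts:
        return []
    return [candidate for candidate, _ in counts[song].most_common(k)]


# Hypothetical toy playlists; the real dataset is introduced in the episodes.
playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_a", "song_b", "song_d"],
    ["song_b", "song_c", "song_e"],
]

counts = build_next_song_counts(playlists)
print(suggest_next(counts, "song_b", k=2))
```

A baseline like this can only suggest songs it has seen directly after the query song, which is one reason to prefer learned representations for sequential recommendations.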

By following along you will learn how to:

  • take a recommender system idea from prototype to real-time production;
  • leverage Metaflow to train different versions of the same model and pick the best one;
  • use Metaflow cards to save important details about model performance;
  • package a representation of your data in a Keras object that you can deploy directly from the flow to a cloud endpoint with AWS SageMaker.


This tutorial does not assume knowledge about recommender systems, but does assume the following:

  • you are familiar with the basic concepts of Metaflow (flow, steps, tasks, client API, etc.) and know how to run a flow;
  • you are familiar with the basic concepts of machine learning such as training, validation, and test split for model evaluation.
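
As a quick refresher on the second point, the sketch below splits a list of playlists into training, validation, and test sets. The fractions, the seed, the toy data, and the helper name `split_playlists` are illustrative assumptions, not the settings used in the episodes.

```python
import random


def split_playlists(playlists, valid_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle playlists reproducibly and split them into train/valid/test."""
    shuffled = list(playlists)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Hypothetical toy data: 20 two-song playlists.
playlists = [[f"song_{i}", f"song_{i + 1}"] for i in range(20)]

train, valid, test = split_playlists(playlists)
print(len(train), len(valid), len(test))
```

The model is fit on the training set, competing versions are compared on the validation set, and the test set is held back for a final, unbiased estimate of performance.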

Bonus points (no worries, we will explain all of this) if you:

  • have experience with model serving;
  • know a little bit about what embeddings or neural networks are;
  • are comfortable with AWS concepts such as storing data in S3 and using SageMaker.

Tutorial Structure

The content includes the following:

Each episode contains either a Metaflow script to run or a Jupyter notebook. You do not need access to cloud computing or a Metaflow deployment to complete the first five episodes. To run the final episode, which deploys your model to an endpoint, you will need access to an AWS account with an IAM role that can execute operations on SageMaker. We will explain these details in that episode. As always, feel free to reach out to us in the #ask-metaflow channel on Slack if you need help deploying Metaflow on your infrastructure! The estimated time to complete all episodes is 1-2 hours.

Why Metaflow?

The main benefit of using a data science workflow solution like Metaflow when prototyping is that your code will be built on a strong foundation for deploying to a production environment. Metaflow is most useful when projects have scaling requirements, are mission-critical, and/or have many interacting parts. You can read more at these links:

After completing the lessons, you can transfer insights and code from the tutorial to your real-world data science projects. This is a beginner tutorial, so it does not cover many of the challenges that matter in production ML environments. For example, in production you may want to use Metaflow features such as the @conda decorator for dependency management, @batch or @kubernetes for remote execution, and @schedule to trigger jobs automatically.

Recommender System Resources

  • A gentle introduction to RecSys concepts, with a focus on metrics and testing.
  • A research paper on the methodology (prod2vec) we adopted to solve the use case demonstrated in this tutorial.