This blog post is a collaboration between Coveo, a long-term Metaflow user, and Outerbounds. At Outerbounds, we’ve collaborated with Coveo several times, on such projects as Metaflow cards and a recent fireside chat about Reasonable Scale Machine Learning — You’re not Google and it’s totally OK. For this post, we sat down with Jacopo from Coveo to think through how they use Metaflow to connect DataOps with MLOps to answer the question: once data is properly transformed, how is it consumed downstream to produce business value?
In this and a previous post, we tell the story of how tools and culture changed together during our tenure at a fast-growing company, Coveo, and share an open-source repository that embodies our principles for data collaboration in working code. In our experience, DataOps and MLOps are better done under the same principles, instead of “handing over” artifacts to the team on the other side of the fence. In particular, our goal here is twofold:
- In our previous post, we introduced the stack of tools used and a technical template for teams starting up and wondering how to join DataOps and MLOps efficiently (we’ll recap the main points below).
- In this post, we show how good tools provide a better way to think about the division of work and productivity, thus providing an organizational template for managers and data leaders.
Recap: From the Modern Data Stack to MLOps
The modern data stack (MDS) has been consolidated as a series of best practices around data collection, storage, and transformation. In particular, the MDS encompasses three pillars:
- A scalable ingestion mechanism, either through tools (e.g. Fivetran, Airbyte) or custom infrastructure;
- A data warehouse (e.g. Snowflake) storing all data sources together;
- A transformation tool (e.g. dbt), ensuring versioned, DAG-like operations over raw data using SQL.
A lot has already been said about the MDS as such, but the situation is more “scattered” on the other side of the fence: once data is properly transformed, how is it consumed downstream to produce business value?
Our solution is to accept that not every company requires elaborate, infinitely scalable infrastructure like that deployed by the Googles and Metas of the world, and that is totally OK: doing ML at “reasonable scale” is more rewarding and effective than ever, thanks to a great ecosystem of vendors and open-source solutions.
The backbone for this work is provided by Metaflow, our open-source framework which (among other things) lowers the barrier to entry for data scientists taking machine learning from prototype to production. The general stack looks like this, although Metaflow lets you swap any of the other components in and out:
How does this stack translate good data culture into working software at scale? A useful way to isolate (and reduce) complexity is by understanding where computation happens. In our pipeline, we have four computing steps, and two providers:
- Data is stored and transformed in Snowflake, which provides the underlying compute for SQL, including data transformations managed by a tool like dbt;
- Training happens on AWS Batch, leveraging the abstractions provided by Metaflow;
- Serving is on SageMaker, leveraging the PaaS offering by AWS;
- Scheduling is on AWS Step Functions, leveraging once again Metaflow (not shown in the repo, but straightforward to achieve).
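To make the “four steps, two providers” point concrete, the list above can be written down as a small lookup table. This is only a summary in Python, not code from the repository: the `Step` type, its field names, and the step names are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class Step(NamedTuple):
    """One computing step of the pipeline (hypothetical helper type)."""
    name: str        # pipeline stage
    provider: str    # who supplies the compute
    runs_on: str     # service doing the work
    interface: str   # tool the team actually interacts with


# The four computing steps listed above.
PIPELINE = [
    Step("transform", "Snowflake", "Snowflake SQL", "dbt"),
    Step("train", "AWS", "AWS Batch", "Metaflow"),
    Step("serve", "AWS", "SageMaker", "SageMaker"),
    Step("schedule", "AWS", "AWS Step Functions", "Metaflow"),
]


def steps_by_provider(pipeline):
    """Group step names by the provider that runs them."""
    grouped = defaultdict(list)
    for step in pipeline:
        grouped[step.provider].append(step.name)
    return dict(grouped)


if __name__ == "__main__":
    for provider, steps in steps_by_provider(PIPELINE).items():
        print(f"{provider}: {', '.join(steps)}")
```

Grouping by provider shows why the stack is easy to reason about: however many tools sit on top, there are only two places where compute has to be provisioned, secured, and paid for.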
A tale of two cultures
This stack is built for productivity: provisioning, scaling, versioning, and documentation all come for free. But it is also built for collaboration: part of it is easy to spot, as training models on Metaflow and Comet is indeed a team sport. You need both Mario and Luigi! Part of it is subtler, and we want to discuss it through our own experience at Coveo, a B2B SaaS company specializing in AI services, moving from an older stack to (a version of) this stack.
When Jacopo joined Coveo, the data team was in charge of safely storing application data and behavioral signals. From there, two teams transformed that data into models and insights, but they did so in parallel: the BI team performed its own transformations with Redshift and SQL to power dashboards for Jacopo and his clients; the ML team ran a Spark-based stack, where transformations were Scala jobs moving data from S3 to S3, and finally into EMR for the training step.
Concerns for data semantics, pipeline duplication, and siloed intelligence
Before even discussing tools, we have three organizational problems to solve:
- Lack of ownership in data semantics: raw data is, unsurprisingly, raw. Understanding the nuances of any ingestion protocol is no easy feat: if data engineers own just persistence, and not normalization, we are asking ML and BI folks to dig into the subtleties of the ingestion pipeline. This is both error-prone, as they may not know the correct definition of, say, a shopping session given our cookie strategy, and an unwanted dependency: we want data teams to work with well-understood entities in our business domain (cart, products, recommendations), not to reinvent the wheel every time they run a query.
- Duplication of data pipelines and storage, and proliferation of tools: we maintain two parallel pipelines, send data to both S3 and Redshift (duplicating security, governance, etc.), and start a “my data / your data” dynamic: when BI and ML report different values for the same KPI, how do we get to the truth? Moreover, we now have Scala, SQL, Python, Spark, and an orchestrator (not shown) to master: if the BI team wants to help out the ML team, it’s going to be hard to even set up an environment for them.
- Siloed intelligence: not only do bad things happen in both pipelines but good things cannot be shared; if the BI team has a strategy to calculate conversion rate, the ML team would be completely oblivious to that work, as it sits on a different code base and a different data source.
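The first problem is easier to see with a toy example. The sketch below keeps a single definition of “shopping session” in one place and lets both a BI-style and an ML-style consumer build on it. Everything in it is an assumption made for illustration: the 30-minute inactivity timeout, the `(cookie_id, timestamp)` event shape, and the function names. In a real stack this logic would more likely be a SQL model in the warehouse; Python is used here only to keep the sketch runnable.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Assumption: 30 minutes of inactivity closes a session.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (cookie_id, timestamp) events into sessions.

    Returns a list of (cookie_id, [timestamps]) tuples, one per session.
    """
    sessions = []
    ordered = sorted(events)  # by cookie_id, then by timestamp
    for cookie_id, group in groupby(ordered, key=lambda e: e[0]):
        current = []
        for _, ts in group:
            # A gap longer than the timeout starts a new session.
            if current and ts - current[-1] > timeout:
                sessions.append((cookie_id, current))
                current = []
            current.append(ts)
        sessions.append((cookie_id, current))
    return sessions


def bi_session_count(events):
    """BI-style consumer: how many sessions did we see?"""
    return len(sessionize(events))


def ml_session_lengths(events):
    """ML-style consumer: events per session, usable as a feature."""
    return [len(timestamps) for _, timestamps in sessionize(events)]


T0 = datetime(2022, 1, 1, 10, 0)
EVENTS = [
    ("a", T0),
    ("a", T0 + timedelta(minutes=10)),
    ("a", T0 + timedelta(hours=2)),  # long gap: new session for "a"
    ("b", T0 + timedelta(minutes=5)),
]

if __name__ == "__main__":
    print(bi_session_count(EVENTS))
    print(ml_session_lengths(EVENTS))
```

Because both consumers call the same `sessionize`, a change to the timeout or the cookie strategy is made once by whoever owns the definition, and BI and ML cannot drift apart.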
The Solution: SQL, Python, and a Single Source of Truth
Our stack provides a much better blueprint for collaboration at reasonable scale:
First, we empower the Data Team to own raw data and the first layer of transformation, from JSON to basic entities in our business. Second, we all operate out of a Single Source of Truth (SSoT, i.e. Snowflake), so all insights come from the same place and lineage is easy. Third, we drastically cut down the number of tools to use: SQL and Python will get you all the way. Finally, we promote a culture of sharing, which is perfectly embodied by dbt Cloud: a lot of intermediate features may be in common between ML and BI, and they can collaborate effectively by building on each other’s work. If you look at the repository, for example, you’ll notice that we are using dbt+Snowflake basically as the offline component of a feature store.
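A minimal sketch of the “single source of truth” idea follows, with an in-memory SQLite database standing in for Snowflake and a SQL view standing in for a dbt model. The table, view, and column names are hypothetical and are not taken from the repository; the point is only that BI and ML read the same intermediate feature rather than recomputing it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.executescript(
    """
    -- First layer, owned by the data team: basic business entities.
    CREATE TABLE sessions (
        session_id  TEXT,
        n_views     INTEGER,
        n_purchases INTEGER
    );
    INSERT INTO sessions VALUES
        ('s1', 5, 1), ('s2', 2, 0), ('s3', 8, 2), ('s4', 1, 0);

    -- Shared intermediate model (the kind of thing dbt would version).
    CREATE VIEW session_features AS
    SELECT
        session_id,
        n_views,
        CASE WHEN n_purchases > 0 THEN 1 ELSE 0 END AS converted
    FROM sessions;
    """
)


def bi_conversion_rate(connection):
    """BI-style consumer: a dashboard KPI built on the shared model."""
    (rate,) = connection.execute(
        "SELECT AVG(converted) FROM session_features"
    ).fetchone()
    return rate


def ml_training_rows(connection):
    """ML-style consumer: (feature, label) rows from the same model."""
    return connection.execute(
        "SELECT n_views, converted FROM session_features ORDER BY session_id"
    ).fetchall()


if __name__ == "__main__":
    print(bi_conversion_rate(conn))
    print(ml_training_rows(conn))
```

If the definition of `converted` changes, it changes in one view, and both the dashboard and the training set pick it up on the next run. That is the sense in which the warehouse plus the transformation layer behaves like the offline part of a feature store.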
Most of all, we found through everyday practice that this stack is the perfect playground for the end-to-end data scientist: the person who doesn’t need to know Kubernetes, or the Google Ingestion protocol, but who is responsible for transforming data into models by iterating quickly and independently, going all the way from tables to endpoint and back. This stack makes her happy, as it’s heavy on open source and light on people’s time: by encouraging data ownership, fostering collaboration, and abstracting away computation, it lets our scientists focus on the high-value, high-margin logic they want to write, to solve a business problem from start to finish.
In this post, we’ve shown how good tools provide a better way to think about the division of work and productivity, thus providing an organizational template for managers and data leaders. This was based upon the stack of tools used in our previous post, where we also introduced a technical template for teams starting up and wondering how to join DataOps and MLOps efficiently. We’ve also shared an open-source repository embodying in working code our principles for data collaboration: in our experience, DataOps and MLOps are better done under the same principles, instead of “handing over” artifacts to the team on the other side of the fence.
If these topics are of interest, come chat with us on our community Slack.