Book Launch: Effective Data Science Infrastructure

August 30, 2022

My new book, Effective Data Science Infrastructure, teaches you how to set up infrastructure for ML and data science applications, similar to the stack that powers Netflix and hundreds of other modern companies.

Why write a book about data science infrastructure?

When Manning, a well-known publisher of technical books, approached me in April 2020 about writing a new book about “doing ML with Python”, I had two immediate reactions (having written a book once before): First, writing a book takes a crazy amount of work, so rationally my answer should be no way. Second, do technical books even matter these days? Much of the world’s knowledge is available online for free.

I had given a number of presentations about how to make data scientists more productive and more effective at developing and deploying machine learning applications in general, and our open-source project Metaflow in particular. After every such presentation, I had a nagging feeling that I had only scratched the surface. Is the audience really able to apply the ideas at home based on a few slides? While general principles and pretty high-level figures are useful, the devil really is in the details.

I had seen that ML systems come in many shapes and sizes. The most successful systems are developed by autonomous teams who have a deep understanding of their problem domain, as well as a good enough grasp of the technical stack that powers the system.

There’s a plethora of excellent books and papers available about various domains of ML, say, recommender systems, computer vision, and forecasting, but it was – and still is – clear that the world needs more education about the infrastructure stack which helps turn these ideas from academic papers into living and evolving systems.

Hence, over the locked-down summer of 2020, the idea of a long-form treatise on the full stack of data science infrastructure started to feel enticing. The material wouldn’t be freely available online, which is clearly a downside. On the other hand, a book is a time-tested, effective medium for developing a deep understanding of a new field, which is not easily achieved by browsing random web pages.

The book that grew over the next two years, Effective Data Science Infrastructure, distills a decade of experience from me and dozens of data scientists and platform engineers at Netflix and other companies who have built and operated business-critical, large-scale ML systems. Compressing all the concepts and learnings into a readable book took a number of iterations, but the result – and the journey of getting there – was made infinitely better thanks to my experienced editorial team at Manning, Doug Rudder and Nick Watts in particular.

What do you learn by reading this book?

The book provides a systematic walk-through of everything you need to know to design and deploy a modern infrastructure stack for machine learning and data science applications.

The name has an important qualifier, effective, signaling that the end goal is to produce positive business impact with the infrastructure, not to build it for its own sake. The tagline, how to make data scientists productive, signals that ultimately the impact is produced by humans, data scientists, not by the infrastructure itself, so usability, ergonomics, and human-centricity are overarching topics in the book.

The book advocates for the idea that data-intensive applications require a whole stack of infrastructure, as depicted in this figure:

You need to consider data, exactly how and where to process the data (compute resources), and how to orchestrate processing that consists of multiple stages (job scheduler), while acknowledging the fact that it will take multiple versions and iterations to produce the desired outcome (versioning). Fortunately, you don’t need a Google-scale solution from the get-go. As the book shows, you can develop a solution that grows with your organization over time.
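To make the four concerns a bit more tangible, here is a toy sketch in plain Python with no external libraries. It is only an illustration under loud assumptions: `ArtifactStore`, `load_data`, `square`, and `run_pipeline` are hypothetical names invented for this example, not part of Metaflow or the book, and each one is a drastically simplified stand-in for a real layer of the stack.

```python
import itertools


class ArtifactStore:
    """Versioning: every run gets its own ID and its own set of artifacts."""

    def __init__(self):
        self._runs = {}
        self._ids = itertools.count(1)

    def new_run(self):
        run_id = next(self._ids)
        self._runs[run_id] = {}
        return run_id

    def save(self, run_id, name, value):
        self._runs[run_id][name] = value

    def load(self, run_id, name):
        return self._runs[run_id][name]


def load_data():
    # Data: a real stack would read from a warehouse or an object store.
    return [3, 1, 4, 1, 5, 9, 2, 6]


def square(x):
    # A unit of work that the compute layer executes.
    return x * x


def run_pipeline(store, compute=map):
    """Orchestration: run the stages in order and record what each produced.

    `compute` is the compute layer: any callable with the signature of the
    built-in `map`. The default runs locally; a pool's `map` would also fit.
    """
    run_id = store.new_run()
    data = load_data()
    store.save(run_id, "data", data)
    squares = list(compute(square, data))
    store.save(run_id, "squares", squares)
    store.save(run_id, "total", sum(squares))
    return run_id


store = ArtifactStore()
first = run_pipeline(store)
second = run_pipeline(store)
print(first, second, store.load(first, "total"))
```

Because every run writes under its own ID, the second run does not overwrite the first, which is the property that makes iterating on an application safe. Swapping the `compute` argument changes where the work happens without touching the pipeline logic.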

These concerns provide just the technical foundations – the machinery. On top of this, we need people, the data scientists who are often not software engineers by training, to architect the applications. Many organizations have realized that in order to increase the speed of iteration and improvement, it is convenient to give the same people tools to operate the applications to a large degree as well. Finally, we need to tap into the core expertise of data scientists when it comes to ML-specific issues like feature engineering and model development.

As shown by the triangles in the figure, an effective infrastructure needs to strike a balance between the needs of humans and the machines. All these topics are covered by the book, as it is hard to operate an effective data science organization without addressing all these concerns systematically. Unsurprisingly, given my background, the book uses Metaflow to illustrate the ideas, but you can apply the concepts to other frameworks as well.

Learn by doing

Although the book doesn’t teach you machine learning (many great books do that already), I wanted to ground the discussion in realistic examples. In the course of reading the book, you get to build tiny applications that utilize unsupervised, supervised, and deep learning, natural language processing, visualizations, and time-series forecasting. And, of course, you get to build a fully functional movie recommendation system!

Thanks to Metaflow, the examples are rather concise and readable, like this one that performs a hyperparameter sweep for K-means clustering, optionally leveraging parallel compute in the cloud:

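from metaflow import FlowSpec, step, Parameter, resources, conda_base, profile

# Pin the Python and scikit-learn versions used by every step of the flow.
@conda_base(python='3.8.3', libraries={'scikit-learn': '0.24.1'})
class ManyKmeansFlow(FlowSpec):

    num_docs = Parameter('num-docs', help='Number of documents', default=1000000)

    # Resource requests apply when the step runs on a cloud compute layer.
    @resources(memory=4000)
    @step
    def start(self):
        # scale_data is a local helper module, not part of Metaflow.
        import scale_data
        docs = scale_data.load_yelp_reviews(self.num_docs)
        self.mtx, self.cols = scale_data.make_matrix(docs)
        # The values of k to sweep over: 5, 10, ..., 50.
        self.k_params = list(range(5, 55, 5))
        # Fan out: one train_kmeans task per value in k_params.
        self.next(self.train_kmeans, foreach='k_params')

    @resources(cpu=4, memory=4000)
    @step
    def train_kmeans(self):
        from sklearn.cluster import KMeans
        # self.input holds the value of k assigned to this branch.
        self.k = self.input
        # profile() measures how long the enclosed block takes.
        with profile('k-means'):
            kmeans = KMeans(n_clusters=self.k, verbose=1, n_init=1)
            kmeans.fit(self.mtx)
        self.clusters = kmeans.labels_
        self.next(self.analyze)

    @step
    def analyze(self):
        # analyze_kmeans is a local helper module, not part of Metaflow.
        from analyze_kmeans import top_words
        self.top = top_words(self.k, self.clusters, self.mtx, self.cols)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Merge the results of all branches into one dict keyed by k.
        self.top = {inp.k: inp.top for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    ManyKmeansFlow()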

If you are curious, you can find all code snippets featured in the book in this GitHub repository.
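If you want to see the shape of the foreach fan-out and join used in the flow above without installing anything, the following plain-Python sketch mimics it with a thread pool from the standard library. It is not how Metaflow implements branching. The `train` and `sweep` functions are hypothetical, and the "top words" are placeholder strings rather than real clustering output.

```python
from concurrent.futures import ThreadPoolExecutor


def train(k):
    # Stand-in for the train_kmeans and analyze steps: a real branch would
    # fit KMeans with k clusters and extract the top words of each cluster.
    return {"k": k, "top": [f"word-{i}" for i in range(k)]}


def sweep(k_params, max_workers=4):
    # Fan out: one independent branch per value of k, run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        branches = list(pool.map(train, k_params))
    # Join: gather the branch results into a dict keyed by k.
    return {branch["k"]: branch["top"] for branch in branches}


k_params = list(range(5, 55, 5))
top = sweep(k_params)
print(sorted(top))
```

The point of the pattern is that each branch depends only on its own value of k, so the branches can run in any order or in parallel, and the join step is the single place where their results meet.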

Is this book for me?

Designing, building, operating, and using effective data science infrastructure is a team sport. The book is written from the perspective of two protagonists: Alex the data scientist and Bowie the platform engineer. If you identify with either or both of these characters, you should read the book.

To make sure that you can actually learn practical, real-life skills and knowledge from the book, every section is prefaced with a real-world scenario, which often involves both Alex and Bowie. Here are a few examples to give you an idea:

This figure is from Chapter 5, which introduces cloud-based compute layers like Kubernetes and AWS Batch. You will learn how to set up these systems in such a way that Alex the data scientist doesn’t have to worry about them too much – they can just benefit from the cloud compute that allows them to experiment freely – while the operational burden is minimized for Bowie the engineer.

Similarly, converting raw data to features in a reliable manner, at scale, takes some coordination, as described in Chapters 8 and 9:

Even when all the foundational infrastructure is humming reliably, data scientists need tooling and support for version control and experiment tracking. Reproducibility in a business environment is not just a noble goal but a requirement for effective teamwork, as discussed in Chapter 6:

Drawing these illustrations and over 100 others was one of the most fun parts of writing the book.

What next?

Go and buy the book at the Manning web site (or Amazon)! Besides supporting the hard-working folks at Manning, your money will go to support underrepresented groups in data science, so even if you don’t like the book, your money won’t be wasted! If you feel extra generous, please leave a review on either or both of those sites 🙏.

I would love to hear your feedback, thoughts, and questions. Join our Slack to chat with me (@ville – DM me on Slack for a discount code) and over 1500 other Alexes and Bowies who design and use effective data science infrastructure of various kinds daily 🐶. The book is just a blueprint. Our whole community is there to help you during your learning journey and when you get to operate your stack in practice.

Finally, we are organizing an exciting live-streamed event that will feature me as well as another amazing author, Chip Huyen, who also recently published a book about these topics. We will discuss questions related to writing technical books, educating the world about ML, and data science infrastructure in general – as well as share an exciting product announcement! Register today so you won’t miss it! 😊
