
Outerbounds for the next decade - and this week!

September 16, 2024

This week we will be releasing major new features, as well as highlighting the success that our customers have had building their ML/AI systems on Outerbounds. Read this post for our long-term vision and stay tuned for the releases over the next few days.

The past few years have been tumultuous for data science, analytics, and machine learning. Teams are expected to deliver value efficiently - gone are the days of free money and ballooning organizations. Meanwhile, generative AI has provided a sneak peek into a future where data and models play a much bigger role in products and processes than anyone anticipated, but the technology is still immature and evolving rapidly.

Leading an ML/AI organization in 2024 is a balancing act: you need to deliver value today - preferably more than before, and more quickly. But you don’t want to miss out on a future that will, rather inevitably, involve AI in various forms that emerge and mature over the coming years.

Be ready for whatever comes

As with earlier paradigm shifts, we can adapt by making progress at multiple layers simultaneously, accepting the fact that some changes will happen quickly, others slowly. The concept is neatly illustrated by Stewart Brand’s pace layers, where “the fast layers innovate; the slow layers stabilize”:

Pace Layers from Stewart Brand's *The Clock of the Long Now*

It will take many years for culture and governance - both at the level of an individual company and of society as a whole - to adapt to the capabilities afforded by ML and AI. Meanwhile, you can adopt an infrastructure layer that allows you to explore new features, products, and business processes at the pace that commerce demands.

Standing on a solid foundation, you can keep calm and pay gentle attention to the fast fashion of the latest models and frameworks, experimenting with the ones that seem best aligned with your goals.

What will change

Beyond fast fashion, what fundamental changes are on the horizon?

We believe that over the coming decade, nearly all software systems will be powered by an intricate interplay between code, data, and models.

Historically, systems have been built with human-authored code, occasionally coupled with small models and varying amounts of data. Over the coming decades, systems will still be defined in code, increasingly authored with machine assistance, but the code will be accompanied by data and models.

The long-term impact of this shouldn’t be underestimated. The nature of software engineering will change fundamentally. Systems will become more stochastic, demanding a new approach to systems design, akin to the mindset and culture that we learned at Netflix for building large-scale distributed systems.

Models and data become an inherent chaos monkey in every system. By embracing the new paradigm rather than evading change, we will be able to build more sophisticated systems without compromising robustness.

With more sophisticated machines, we will be able to provide experiences that are more human. The systems will be able to deal with more real-world diversity and complexity, more nuance, and more human-like interaction - and deal with deeper science in fields like drug discovery and climate technology.

Adjusting roles

If you come from a platform engineering or DevOps background, the presence of models and data will force you to rethink certain aspects of software and infrastructure.

Systems will be exposed to much more entropy through data and the need for compute capacity will go through the roof. These changes will introduce a host of gnarly issues related to novel hardware, distributed systems, observability, and finicky software supply chains.

Purple hats develop with models, data, and code / yellow hats manage infrastructure

In contrast, if you come from a data (science) or ML background, the emergence of AI is a good reason to brush up your software skills, as your work is being promoted from the back office to the frontlines.

A few weeks ago, we asked Santiago Valdarrama, who has been building AI/ML systems for over two decades and has a following of over half a million AI/ML developers, what he considers to be the most important skill in the era of AI:

You have to get good at writing code. That’s it. Using these AI APIs requires writing code. Calling an LLM API is 1% [of the effort] and the other 99% is managing exceptions and dealing with failures. There is a lot of plumbing that has to be written by somebody. Even if an LLM is writing 50% of it you have to adjust the 50% to make it 100% right.

To meet the needs of platform engineers and data developers, we need new infrastructure that simplifies building production systems involving models and data, while adhering to proven software engineering best practices. While novel infrastructure is required, it must exist in the context of existing governance and cultural frameworks which will take a longer time to change.

We can handle code, data, and models already, thankyouverymuch

At Outerbounds, we work with established data science, ML, and AI teams, many of which have been building business-critical ML systems for years. Hence, by necessity, all of them have either bought or built infrastructure that can deal with data, code, and models.

Our industry is new, so we see a fascinating diversity of existing solutions, each with their own motivations, design constraints, and path dependencies. This is how technological progress is supposed to work!

If everyone has a velocipede of the mind already - or at least they have seen too many of them already - is there anything more that needs to be done?

Just as bicycle development hadn’t reached even a local optimum in the 1860s, one can argue that we are just in the penny-farthing era of MLOps. If you squint your eyes and relax your expectations, the shape of the solution is kind of correct - surely you can access data, build models, and orchestrate everything with code - but the whole thing feels clunky, unsafe, and brittle.

Dreaming of a better bike

Imagine a world where you will be building systems with code, data, and models. Now, close your eyes and dream of an environment that makes you maximally productive, allowing you to explore new ideas and techniques at the speed of thought. What do you see?

Firstly, I see a proper development environment where I can develop these systems frictionlessly. I don’t want to worry about configuration, dealing with conflicting library versions or CUDA drivers, or not being able to access data due to company policies. Also, I want no-nonsense APIs to all foundational components, so I can focus on building higher-level systems without having to intermingle my models and business logic with infrastructure boilerplate.

Secondly, if I may dream freely, I don’t want to be bottlenecked by compute ever. Whatever amount of data I may have, however large my models may be, I want to be able to push a button and get results without hassle. I may want to scale out and test five ideas concurrently. Surely there’s enough compute capacity in the cloud to allow me to run all the tests in parallel. On top of this, I don’t want anyone to yell at me, complaining about the cost of compute.

Thirdly, when I am done, I want to push a button to deploy my work to production. As a responsible person, I understand the gravity of running a business-critical production system, so I am happy to work with CI/CD systems or whatever best practices are needed to make sure that my work is production-ready. I am also eager to keep improving my work systematically over time, if the system is delivering the value it should.

Develop, scale, deploy

We believe that the develop-scale-deploy cycle is so central to the development of ML/AI systems that it is best served as a seamless workflow, rather than as disparate tools. You may have tried systems in the past that promise any two of the above:

Develop + Deploy → Simplistic solutions; unable to address the needs of real-world ML/AI.
Develop + Scale → Great for experimentation; hard to integrate into production.
Scale + Deploy → Laissez-faire development; hard for practitioners to produce results quickly and consistently.

Since launching open-source Metaflow at Netflix five years ago, we have been on a mission to provide a delightful experience for developing, scaling, and deploying real-world ML/AI systems.

While no one disagrees about the importance of developer experience these days, what has set us apart is our focus on concrete, value-producing, pragmatic use cases, such as the diverse ML/AI systems powered by Metaflow at Netflix, rather than on shiny technology. This feet-on-the-ground, head-in-the-clouds attitude has proven especially effective in the noisy times of the latest AI boom. Ultimately, ML and AI are just means to an end.

Today, thousands of companies are building their ML/AI systems on Metaflow - including highly impactful real-world use cases such as recommendations at Amazon Prime Video, fraud detection at Ramp, and innovative GenAI applications at Autodesk. Increasingly, systems like these run on Outerbounds, which takes developing, scaling, and deploying ML and AI to the next level.

To the next level - this week!

This week, we will be releasing a number of improvements in Outerbounds, developed in close collaboration with our customers and the open-source community, enabling you to develop, scale, and deploy real-world ML and AI systems more effectively:

On Tuesday 9/17, we will highlight a number of improvements in the development experience.
On Wednesday 9/18, we will be at the PyTorch conference (come and say hi at booth S12!), showcasing the latest features for working with demanding ML/AI models.
On Thursday 9/19, we will share how you can deploy to production more confidently with a particular focus on the latest generation of LLM systems.
On Friday 9/20, we will host a webinar with NVIDIA, diving deeper into the best practices of using LLMs in production and NVIDIA NIM - sign up to reserve a spot!

Check out our blog daily this week!

The future is already here, it's just unevenly distributed

Although the features will be publicly released in the coming days, they are already driving the ML and AI-powered world.

It has been immensely gratifying to see the success that our customers have had developing, scaling, and deploying their ML/AI systems on Outerbounds. If you are curious, you can take a sneak peek behind the scenes and learn about

and many others.

Start building

AI and ML are eating the world. As with past paradigm shifts, it is best to stay calm and focus on fundamentals: people, a solid technical foundation, and real-world value creation. With this baseline, you can navigate whatever waves AI brings in the coming years.

We are happy to help, should you want to develop, scale, and deploy your ML/AI projects more effectively. Come back tomorrow for more details, or start building today!

PS. If building a human-friendly platform for serious ML/AI systems resonates with you, we are hiring!
