The Open-Source Modern Data Stack featuring Apache Iceberg
We recently had a fireside chat with Jason Reid, a former Director of Data Engineering at Netflix and co-founder of Tabular.io, about building a unified, enterprise-grade data platform for diverse workloads. We talked about data warehouses and lakes, ETL, and ML using open-source components, Apache Iceberg in particular.
We covered a lot of ground, so we wanted to share some highlights here. The following are rough transcripts, edited for legibility. We’ve included clips of the relevant parts of the conversation and, if you’re interested, you can watch the entire conversation here:
Single sources of truth for data and OSS tool chains vs proprietary platforms?
If you go with the open-source route, you are always going to have more choices at your disposal, such as bleeding-edge generative AI tools, and less lock-in compared to a vertical proprietary stack.
HBA: How do you think about having a single source of truth for all your data needs?
JR: Historically that hasn’t really been technically feasible, except for large companies like Netflix and Apple, which could invest a lot of engineering effort into building these single-source-of-truth systems. As open-source projects mature and commercial products emerge around them, that architecture becomes available to the vast majority of companies. Ultimately, it is a very powerful setup.
Netflix open-sourced Iceberg because they didn't want to be on the hook internally for keeping all the integrations, like Spark and Trino, up to date, or for integrating with other tools like Druid. It was much better to share it with the community and get the whole community involved in building and maintaining new connectivity. It does take a village to keep everything going in the right direction.
HBA: Why and when is a tool chain of open-source tools a more appropriate choice for your data stack than a single proprietary platform?
JR: Historically, the choice was a bit more straightforward. You were either fully on a proprietary platform like Oracle or Teradata: they owned all of your compute and all of your storage, and you could only do what those systems allowed you to do.
Or, you could go open-source, make it a choose-your-own-adventure, and stitch technologies together yourself, which required a lot of engineering resources. It was expensive, but you gained a lot of power, flexibility, and better long-term cost mechanics because you weren't tied to expensive licensing fees.
As an example, Netflix was a big Teradata shop early on. It was incredibly expensive and could really only do SQL analytics, but we wanted to do more. We moved to an open-source stack, which was a painful engineering process.
HBA: How has the open-source landscape changed since?
JR: Now we're in a world where even proprietary systems offer more optionality and more integration. You can take something like Snowflake, which actually has support for Iceberg tables.
In today’s hybrid world, we can use a combination of open-source and proprietary solutions, so there’s more optionality. If you go with the open-source route, you are always going to have more choices at your disposal and less lock-in compared to a vertical proprietary stack. The question then becomes: is the optionality that you gain worth it?
On the one hand, you can have an easy-to-use, integrated platform where everything works relatively well together, but you're limited in your choices and tied to their cost model. Or, you can go the open route and have multiple tools, more optionality, and more control over costs, but with the challenges that come with integrating multiple tools. There are trade-offs to be made for sure.
You can make a choice today, but what about next year when the next new tool comes along? How quickly can you integrate that? Are you going to fall behind competitors because they're able to integrate a new technology? For instance, they start using generative AI tools while you can't, because you're still waiting for your vendor to provide something in that space.
I think that optionality is also about speed, agility, and time to market. Companies like Tabular and Outerbounds, which are built on open-source tooling, are making this option more compelling.
Data pipelines, security, and regulatory requirements
It goes back to having really good lineage and understanding how these data products are built and how they're connected across all these things.
HBA: How can you make sure that your data and pipelines meet security and regulatory requirements?
JR: This is a bit of a mindset shift architecturally. In a world where your compute and your storage were bundled together, you could secure that bundle. Things got really messy when we started collecting data in these open formats.
How do you do security effectively, especially when you've got multiple different ways to access the data? We're now in an architecture in which we all want a single source of truth that can be read from multiple different tools. How do we lock that down, or even know who has access to what? How do we audit who's accessing the data? It's really difficult stuff.
Now, as we move to these distributed data architectures, you start to move the security layer down to the physical storage, and the compute layers have to come to the store with some form of authentication and authorization: "Hey, I'm Jason, here's my authorization that says I should have read access to this data, can you please let me read it?" Whether I'm coming from a Python process, a Spark job, or a Trino query, that same exchange has to happen.
So I think that's where we need to move if we're going to be successful with this architecture. It's definitely a challenge, and it's definitely a shift, but it gets us to security, and then we can audit it: because all the security is happening at the storage layer, we can go back and get a nice, clean audit trail. Then there are the secondary concerns… What about GDPR and those kinds of things, when customers say, "Hey, forget that you have data on me"? That's yet another suite of capabilities that we need to build out in this architecture. We're still in the early stages of doing that kind of work.
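To make the exchange Jason describes a little more concrete, here is a minimal sketch, using PyIceberg and an Iceberg REST catalog, of a compute client presenting credentials before touching any data. The endpoint, warehouse, table name, and credential below are hypothetical placeholders, and this is one possible setup rather than a definitive one; the point is that the same authentication and authorization handshake happens whether the caller is a Python process, a Spark job, or a Trino query.

```python
# A minimal sketch, not a definitive setup: a Python client authenticating to a
# hypothetical Iceberg REST catalog before reading a table.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "prod",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",  # hypothetical catalog endpoint
        "warehouse": "analytics",              # hypothetical warehouse name
        "credential": "jason:REDACTED",        # identity presented for authorization
    },
)

# The catalog authorizes (or rejects) this principal's access to the table;
# the same check applies no matter which engine makes the request.
table = catalog.load_table("web.page_views")
rows = table.scan().to_arrow()
print(rows.num_rows)
```

Because every engine goes through the catalog, the access decision and the audit trail live in one place rather than being re-implemented per tool.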
Model governance, LLMs, and extraction attacks!
HBA: How do we even think about this when we have data but also feature stores and metric stores and all of these different types of things?
JR: I think those are great questions. This is where the regulators are always going to be behind the technology. What if you take all of my search history, turn it into some embedding, use it in a model, and then I tell you, hey, you need to forget about me?
Okay, you can drop the records, but you've already used all that data to build your model. The same question is up for grabs with these large language models that have been built on a corpus of text: did people give permission for that to be used in the model? This is uncharted territory!
Again, it goes back to having really good lineage and understanding how these data products are built and how they're connected across all these things.
HBA: I'm glad you mentioned models that are trained on data as well. There's a whole fascinating area of research on extraction attacks: taking models and being able to extract their training data from them, right?
JR: It is, and it's a new vector of attack. As good architects, we should try and anticipate these future requirements. And technical architecture is a lot about having two-way doors and optionality, right? Don't design yourself into corners that you can't get out of, or there will be big migration efforts, you know, big tech debt holes that you've created for yourself or your company down the line because you didn't think about possible outcomes.
Iceberg and Metaflow for principled, robust, and reproducible data science and ML
As a data scientist, I want to focus on building models or doing data science. I don't want to have to be an expert on Parquet or ORC file formats, or deal with partitions manually.
HBA: Tell me more about why data scientists would want to use Apache Iceberg for working with data.
JR: Perfect, this is where I can hopefully make some connections with Metaflow and Outerbounds. If you're a data scientist, why should you care about Iceberg? Let’s say I have a bunch of Parquet files on S3; why would Iceberg tables be preferable? Here’s a summary:
- Schema: You get to address a data set through its schema, not as a pile of files. You don't have to be concerned with the intricacies of Parquet or ORC or any of these formats, or deal with partitions manually.
- Persist results: You can write results back to Iceberg tables, which makes it that much easier for anybody else in the ecosystem to take advantage of the results you produced.
- Constantly updating data: Imagine the data updates constantly. You don’t want to be working with stale Parquet files; let's talk about tables instead of files. That's a big win for interoperability, for yourself and for your ecosystem.
- Reproducibility: This is where Iceberg is really fantastic for data science use cases, because every change made to an Iceberg table results in an immutable snapshot of that data set. As a data scientist, I can say: I trained this model against these very specific versions of my underlying source data, and I produced this specific version of output (see the sketch after this list).
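Here is a minimal sketch of what the reproducibility point can look like in code, using PyIceberg; the catalog and table names are hypothetical placeholders, and this is one possible workflow rather than the only way to do it. The idea is to pin a read to a specific snapshot ID and record that ID alongside the trained model, so the exact inputs can be reloaded later.

```python
# A minimal reproducibility sketch with PyIceberg; the catalog and table names
# are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("analytics")          # assumes a configured catalog named "analytics"
table = catalog.load_table("ml.training_events")

# Every commit to an Iceberg table produces an immutable snapshot.
# Record the snapshot ID so the training data can be reproduced exactly.
# (Assumes the table already has at least one snapshot.)
snapshot_id = table.current_snapshot().snapshot_id

# Pin the scan to that snapshot: even if new data lands later,
# this read always returns the same rows.
training_data = table.scan(snapshot_id=snapshot_id).to_arrow()

print(f"Training against snapshot {snapshot_id} with {training_data.num_rows} rows")
```

In a Metaflow flow, that snapshot ID could simply be stored as an artifact (for example, self.snapshot_id = snapshot_id in a step), tying each model run to the exact version of the table it was trained on.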
Join us for more chats!
After our fireside chats, we hold AMAs with our guests on the Outerbounds community Slack. A lot tends to happen there, so join us if you’re interested in such conversations!
You can join us on Slack for more conversations like this, and you can view the other fireside chats here.
You can also sign up for our next fireside chat, Kubernetes for Data Scientists, here!
Start building today
Join our office hours for a live demo! Whether you're curious about Outerbounds or have specific questions, nothing is off limits.