New Frontier
"Invention, it must be humbly admitted, does not consist in creating out of void, but out of chaos"
Companies like Netflix and Meta have effectively started building their own data systems in-house. What does that look like? A typical data system is a collection of both hardware and software that people in an organization can use to work with data. The layers of the system work together so that people can store, access, transform, and analyze data. Databases, data warehouses, and data lakes are all examples of data systems.
The dream is that the system just works seamlessly: there is infinitely scalable storage, execution is fast, and data can be queried without caveats about which programming languages or APIs are supported. As Benn Stancil put it: “bigger, faster, and polyglot.”
In “search of database nirvana”, the build-your-own-database (BYODB) movement was born: companies that could afford it set out to build all or a significant portion of the software components from scratch, engineering entirely bespoke data systems to serve their niche needs.
Although we cannot know the exact numbers, a fair estimate is that each of these companies has invested upwards of $100 million a year in the cause. And for most of them, building data systems is not their core business. From companies that do build databases as their core business, we know that building a new database is both hard and expensive.
Name | Money Raised |
---|---|
Snowflake | $2B |
Databricks | $3.5B |
MongoDB | $311M |
SingleStore | $464.1M |
Cockroach Labs (CockroachDB) | $633.1M |
Pingcap (TiDB) | $341.6M |
Elastic | $162M |
TimescaleDB | $181.1M |
MotherDuck | $47.5M |
Polars | $3.6M |
1.1 Why BYODB?
So, why are companies like Netflix building their own data systems in-house? As a group of engineers at Meta, Voltron Data, Databricks, and Sundeck note:
“Hundreds of database system offerings were developed in the last few decades and are today available in the industry…”
The fundamental reason most companies chose the BYODB path, even with hundreds of choices and even though the investment was so expensive, is that they believed there was nothing on the market to serve their needs. And at the time, they were right.
The decision to spend millions of dollars and years of development on a moonshot data infrastructure project cannot be easy for any company. To push a company over the edge, it was likely a combination of:
1. Micro trends (daily struggles that pushed them away from their current system), and
2. Macro trends (trends that pulled them toward making a change)
1.1.1 Micro Trends
There are three common push forces at play, all of which are tough for any off-the-shelf data system to deliver:
- Performance: Nobody likes a slow system. Speed matters. Compute time matters. Data transport time matters.
- Scale: Nobody likes running out of resources. Especially when they have to rewrite queries to scale a workload.
- Lock-in: Nobody likes feeling locked into their stack. Especially when they know that any change to the system means they’ll need to:
- Migrate all the data into the new system,
- Retrain all the data teams to use the new system, and
- Rewrite all the existing data workloads in a new user interface.
1.1.2 Macro Trends
So the micro trends made teams feel like they were stuck with their data systems. Meanwhile, these same teams were looking ahead, and could see the emergence of two macro trends on the horizon:
1. The AI Arms Race:1
Even before the rise of LLMs, there was a real need for “faster, more scalable, more cost effective machine learning.”2 In the early days of the BYODB movement, many companies may not have been able to pinpoint exactly how or why they would need to use AI or ML. But, for sure, no one wanted to be caught flat-footed in the “AI Arms Race.” The FOMO was real.
2. The rise of GPUs and other accelerated hardware:
Hardware has changed both in predictable and unpredictable ways since the start of the BYODB movement. What was once a field where CPUs reigned is now “a wild west of hardware”, where chips like GPUs, FPGAs, ASICs, TPUs, and other specialized hardware have been rapidly evolving. But, as Wes McKinney noted in 2018, “the tools we use have not scaled so gracefully to take advantage of modern hardware.”3 Developers leading the BYODB movement realized that while software was a gatekeeper to accelerated hardware, this was just temporary. In the future, software would provide a gateway to accelerated hardware, but only for those whose systems were positioned to win the hardware lottery. Just as with AI, no one wanted to hear the question “GPUs are here - are we ready?” and answer no. The FUD was real.
Trends in Hardware:
- Networking and storage have become much faster
  - Networking: 10 -> 40 -> 100 -> 400 Gbps in the data center
  - Storage: > 5,000 MB/s throughput, > 100,000 IOPS
- Single-core CPU performance topping out
- Memory bandwidth, core counts, and vectorization continue to improve
- Hardware accelerators (GPU, FPGA, ASIC) delivering better compute

For all these reasons and probably more, we now have a nice little cohort of companies who have been building their own data systems and lived to tell us about it.
1.2 Building a New Frontier
What does this new frontier of data systems look like? It actually looks a lot like your traditional data system.
“While modern specialized data systems may seem distinct at first, at the core, they are all composed of a similar set of logical components.”
A minimum viable data system can be broken down into three main layers:
- A user interface: Users interact with this in order to initiate operations on data. It is typically exposed as a language frontend or API.
- An execution engine: The engine performs operations on the data, as specified by users.
- Data storage: This is the layer that stores data that is available to users.

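To make these layers concrete, here is a minimal sketch in Python, assuming the duckdb and pyarrow packages are installed (the file name readings.parquet is invented for illustration): a SQL string is the user interface, DuckDB’s embedded engine is the execution engine, and a Parquet file on disk is the data storage.

```python
# A minimal sketch of the three layers (duckdb and pyarrow assumed installed;
# "readings.parquet" is a made-up file name for illustration).
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Data storage layer: a Parquet file on disk.
pq.write_table(
    pa.table({"city": ["Oslo", "Lima"], "temp_c": [3.5, 19.0]}),
    "readings.parquet",
)

# User interface layer: a SQL string written by the user.
sql = "SELECT city FROM 'readings.parquet' WHERE temp_c > 10"

# Execution engine layer: DuckDB's embedded engine runs the query.
print(duckdb.sql(sql).fetchall())  # [('Lima',)]
```

Swapping any one layer (say, a dataframe API instead of SQL, or object storage instead of a local file) leaves the other two untouched, which is the point of treating them as separate layers.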
There is a lot of room for innovation within and between these layers. In a bespoke data system, one or more of these layers will be developed as custom software. But that does not mean that the whole layer needs to be engineered from scratch. One thing the BYODB movement made painfully clear is that building most of these layers from the ground up is not the optimal use of everyone’s time and money.
Many talented engineers are developing the same thing over and over again, building systems that are essentially minor incremental improvements over the status quo.
“We have companies that have a lot of money writing the same database system software over and over.”
Luckily, once they started building their own data systems, teams inside those pioneering companies started realizing the power of two opposing yet complementary forces:
- Open source gives you more choices
- Standards help you make better choices
Understanding how to balance these two forces has laid the groundwork for a paradigm shift in how we think about modern data system design.
1.2.1 Open Source Gives You More Choices
Early on in the BYODB movement, teams realized that you do not have to work on the entirety of the problem. Instead, you can work on slivers of it.
This realization enabled a number of open source projects to crop up and start gaining traction. Many of these open source projects took the Unix philosophy to heart: make each project do one thing well.
“Open source means each problem only has to be solved once.”
- Naval Ravikant(@naval), co-founder and chairman of AngelList
As Andy Pavlo put it:
“You are starting to see various projects, in the open source community or organizations, to break out portions of a modern OLAP system into standalone components that are open source that other people can build on, other people can take advantage of, and other people could reuse for their own new OLAP systems. The idea here is that instead of everyone building the same thing, let’s everyone work together on one thing, and have that be really really good and everyone gets to reap the rewards.”
One example of this kind of open source project is Apache Calcite, a query parser and optimizer that can be embedded in an execution engine. The project’s website headline says it best: “The foundation for your next high-performance database.” This means a team does not have to build its own query optimizer from scratch, which is one of the hardest pieces of a data system to implement well.
“There is probably no single person in the world who fully understands all subtleties of the complex interplay of rewrite rules, approximate cost models, and search-space traversal heuristics that underlie the optimization of complex queries.”
Open source projects like Calcite gained momentum as everyone in the trenches of the BYODB movement came to the same realization around the same time. Here are just some examples:
- Apache Iceberg (originally written for Netflix)
- Apache Parquet (originally written for Twitter and Cloudera)
- Orca (originally written for Greenplum)
- Trino (originally written for Facebook)
For data systems developers, open source was a lightbulb moment. First, it meant that engineering teams were not spread so thin trying to innovate across the entire surface area of the system. Instead, they could focus on innovating in targeted, high-impact areas. Second, it expanded their choices. For example, when building an engine, they could leverage Calcite or Orca as the query optimizer.
“If you have an apple and I have an apple and we exchange apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”
- George Bernard Shaw
The downside of having more choices is that ultimately you do have to make one. An evergreen problem with open source is: how do you choose? How do you choose which user interface, query optimizer, or database connector to build with and depend on? That is where standards come in.
1.2.2 Standards Help You Make Better Choices
If open source is the yin, standards are the yang. You may not realize it, but standards keep all of us sane every day. These are just a few classic examples that you might not even realize are standards:
- ISO 8601 (International Organization for Standardization date and time formats)
- HTTP (Hypertext Transfer Protocol for web browsing)
- SMTP (Simple Mail Transfer Protocol for email transmission)
- RSS (Really Simple Syndication for syndicated content)
- SMS (Short Message Service for text messages)
What is a standard? The Open Data Institute offers this definition:
“Standards are documented, reusable agreements that solve a specific set of problems or meet clearly defined needs.
Standards are used when it’s important to be consistent, be able to repeat processes, make comparisons, or reach a shared understanding.”
In technical domains, standards solve interoperability problems. As the IEEE (Institute of Electrical and Electronics Engineers) defines it, interoperability is the “ability of a system or a product to work with other systems or products without special effort on the part of the customer. Interoperability is made possible by the implementation of standards.”
“Technical standards are awesome. Standards help teams save time and money by giving them a common language for how their products can interact with other products.”
In data systems, interoperability is an umbrella of problems that all boil down to how information moves through the different layers:
- Compute on data: This requires data structures to represent datasets in-memory while they are being processed.
- Query interoperability: This requires an intermediate representation for sharing query plans that are portable and not dependent on a specific database or SQL dialect.
- System interoperability: This requires serialization and data interchange interfaces (network wire protocols, database clients, etc.) for moving data.
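The first problem, compute on data, is the easiest to picture in code. The sketch below uses Apache Arrow as one example of a standard in-memory table representation (an illustrative choice, not the only option) and assumes the pyarrow, pandas, and polars packages are installed:

```python
# A minimal sketch (pyarrow, pandas, and polars assumed installed): one
# in-memory table, consumed by two libraries with no bespoke converters.
import pyarrow as pa
import polars as pl

# A columnar table in a standard in-memory format.
arrow_table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.9, 0.5]})

# Each consumer reads the same standard layout directly.
df_pandas = arrow_table.to_pandas()     # pandas DataFrame
df_polars = pl.from_arrow(arrow_table)  # polars DataFrame
print(df_polars.shape)  # (3, 2)
```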
Standards are the secret sauce for many developers who build data systems because most problems they need to solve are (a) common to the majority of data systems and (b) have an already agreed upon “best solution.”
“Data interoperability will always be a pain absent fast, efficient standards.”
1.3 Composing The Next Frontier
Where are we now? Should anyone BYODB today? Probably not.
“It is too expensive / time-consuming to build a DBMS from scratch.”
But what about the companies who have already taken the BYODB plunge - would they do it all over again, knowing what they know now? Also probably not.
As Wes McKinney put it, “building database systems is really hard.” While many of the early BYODB pioneers continue to reap the benefits of their bespoke data systems, they are also now keenly aware of the downsides. Here are just some of the challenges that teams at these companies now face:
“In databases, we are never happy.”
- Maintenance is forever: Now, each company owns the infrastructure that, in the best-case scenario, many internal business units depend on. Those that have built bespoke solutions from the ground up have made their systems that much more expensive as a result.
- Performance is unpredictable: Optimizing system performance can feel like a game of whack-a-mole. The second you push forward on one performance area, another one becomes the bottleneck.
- Change is constant: Every architecture decision has an expiration date. A healthy data system has an evolutionary architecture that supports constant change across every layer: user interfaces, engines, and data storage.
In search of data system nirvana, teams building their own data systems are closer, but not there yet. They are realizing that maybe the real data system nirvana was the open source projects and standards they met along the way.
1.3.1 A Composable Data System
Many lead developers at companies that have built their own systems are now advocating for a paradigm shift in data system design. The shift is to move away from building systems by coding first, and to instead start building by composing first.
Composability is a powerful concept.
“Because composability allows anyone in a network to take existing programs and adapt or build on top of them, it unlocks completely new use cases that don't exist in our world. In other words: composability is innovation.”
“Composability is to software as compounding interest is to finance.”
A composable data system is one that is designed by reusing available components. But one does not simply build a system from components that are just sitting around. Taking the lessons learned from the BYODB movement, a healthy composable data system balances the complementary forces of:
- open source components
- standards for data interoperability
“Considering the recent popularity of open source projects aimed at standardizing different aspects of the data stack, we advocate for a paradigm shift in how data management systems are designed.
We believe that by decomposing these into a modular stack of reusable components, development can be streamlined while creating a more consistent experience for users.”
1.3.2 Why Now?
The idea of modular, interoperable data systems is not new. Folks have been dancing around this topic for a while, starting with A RISC-style Database System in 2000, The Deconstructed Database in 2018, and most recently, “The Unbundling of OLAP”.
The problem has been that composable data systems were only accessible to a handful of elite organizations. Few experts are capable of designing and implementing composable systems, and much of the software created has been squirreled away inside closed-source commercial products.
“We foresee that composability is soon to cause another major disruption to how data management systems are designed.”
What is new is that the pieces are now in place to help mere mortals build composable data systems. Because of the BYODB movement, we have:
- A reliable ecosystem of open source projects that can pave the way for others to “stand on the shoulders of giants.”
- Standards to ensure interoperability between components.
Because of the BYODB movement, people can now spin up better, faster, more powerful data systems that would have taken years to build in the past. Because of the BYODB movement, composable systems are now what scientist Stuart Kauffman coined “the adjacent possible.”
“Innovative environments are better at helping their inhabitants explore the adjacent possible, because they expose a wide and diverse sample of spare parts, and they encourage novel ways of recombining those parts.”
- Steven Johnson, Where Good Ideas Come From: The Natural History of Innovation
1.3.3 Structure Of A Composable Data System
A composable data system has the same three core layers:
- A user interface
- An execution engine
- Data storage
You can think of these layers as the “do-ers” of a system. These are the tools we know and love, the logos you recognize, and the docs people spend hours combing through.
But in a composable system, the “gluers” are just as, or even more, important. The gluers are the core standards that glue the layers together:
- Intermediate representation: A standard format for representing query plans
- Connectivity: A standard client API for accessing databases
- Dataset memory layout: A standard format for representing data in memory

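To see two of these gluers working together, here is a minimal Python sketch using ADBC for connectivity and Apache Arrow for the dataset memory layout. These are named here purely as examples of such standards, and the sketch assumes the adbc-driver-sqlite and pyarrow packages are installed:

```python
# A minimal sketch (adbc-driver-sqlite and pyarrow assumed installed).
# Connectivity gluer: ADBC, a standard client API for accessing databases.
# Memory layout gluer: Arrow, a standard format for data in memory.
import adbc_driver_sqlite.dbapi

with adbc_driver_sqlite.dbapi.connect() as conn:  # in-memory SQLite via ADBC
    with conn.cursor() as cur:
        cur.execute("SELECT 1 AS answer")
        table = cur.fetch_arrow_table()  # results arrive as an Arrow table
print(table)
```

Swap the SQLite driver for any other ADBC driver and the rest of the code stays the same; that substitutability is exactly what the gluers buy you.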
Composable data systems are not just possible but also practical today due to the emergence of open standards in the data analytics space. Standards serve as the glue that is needed to bridge the gaps between the user interface, execution engine, and data storage layers.
In later chapters in this Codex, we will cover each of these standards in depth.
1.3.4 Benefits Of Composability
Here are some of the benefits shared by the engineering team at Meta, one of the companies at the leading edge of the composable movement:
“Over the last three years, we have implemented a generational leap in the data infrastructure landscape at Meta through the Shared Foundations effort. The result has been:
- a more modern, composable and consistent stack,
- with fewer components, richer features, consistent interfaces, and
- better performance for the users of our stack, particularly, machine learning and analytics.
We have deprecated several large systems and removed hundreds of thousands of lines of code, improving engineering velocity and decreasing operational burden.”4
In The Composable Data Management System Manifesto (2023), a group of engineers at Meta, Voltron Data, Databricks, and Sundeck outlined the following benefits:
- Faster and more productive engineering teams
- Faster innovation cycles
- Co-evolution
- Better user experience
1.4 Standards Are The Glue
If standards are the glue that holds the components together in a composable data system, what happens without standards? How do the components in a system get wired together without standards? What do standards replace?
Generally, there are two groups of tools that can be used to help build interoperability into data systems:
- Standards
- Purpose-built adaptors, connectors, and protocols
Without standards as glue, your developers are the glue in your data system. Standards replace developer time spent solving problems that, in 95% of use cases, have already been solved.
Here are some of the types of purpose-built glue code that developers end up spending valuable time building and maintaining (one of them is sketched after the list):
- Custom data serialization/deserialization interfaces between components
- Custom data transformation processes to convert from row-based to columnar formats and vice versa
- Custom connectivity interfaces for accessing data
- Custom transport protocols for moving data around
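Take the second item on that list as an example. The sketch below contrasts a hand-rolled row-to-columnar conversion with the one-call equivalent using a standard columnar format (pyarrow is assumed installed; the records are invented for illustration):

```python
# A hedged sketch: hand-rolled glue vs. a standard columnar format
# (pyarrow assumed installed; the records are made up for illustration).
import pyarrow as pa

rows = [{"id": 1, "score": 0.1}, {"id": 2, "score": 0.9}]

# Hand-rolled glue: transpose row-oriented records into columns,
# one bespoke loop per pair of systems that needs to talk.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Standard glue: one library call, and the result is readable by
# anything that understands the same columnar format.
table = pa.Table.from_pylist(rows)
print(table.column_names)  # ['id', 'score']
```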
No one wants their core engineers working on one-off glue code projects, but this is the way systems are built today. Without standards, data systems developers will pay a steep penalty both in computational cost and development time to force components to interoperate with each other.
While “which user interface should we go with” or “which engine should we support” debates tend to get more air time, the real complexity and cost of a data system lie in the decisions about the layers of glue between the modules. In order to make a working data system, something is needed to glue the modules together to avoid daunting complexity.
“There is a need for ‘universal glue’ as well to enable components to cooperate and build value-added services.”
Of course, someone has to write the code to glue the layers together so they can connect, communicate, and exchange data with each other. But that person does not have to be you.
Nevertheless, you might be skeptical about standards (cue the requisite xkcd cartoon).
Standards can sometimes get a bad rap, and a dose of skepticism is healthy here. The key to wielding standards is to see them as a simple equation: rules + community. The first part, the rules, is easy. Anyone can just write up a set of rules, slap the label “standard” on it, and call it a day.
The second part, the community, is much harder. Standards need to fill a void, solve an important problem, and gain vocal advocates who see the value of standardization in that domain. Gaining traction and adoption in a community takes a lot of work. In fact, starting a standard is a lot like starting a movement.
“The first follower transforms a lone nut into a leader. If the leader is the flint, the first follower is the spark that makes the fire.
The second follower is a turning point: it’s proof the first has done well. Now it’s not a lone nut, and it’s not two nuts. Three is a crowd and a crowd is news…
Now here come two more, then three more. Now we’ve got momentum. This is the tipping point! Now we’ve got a movement!”
When shopping for standards, you are shopping for a movement to join. You probably don’t want to be the leader or the first follower. You might join a small crowd for your niche needs. But there are at least five important factors to consider:
- Open – Is the standard open to be inspected by self-selecting members of the community? Does it have an open source license (see https://choosealicense.com/)?
- Community – Is there an active community that contributes to the standard?
- Governance – Does the standard have established governance? Is the group made up of people from more than one company?
- Adoption – Are people actually using the standard? Is there a list of organizations who are on the record about adopting the standard?
- Ecosystem – Are there software projects that build on top of and extend the standard?
1.4.1 Composable Standards
Composable standards are documented, reusable agreements that make it possible for components within a data system to connect seamlessly, without special work or effort on the part of the data system developers or users. In a data system, “connect” means at least three actions:
- communicate messages
- exchange data
- operate on data
Composable standards are used when it is important to be able to rely on the format of the data, messages, or instructions that are sent and received between components.
As many early BYODB pioneers found, it turns out to be quite difficult to build a system of components without composable standards for interoperability. More specifically, there are two levels of interoperability that matter for the design of data systems:
Levels of Interoperability | Description |
---|---|
Structural | Defines the syntax of the message exchange. This is the level where file formats, data formats, and communication protocols play the most important role. |
Semantic | Defines a shared vocabulary adopted by both sides of the exchange as a “common information exchange reference model.”5 |
A composable standard needs to go beyond structural interoperability. A composable standard needs to enable semantic interoperability, with data as the API. Adopting composable standards means that you do not need to think about converting your data into some intermediate interface with its own API so that other applications can access it.
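As a closing sketch of the difference between the two levels: the schema below is structurally valid either way, but it only becomes semantically interoperable when both sides share a vocabulary for what the fields mean, carried on the data itself. The "unit" metadata key is a made-up convention for illustration, and pyarrow is assumed installed:

```python
# A minimal sketch (pyarrow assumed installed; the "unit" metadata key is
# a made-up convention for illustration, not part of any standard).
import pyarrow as pa

# Structural interoperability: both sides agree on the syntax of the
# exchange, i.e. a typed columnar schema.
# Semantic interoperability: both sides also agree on meaning, carried
# here on the data itself rather than in a separate API.
schema = pa.schema([pa.field("temp", pa.float64(), metadata={"unit": "celsius"})])
table = pa.table({"temp": [3.5, 19.0]}, schema=schema)
print(table.schema.field("temp").metadata)  # {b'unit': b'celsius'}
```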