Open standards over silos
The second half of the battle is learning how to use standards to build towards a composable data system. Developing with standards is a lot like using anchors while rock climbing: they help you chart a course to your final destination, and offer protection when changes need to happen in the future.
"The early bird gets the worm, but the second mouse gets the cheese."
Willie Nelson
1.0 The quiet power of open standards
It can be hard to be a champion for standards. Standards have zero marketing budget. Standards are also not the kind of technology that gets most people excited: the excitement often comes later from seeing software that is built on top of the standards.
But that may be exactly the charm of standards. And in the history of revolutionary technology that has changed our world, standards have actually played a quiet starring role. As one example, open standards were key to the early success of the World Wide Web:
It was the standardization around HTML that allowed the web to take off. It was not only the fact that it is standard but the fact that it is open and royalty-free. If HTML had not been free, if it had been proprietary technology, then there would have been the business of actually selling HTML and the competing JTML, LTML, MTML products… Yes, we need standards, because the money, the excitement is not competing over the technology at that level. The excitement is in the businesses and the applications that you built on top.
Tim Berners-Lee, inventor of the World Wide Web
Standards are also critical to how data systems work, but they do not always get the recognition they deserve in the “modern data stack.”
For engineers, celebrating the adoption of a new standard may feel like something you throw in an internal Slack message, rather than share with the rest of the world. It is hard to convey just how good it feels to remove thousands of lines of glue code, and replace it with a few API calls to a standard library. We want to briefly honor that feeling by highlighting two success stories of standards shared by developers.
In 2021, the Streamlit development team wrote about a big shift in their infrastructure to Arrow, a standard we will dive into later in this chapter:
In our legacy serialization format, as DataFrame size grew, the time to serialize also increased significantly… Just compare the performance of our legacy format vs Arrow. It's not even funny!
This one is mostly a benefit for us, Streamlit devs: we get to delete over 1k lines of code from our codebase. You can't believe how good this feels 😃
We actually can believe it 🤠
Similarly, a team of engineers at Meta, many of whom have been leading the composable data systems movement, wrote in their article “Shared Foundations: Modernizing Meta’s Data Lakehouse”:
Over the last three years, we have implemented a generational leap in the data infrastructure landscape at Meta through the Shared Foundations effort. The result has been:
- a more modern, composable, and consistent stack,
- with fewer components, richer features, consistent interfaces, and
- better performance for the users of our stack, particularly, machine learning and analytics.
We have deprecated several large systems and removed hundreds of thousands of lines of code, improving engineering velocity and decreasing operational burden.
When should you start to search for standards? Here is a good rule of thumb:
When things don't work as they should, it often means that standards are absent.
So, saving developers from writing and maintaining thousands of lines of glue code is great, but what is in it for the company? In The Composable Data Management System Manifesto (2023), a group of engineers at Meta, Voltron Data, Databricks, and Sundeck outlined the following benefits of developing on top of an open standards ecosystem:
- Faster and more productive engineering teams – Less duplicated work means more time for innovation.
- Tighter innovation cycles – Targeted feature development on a smaller code base means faster releases.
- Co-evolution of database software and hardware – Unifying the core layers means better performance and scalability.
- Better user experience – More consistent interfaces and semantics means a smoother user experience.
With all these gains, why do standards fly under the radar? We suspect it is partially because these gains happen invisibly - no new algorithm, no shiny new product, no press release, just thousands of lines of glue code deleted and developers who can sleep easier at night. Because of this, a good standard can be hard to find*. In the rest of this chapter, we will surface the core standards that we trust and build composable data systems with.
*If you are already sold on standards and want to find a golden one fast, skip ahead to 1.2: Standardizing with Arrow.
1.1 Shifting to open standards
In Chapter 00 “A New Frontier”, we shared a starter definition of standards. To adapt to the world of data systems, we arrive at a new definition:
Standards are documented, reusable agreements that make it possible for components within a data system to connect, without special work or effort on the part of the data system developers or users. In a data system, "connect" means two specific actions:
- Exchange data
- Communicate messages or instructions
Standards are used when it is important to be able to rely on the format of the data, messages, or instructions that are sent and received between components.
To support a minimum viable composable data system, three core standards are needed:
- An intermediate representation (IR) standard for describing compute operations
- A connectivity standard for communicating with databases and engines
- A data memory layout standard for representing data in memory
While “which user interface should we go with?” or “which engine should we support?” debates tend to get more air time, the real complexity and cost of a data system piles up when you sleep on standards and put off worrying about interoperability.
1.1.1 Hacking data interoperability
At a very high level, there are two groups of tools that can be used to help build interoperability into data systems:
- Standards
- Glue code like "purpose-built" adaptors, connectors, and protocols
What is glue code?
"Executable code that serves solely to ‘adapt’ different parts of code that would otherwise be incompatible. Glue code does not contribute any functionality towards meeting program requirements."
Glue code may not sound all that bad. But as soon as you add more than one component to the mix, these two tools turn into two very different system-scapes:
- A structured ecosystem of standards and interoperable components
- A spaghetti architecture of adaptors, connectors, and protocols that carry a combinatorial explosion of maintenance
Source: xkcd.com
Here are some of the types of glue code that developers end up spending valuable time building and maintaining, versus building with a standard that bakes in all the good design decisions made by an expert community:
Glue code solution | Standards-based alternative |
---|---|
Custom data serialization/deserialization interfaces between components | Use Arrow for zero-copy reads for data access in shared memory |
Custom data transformation processes to convert from row-based to columnar formats and vice versa | Use Arrow standard columnar data format |
Custom transport protocols for moving data over a network | Use Arrow Flight as a standard to efficiently move data in Arrow format over the wire |
Custom interfaces to databases to establish connections, execute queries, and process results | Use Arrow Database Connectivity (ADBC) protocol to interface with databases and fetch result sets in Arrow format |
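To make the last row concrete, here is a minimal sketch of the ADBC route in Python. It assumes the adbc-driver-sqlite and pyarrow packages are installed, and that connect() with no arguments opens an in-memory database; the table and query are throwaway examples.

```python
# Minimal ADBC sketch: execute SQL and fetch the result as an Arrow table.
# Assumes the adbc-driver-sqlite package; connect() with no arguments is
# assumed to open an in-memory SQLite database.
import adbc_driver_sqlite.dbapi as adbc_sqlite

conn = adbc_sqlite.connect()
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.execute("INSERT INTO t VALUES (1), (2), (3)")
cur.execute("SELECT x FROM t")

table = cur.fetch_arrow_table()  # a pyarrow.Table, not a driver-specific row format
print(table)

cur.close()
conn.close()
```

The same DBAPI-style calls work against other ADBC drivers (PostgreSQL, Snowflake, and so on), which is the point: the interface and the result format stay the same while the database changes.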
Who wants talented engineers working on one-off glue code projects? The answer is that no one wants that, and no one plans for that to happen. And yet, when systems are not designed on top of standards, this is what inevitably will happen.
There is a need for 'universal glue' as well to enable components to cooperate and build value-added services.
Without standards, developers become responsible for the glue in a data system. Without standards, developers pay a steep penalty in both computational cost and development time to force components to interoperate with each other. Standards hand that time back by solving problems that, in many use cases, have already been solved.
1.1.2 Seeking data standards that stick
Of course, someone has to write the code that glues the system layers together. But that person does not have to be you. This is where tried-and-true standards come in, and why engineers are happy to merge standards into their systems. Nevertheless, you might be skeptical about standards (cue the requisite xkcd cartoon).
Standards can sometimes get a bad rap, and a dose of skepticism is healthy here.
For me, I like fundamental ideas. I like best practices. There’s a saying I like: "Innovate where you can. Where you can’t, use the industry standards." For SMEs, instead of chasing fancy new things, I think they should choose less fancy but more stable solutions.
The key to understanding standards is to see them as a simple equation: rules + community. The first part, the rules, seems easy enough. Anyone can just hammer out some software specs, slap the label “standard” on it, and call it a day. But the rules need to be actually good. A good standard has rules that are:
- Documented
- Consistent
- Reusable
- Stable
The second part, the community, is even harder. Standards need to solve an important problem, and gain vocal, visible advocates who see the value of standardization in that domain. Gaining traction and adoption in a community takes a lot of work. In fact, starting a standard is a lot like starting a movement.
Here is what it might look like to start a standards movement:
The first follower transforms a lone nut into a leader. If the leader is the flint, the first follower is the spark that makes the fire. The second follower is a turning point: it's proof the first has done well. Now it's not a lone nut, and it's not two nuts. Three is a crowd and a crowd is news… Now here come two more, then three more. Now we've got momentum. This is the tipping point! Now we've got a movement!
When shopping for standards, you are shopping for a movement to join. You probably don’t want to be the leader and maybe not even the first follower. You might join a small crowd for your niche needs. But there are at least five important factors to consider:
- Open – Is the standard open to be inspected by self-selecting members of the community? Does it have an open source license (see https://choosealicense.com/)?
- Community – Is there an active community that contributes to the standard? Has the standard evolved since it was created? Can it adapt when the world changes?
- Governance – Does the standard have established governance? Is the group made up of people from more than one company?
- Adoption – Are people actually using the standard? Is there a list of organizations that are on the record about adopting the standard?
- Ecosystem – Is there an ecosystem of software projects that build on top of and extend the standard?
1.2 Targeting open standards: Arrow hits the mark
Arrow is an open source project that enables developers to build fast, interoperable data systems based on open standards. The Arrow project ticks all the boxes for a solid standard: it is openly licensed, stewarded by an active, multi-company community under Apache Software Foundation governance, widely adopted across the industry, and surrounded by a growing ecosystem of projects that build on top of it.
1.2.1 The Arrow format
Arrow started as a standardized in-memory format for structured tabular data. Why start there? Because when you are building data-intensive analyses and applications, systems get stuck on two main tasks:
- Moving data: When a workload is transport-bound (or input/output [I/O]-bound), the speed of execution depends on the rate of transfer of data into or out of the system.
- Processing data: When a workload is compute-bound, the speed of execution depends on the speed of the processor, whether it is a CPU, GPU, or another type of hardware.
The Arrow format was designed to accelerate both of these processes. In a data system, the way data is arranged for processing can make a big difference, especially with the way modern processors, like CPUs and GPUs, work. The Arrow data format design improves the performance of processing tabular data on CPUs and accelerated hardware.
It's how you arrange the data in memory for processing. It's how it fits in your computer's RAM. You can also put it on disk and load it into memory without having to do any conversions or deserialization, which is a very helpful feature in building systems.
Wes McKinney, The Data Analytics Roundup
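Here is a minimal sketch of the "load it into memory without conversions" idea from the quote, using pyarrow's IPC file format; the file path is just an illustration.

```python
# Sketch: write a table to an Arrow IPC file, then memory-map it back.
# The memory-mapped read does not copy or deserialize the data.
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

with pa.OSFile("example.arrow", "wb") as sink:      # illustrative path
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

with pa.memory_map("example.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()       # no conversion step

print(loaded.num_rows)
```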
Processors are most efficient when data is laid out in a columnar format in memory. Why is columnar better?
What | How | Why |
---|---|---|
Better I/O | Reading and writing only the columns that are needed for a particular query | Each column can be stored separately, so the processor only needs to read and write the columns that are needed |
Lower memory usage | Storing only the values for each column, rather than the entire row | Each column can be stored separately, so the processor only needs to store the columns that are needed |
Significantly faster computation | Allowing processors to process data in parallel | Multiple elements in a column (vector) can be processed simultaneously through vectorized execution, taking advantage of multi-core processors |
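As a small illustration of the table above, here is a pyarrow sketch in which only the columns a calculation touches are accessed, and the arithmetic runs as vectorized kernels over whole columns (the column names are made up).

```python
# Sketch: columnar access plus vectorized compute with pyarrow.
# Only the columns used below are touched; kernels operate on whole columns.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({
    "price": [10.0, 12.5, 7.25, 9.0],
    "quantity": [3.0, 1.0, 4.0, 2.0],
    "notes": ["a", "b", "c", "d"],  # never read by the computation below
})

revenue = pc.multiply(table["price"], table["quantity"])  # vectorized kernel
total = pc.sum(revenue)

print(total.as_py())
```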
While performance was key to drawing Arrow’s first followers, the Arrow movement gained momentum when developers realized that sharing the same data format could rescue them from their spaghetti junction systems:
- The Arrow format is the same across libraries, so you can share data between processes without copying it.
- It is also the same format on the wire, so you can pass data around the network without the costs of serialization and deserialization.
It doesn't matter if it’s in your process or in my process, we have the exact same data representation. We can build primitives together. We can ferry the data over each other via IPC, and there is no serialization/deserialization.
Felipe Aramburú, speaking at the Carnegie Mellon University Database Group (2018)
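As a sketch of what sharing one representation looks like from Python, here is the same Arrow table handed between libraries without re-serializing it. It assumes pandas, polars, and pyarrow are installed; depending on the data types, some of these hand-offs may still copy, but none of them re-encode the data.

```python
# Sketch: pass the same Arrow-formatted data between libraries.
# Assumes pandas, polars, and pyarrow are installed.
import pandas as pd
import polars as pl
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

arrow_table = pa.Table.from_pandas(df)  # into the Arrow format
pl_df = pl.from_arrow(arrow_table)      # Polars consumes Arrow directly
back = arrow_table.to_pandas()          # and back to pandas

print(pl_df.shape, back.shape)
```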
The first wave of avid Arrow developers drove adoption of the standard to where we are today:
As the industry standardizes on Arrow for in-memory data representation, the challenge of how data is shared across these new platforms is solved.
Apache Arrow has recently become the de-facto standard for columnar in-memory data analytics, and is already adopted by a plethora of open-source projects.
I think the Apache Arrow umbrella of projects represents the common API around which current and future big data, OLAP, and data warehousing projects will collaborate and innovate.
1.2.2 Composable systems with Arrow
Next, developers started composing data systems around Arrow. But since Arrow-formatted data was fast to compute on, the bottleneck shifted to getting the data into system memory in the first place. The struggle was that data tended to be stored elsewhere, and in modern data systems that increasingly meant cloud storage. This left two major points of friction in a typical data workflow:
- From storage to system memory: Getting the data into the Arrow format for in-memory computation often meant that developers needed to write glue code to convert or preserve the Arrow format up and down the layers of the stack.
- From system A memory to system B memory: Because most pipelines involved moving data through the layers of the stack via multiple processes or systems, engineers were writing glue code to de- and re-format the data so that each system could operate on it.
These composability needs propelled the development of the composable Arrow ecosystem, which now provides a suite of tools to move data fast between the layers in a typical data system because the data stays in the Arrow format.
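For the first friction point, here is a hedged sketch of what "storage to system memory" looks like with the pyarrow dataset API: Parquet files are scanned straight into Arrow memory with no intermediate format. The path and column names are illustrative, and a cloud filesystem (for example, pyarrow.fs.S3FileSystem) could be passed via the filesystem argument.

```python
# Sketch: scan Parquet files from storage directly into Arrow memory.
# The local path and column names are illustrative; a cloud filesystem
# object can be passed via the `filesystem=` argument.
import pyarrow.dataset as ds

dataset = ds.dataset("data/events/", format="parquet")
table = dataset.to_table(columns=["user_id", "amount"])  # reads only these columns

print(table.schema)
```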
I'm going to continue to be the ultimate radical, however, and declare that the approach that we're taking today in terms of machine learning is still roughly the approach of the internal combustion engine in the automobile. The approach that's happening where Arrow ties together those predictive systems with declarative databases, that's really the creation of the hybrid, or the Prius era.
Developers can build their Arrow hybrid cars in one of two ways:
- Augment data systems: developers can layer Arrow into existing systems by developing with open source Arrow standards & components individually.
- Compose Arrow-native data systems: developers who are starting from scratch can build complete systems with Arrow standards & components at their core.
As George Fraser, CEO of Fivetran, noted:
Arrow is the most important thing happening in the data ecosystem right now. It's going to allow you to run your choice of execution engine, on top of your choice of data store, as though they are designed to work together. It will mostly be invisible to users… As Arrow spreads across the ecosystem, users are going to start discovering that they can store data in one system and query it in another, at full speed, and it's going to be amazing.
1.2.3 A composable ecosystem
As Arrow grew, so did the composable ecosystem of standards. Used together, these standards fulfill the main requirements for building composable data systems.
As we will see in the coming chapters in The Codex, these three core standards act as the backbone of a composable data system:
↓ | Type of standard | Standard |
---|---|---|
A | Intermediate representation | Substrait allows any user interface that produces Substrait to pass its compute operations to a Substrait-consuming execution engine. You can swap in any Substrait-compatible user interface or execution engine. |
B | Connectivity | Arrow Database Connectivity (ADBC) ensures that, no matter where the computation is performed, the data is returned in the Arrow format. You can swap your execution engine and know that your downstream code will still work. |
C | Data memory layout | The Arrow in-memory data format ensures that data can pass from storage to the engine (and even across systems in a distributed environment) and back to the user without slowing down to serialize and deserialize. |
1.3 The data systems hierarchy of needs
Change is hard, and not just technically. Many organizations we work with have accumulated a lot of scar tissue around making changes to their data infrastructure. These initiatives take time, grit, and resources, and still, even when those stars align, the actual execution can get stuck in organizational quicksand. System changes can feel risky technically and politically. Paraphrasing one engineer: “People and politics are the bottleneck. Not performance.”
We think of shifting to a composable system as climbing the data system hierarchy of needs pyramid (inspired by the AI hierarchy of needs by Monica Rogati). To climb the pyramid, all you need is MICE: modular, interoperable, customizable, and extensible components.
While these concepts might be canon for software engineers, the edges between them can get blurry when the focus turns to data systems. Organizing them as a hierarchy means:
- Each level in the pyramid forms a strong foundation for leveling up
- Each level opens up new possibilities that were out of reach before
What new possibilities open up at each level? From the bottom to top:
Need | Description | Benefit |
---|---|---|
Modular | Components can be added, removed, or replaced without affecting the rest of the system. | Changing a component has no or minimal impact on other components. |
Interoperable | Components can communicate with each other to exchange and make use of information, using well-defined interfaces. | Connect components from different systems. |
Customizable | Components can be combined in unique patterns to satisfy specific needs. | Tailor a system to specific needs. |
Extensible | Components can be easily modified to add new features or functionality. | Keep a system up-to-date with changing requirements. |
1.3.1 Modular
A module is a “chunk of reusable software” that does one thing well. The neat thing about a composable data system is you get to pick your own modules. The annoying thing about a composable data system is you get to pick your own modules.
A perfect module is a self-contained software component with a well-defined interface that makes it easy to plug into the rest of the system. But no module is perfect. A good first step is to take stock of the modules in your existing system: which ones you have in place and which ones you want to augment or replace. Part of this process is a quick “smell test” of the modules you have. Here are some questions you can ask of your modules:
- Are you focused? - Are you trying to do too many things? Are you making it difficult for other modules to do their jobs?
- Are you independent? - A dependent module will be greedy. It will limit your choices for modules in layers above and below it, and create a silo for users. An independent module, on the other hand, usually has a documented list or matrix to demonstrate the range of choices you are allowed. For example, the documentation for the Ibis dataframe framework in Python features a backend support matrix.
- Are you interchangeable? - A good rule of thumb is to aim for at least two modules for every layer - any time you have a single module for a given layer of your system, you have a single point of failure. For example, does your execution engine allow you to swap in different types of data storage, or is it married to one type of storage?
Once you have a good sense of your modules, you can start thinking about how to wire them together with standards.
1.3.2 Interoperable
These are the three core standards in a healthy composable data system:
- Intermediate representation - a standard representation for query plans
- Connectivity - a standard for accessing databases
- Data memory layout - a standard format for representing data in memory
In practice, the data memory layout standard should be the first one you adopt because it is core to how your data system functions. To enhance speed and reduce bottlenecks, the other standards should preserve the data format as information gets passed up and down the layers of the stack.
If your organization has started down a path of “Why can’t we just pick one [insert: UI, SQL dialect, engine, etc.]?”, a good strategy is to move the discussion away from unification around modules, and instead steer the collective energy toward unifying around standards. Then, standards serve as guardrails for considering (or re-considering) modules with confidence. For example (a code sketch follows the table):
Instead of… | Try… |
---|---|
Choosing a single user interface (UI) or programming language that everyone in the organization has to use | Adopting an IR standard and choosing UIs that produce it |
Building or selecting a single engine for data processing | Choosing engines that can consume the IR standard you adopt |
Attempting to unify data storage | Choosing a universal connectivity standard that allows users to access data wherever it is stored |
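As a rough sketch of the first two rows, here is what "a UI that produces the IR" can look like with Ibis and the ibis-substrait package. The import path and the table schema are assumptions (and may differ between versions); the resulting plan could then be handed to any Substrait-consuming engine, such as DuckDB's substrait extension or DataFusion.

```python
# Sketch: compile an Ibis expression to a Substrait plan.
# Assumes the ibis-substrait package; the import path may vary by version.
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

schema = ibis.schema({"user_id": "int64", "amount": "float64"})
t = ibis.table(schema, name="events")  # hypothetical table
expr = t.group_by("user_id").aggregate(total=t.amount.sum())

plan = SubstraitCompiler().compile(expr)   # a Substrait protobuf message
plan_bytes = plan.SerializeToString()      # ship this to any Substrait consumer
print(len(plan_bytes))
```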
1.3.3 Customizable
A truly composable system will not look like any other organization’s data system. There are two ways that composable systems can be customized:
- Top-down customization: With the right mix of standards in place, data system designers can create a completely bespoke system by simply reusing available modules. The components may be totally generic or 100% bespoke, but no matter what, the system you compose will operate like a custom system in the ways that matter the most for your application.
- Bottom-up customization: If you have played your cards right and have interoperable modules in place with composable standards, then your system’s users can also customize the system as they need, on the fly. For instance, using Ibis as an example, a user can change a single line of code to switch databases:
```python
import ibis
from ibis import _  # deferred expression helper used in the aggregation below

# 1. Define a query once, against any Ibis connection
def query(con):
    table = con.table("table_name")
    return (
        table
        .filter(...)
        .aggregate(
            by=[..., ...],
            count=_.count(),
        )
        .order_by([..., ...])
    )

# 2. Set up your connections
## with Postgres (connection parameters assumed to be defined elsewhere)
con_pg = ibis.postgres.connect(
    database=database, host=host, user=user, password=password
)

## with DuckDB (register a file, e.g. Parquet or CSV, as a table)
con_duckdb = ibis.duckdb.connect()
con_duckdb.register(file, table_name)

## with Spark
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
con_spark = ibis.pyspark.connect(session)

# 3. Execute the same query against any backend
query(con_pg).execute()
query(con_duckdb).execute()
query(con_spark).execute()
```
1.3.4 Extensible
Many organizations have niche needs that go beyond what any commercial or open source data system components can offer. Extensibility in data systems means that one or more components are designed in a way that allows developers to add new capabilities or functionality to meet their needs. At the top of the pyramid of needs, developers have a couple of options to extend components in the system:
- Piggybacked extensions - A piggybacked extension is one that builds on top of an existing module or component that fits the mold of your standards ecosystem. The main requirements here are that the module software is open source, and has a license that allows users to reuse the code. For example, Microsoft piggybacked on the Ibis framework for Python to develop their Magpie data science middleware. Similarly, Google piggybacked on Ibis for their BigQuery DataFrames.
- Greenfield extensions - A greenfield extension is one that is developed from scratch to meet a niche need. In a composable system, this is possible when a standard is well documented. For instance, TorchArrow uses the Arrow format internally to represent PyTorch data frames. GeoArrow is another extension of the Arrow format to represent geospatial data.
1.3.5 MICE: Composability through open standards
Composable data systems are MICE: modular, interoperable, customizable, and extensible. They are made up of interchangeable modules that ensure interoperability through open standards for exchanging and operating on data. Building systems on top of the Arrow data format standard means that all modules are able to exchange and operate on data with the same representation, making data exchange fast and smooth. The resulting system is flexible and can evolve, but also resilient because as the world changes, standards adapt.
1.4 Build your stack on top of open standards
If you found your way to this chapter of The Composable Codex first, go back to Chapter 00 A New Frontier, and learn about the work done by companies like Meta, Walmart, and more that have led us to this point.
Read Chapter 02: Bridging Divides
Choose your battles wisely. Learn how standards bring composability and interoperability to systems.
Access all of The Composable Codex chapters.
1.4.1 How can I keep up with Voltron Data?
You can keep up with all things Voltron Data by following us on LinkedIn and X. If you want to receive alerts for future content like The Composable Codex, sign up for our email list.
1.4.2 Who wrote The Composable Codex?
The Codex is written by engineers, data scientists, and leaders at Voltron Data. You can find out more about what we do at Voltron Data here: https://voltrondata.com/product
1.4.3 How can I cite The Codex?
For attribution, please cite this work as:
Voltron Data. 2023. The Composable Codex. https://voltrondata.com/codex
@ONLINE{composablecodex,
author = {Voltron Data},
title = {"The Composable Codex"},
year = {2023},
url = {https://voltrondata.com/codex},
langid = {en}
}