Mar 04, 2022

Introducing Substrait: An Interoperable Data to Engine Connector

Phillip Cloud and Jacques Nadeau, Co-creator of Substrait, Apache Arrow and Dremio

Free Close Up Photography of Spider Web Stock Photo

Over the past decade data has become larger and more complex, hardware more powerful and cost-efficient, and businesses more data-driven. At the same time, the proliferation of data tools, programming languages, computing engines, and hardware developed to process data has made it difficult to maximize the efficiency of data analytics workflows.

Every existing and new data analysis tool, like dplyr or ibis, needs unique connectors–or a multitude of imported drivers–to work with every new compute engine. This makes it difficult for organizations to adapt and shift in the data analytics space.

Enter Substrait.

Substrait seeks to create a cross-language, interoperable specification for data compute operations, connecting analysis tools with computing engines and hardware. Substrait will provide the same benefits to processing that Apache Arrow provides for sharing data between processes. Namely, a standard, flexible way for APIs and compute engines to share the representation of analytics computations.

Why Not Build This in Arrow?

Establishing a separate project helps achieve several goals.

First, this encourages inclusiveness and maximizes opportunities for collaboration among maintainers from projects beyond Arrow. Second, Substrait has huge potential to simplify the development and maintenance of countless projects by decoupling analytical frontends from hardware-driven backends. Third, being front-end and back-end agnostic, Substrait can be developed at its own pace, independent of any other project’s release cycles.

The Challenge

Coming back to the engineering problem at hand, the main issue is that every front-end analysis tool like dplyr in R or ibis in Python, currently must produce an engine-specific compute representation for each of the engines the front-end wants to execute against.

In the case of ibis, some backends produce SQLAlchemy expressions and other backends produce SQL strings, yet some other backends execute expressions against in-memory data. On the R side, dplyr (and a host of other frameworks) face the same problem of building and maintaining support for an ever-growing number of compute engines.

All of this is part of a larger problem that portends complexity and inefficiency. If solved, it can unlock the power of heterogeneous hardware, cost-effective multi-tenancy compute, multi-cloud and multi-backend execution.

Substrait is a critical part of making this happen, since it provides a specification for describing computation plans. The key idea is that any spec-conforming Substrait producer can execute a plan using any spec-conforming consumer.

So, What is Substrait?

Substrait is a specification. It is a list of primitives and descriptions of behaviors that producers and consumers implement to share analytics compute plans. This idea is not new and can be found in several places around us. Substrait is like the SWIFT protocol, a standard for communicating between banks worldwide: each bank uses different internal systems, yet people are able to send money between banks because of the shared protocol. Substrait fills a similar niche for analytics tools.

What’s a Producer?

A Substrait producer is a piece of software, typically a library, that produces Substrait plans. Drawing from the banking example, a producer is the bank that sends the money to another bank, and is tasked to write SWIFT messages to describe the transaction according to the agreed specification.

A concrete example of a producer is the ibis Python library. With ibis you write code to do some analytics:

con = ibis.sqlite.connect("my_table.db")
t = con.table("t")
results = t.groupby(t.key).aggregate(counts=t.x.count())
substrait_proto = results.compile()

Ibis, as a Substrait producer, would then walk the expression tree, producing Substrait protobuf messages at each node, and finally assembling a query plan.

We expect that most major languages that support interactive analytics (such as Python and R) will have producer implementations. In fact, we’ve started building out the Python producer here. Take a look!

What’s a Consumer?

A Substrait consumer, on the other hand, is a piece of software that consumes Substrait plans and does something with them; the consumer is the bank receiving the money, and it understands where to deposit them based on the SWIFT messages. Typically, the consumer would execute those plans, but by no means is that a requirement. A consumer might transform a plan and pass it on to the next phase of an execution pipeline.

At Voltron Data, we are developing a Substrait consumer for Arrow’s compute facilities. Typically, a consumer will turn the protobuf messages into their representation and in this case, the transformation is from Substrait messages to ExecPlans in the Arrow C++ implementation.

We are extremely excited to see what developers will build with Substrait!

The Internals

Substrait specifies a collection of objects that model abstract analytics compute primitives. The full specification can be found here.

Types

Substrait specifies a list of the most common types that occur in analytics computations, including typical primitives like 64-bit integers and floating-point types, as well as complex types like lists, maps, and structs.

Support for user-defined types exists as extension types. The Arrow SNE consumer leverages these to implement the large variety of types supported in Arrow.

Expressions

Expressions form the domain-specific part of an analytics computation.

Substrait supports a small set of expression types that seek to capture the vast majority of use cases. Right now that includes field references like my_table.my_field, and a few different types of functions such as scalar, aggregate, and window functions. Support for slightly more exotic things like table functions and UDFs are planned.

Relations

Relations form the backbone of analytics computations.

Substrait supports a small set of logical relations necessary for analytics workloads, including joining, filtering, grouped aggregation, and others. Notably, correlated subqueries are not yet implemented but are expected to be done soon.

Serialization

Substrait specifies a binary representation using protocol buffers and there are plans for a text representation as well.

Right now, the focus is on stabilizing the binary representation.

Get Involved!

We are excited to work on Substrait, and we think it will become the primary way to describe and execute analytics compute workloads. If this excites you, join us at Voltron Data and be part of this journey!

You can also get involved by joining the Substrait community that is growing every day. Here are some ways to get connected:

Check out the community page for additional details!



Photo source: https://www.pexels.com/photo/close-up-photography-of-spider-web-167259/