Voltron Data Logo
About

Learn more about our company

Contact Us

Get in touch with our team

Theseus

  • How it Works

    Core concepts and architecture overview

  • Control Plane

    Kubernetes deployment guide and best practices

  • Query Profiler

    Analyze and optimize query performance

Arrow

In-memory columnar data processing

Ibis

Python dataframe API for multiple backends

RAPIDS

GPU-accelerated data science and analytics

Dev Blog

Latest updates and technical insights

Benchmarks Report

Read about our 2024 benchmarks for our data engine, Theseus.

The Composable Codex

A 5-part guide to understanding composable

Try Theseus

Product

  • How it Works
  • Control Plane
  • Query Profiler

Resources

  • Blog
  • Composable Codex
  • Benchmarks

Getting Started

  • Test Drive

Theseus

Built for AI workloads, Theseus is a high-performance SQL engine with GPU acceleration.

© 2021-2025 Voltron Data, Inc. All rights reserved.

Terms of Service | Privacy Policy | Cookie Policy

Introducing Substrait: An Interoperable Data to Engine Connector

Phillip Cloud

Jacques Nadeau

Co-creator of Substrait

Apache Arrow

Dremio

March 4, 2022

Over the past decade data has become larger and more complex, hardware more powerful and cost-efficient, and businesses more data-driven. At the same time, the proliferation of data tools, programming languages, computing engines, and hardware developed to process data has made it difficult to maximize the efficiency of data analytics workflows.

Every data analysis tool, existing or new, such as dplyr or ibis, needs unique connectors (or a multitude of imported drivers) to work with every new compute engine. This makes it difficult for organizations to adapt and shift in the data analytics space.

Enter Substrait.

Substrait seeks to create a cross-language, interoperable specification for data compute operations, connecting analysis tools with computing engines and hardware. Substrait will provide the same benefits to processing that Apache Arrow provides for sharing data between processes: a standard, flexible way for APIs and compute engines to share the representation of analytics computations.

Why Not Build This in Arrow?

Establishing a separate project helps achieve several goals.

First, this encourages inclusiveness and maximizes opportunities for collaboration among maintainers from projects beyond Arrow. Second, Substrait has huge potential to simplify the development and maintenance of countless projects by decoupling analytical frontends from hardware-driven backends. Third, being front-end and back-end agnostic, Substrait can be developed at its own pace, independent of any other project’s release cycles.

The Challenge

Coming back to the engineering problem at hand, the main issue is that every front-end analysis tool, like dplyr in R or ibis in Python, currently must produce an engine-specific compute representation for each engine it wants to execute against.

In the case of ibis, some backends produce SQLAlchemy expressions, others produce SQL strings, and still others execute expressions directly against in-memory data. On the R side, dplyr (and a host of other frameworks) faces the same problem of building and maintaining support for an ever-growing number of compute engines.

All of this is part of a larger problem that portends complexity and inefficiency. If solved, it can unlock the power of heterogeneous hardware, cost-effective multi-tenancy compute, multi-cloud and multi-backend execution.

Substrait is a critical part of making this happen, since it provides a specification for describing computation plans. The key idea is that any spec-conforming Substrait producer can execute a plan using any spec-conforming consumer.
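To make the decoupling concrete, here is a minimal sketch in which plain Python dictionaries stand in for Substrait plans. The producer and consumer functions and the plan layout are hypothetical, chosen only for illustration; they are not part of the Substrait specification.

```python
from typing import Any, Dict, List

Plan = Dict[str, Any]  # hypothetical stand-in for a Substrait plan


# Two hypothetical producers with different front-end styles.
def dataframe_producer(table: str, column: str) -> Plan:
    """Dataframe-style front end: count(column) over a table."""
    return {
        "rel": "aggregate",
        "input": {"rel": "read", "table": table},
        "measure": {"func": "count", "field": column},
    }


def sql_like_producer(query: str) -> Plan:
    """Toy front end that only understands 'COUNT <col> FROM <table>'."""
    _, column, _, table = query.split()
    return {
        "rel": "aggregate",
        "input": {"rel": "read", "table": table},
        "measure": {"func": "count", "field": column},
    }


# Two hypothetical consumers that only ever see the shared plan format.
def executing_consumer(plan: Plan, tables: Dict[str, List[dict]]) -> int:
    """Executes the plan against in-memory rows (nulls are not counted)."""
    rows = tables[plan["input"]["table"]]
    field = plan["measure"]["field"]
    return sum(1 for row in rows if row[field] is not None)


def explaining_consumer(plan: Plan) -> str:
    """Does not execute anything; renders the plan as text instead."""
    measure = plan["measure"]
    return f"{measure['func']}({measure['field']}) over {plan['input']['table']}"


tables = {"t": [{"x": 1}, {"x": None}, {"x": 3}]}
plans = [dataframe_producer("t", "x"), sql_like_producer("COUNT x FROM t")]

# Both producers emit the same plan, so every producer/consumer pairing works.
for plan in plans:
    print(executing_consumer(plan, tables), "|", explaining_consumer(plan))
```

With a shared plan format, each front end and each engine implements the format once, instead of every front end maintaining a connector per engine.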

So, What is Substrait?

Substrait is a specification. It is a list of primitives and descriptions of behaviors that producers and consumers implement to share analytics compute plans. This idea is not new and can be found in several places around us. Substrait is like the SWIFT protocol, a standard for communicating between banks worldwide: each bank uses different internal systems, yet people are able to send money between banks because of the shared protocol. Substrait fills a similar niche for analytics tools.

What’s a Producer?

A Substrait producer is a piece of software, typically a library, that produces Substrait plans. In the banking analogy, a producer is the bank that sends money to another bank and is responsible for writing SWIFT messages that describe the transaction according to the agreed specification.

A concrete example of a producer is the ibis Python library. With ibis you write code to do some analytics:

python
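import ibis

# Connect to a SQLite database and build a grouped count expression
con = ibis.sqlite.connect("my_table.db")
t = con.table("t")
results = t.group_by(t.key).aggregate(counts=t.x.count())
# Illustrative only: with a Substrait-producing compiler, this step would yield a protobuf plan
substrait_proto = results.compile()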

Ibis, as a Substrait producer, would then walk the expression tree, producing Substrait protobuf messages at each node, and finally assembling a query plan.
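The tree walk described above can be sketched roughly as follows. This is not the ibis implementation: the node classes are hypothetical, heavily simplified stand-ins for ibis expression nodes, and dictionaries stand in for Substrait protobuf messages.

```python
from dataclasses import dataclass
from typing import Any, Dict


# Hypothetical, simplified expression nodes; real ibis nodes are richer.
@dataclass
class Table:
    name: str


@dataclass
class Count:
    column: str


@dataclass
class GroupedAggregate:
    source: Table
    key: str
    measure: Count
    alias: str


def translate(node: Any) -> Dict[str, Any]:
    """Walk the tree recursively, emitting one plan fragment per node."""
    if isinstance(node, Table):
        return {"read": {"table": node.name}}
    if isinstance(node, Count):
        return {"function": "count", "args": [{"field": node.column}]}
    if isinstance(node, GroupedAggregate):
        return {
            "aggregate": {
                "input": translate(node.source),  # recurse into the child
                "groupings": [{"field": node.key}],
                "measures": [{"name": node.alias, **translate(node.measure)}],
            }
        }
    raise TypeError(f"unsupported node: {type(node).__name__}")


# Mirrors the grouped count from the ibis example above.
expr = GroupedAggregate(Table("t"), key="key", measure=Count("x"), alias="counts")

# Assemble the per-node fragments into a single plan.
plan = {"relations": [translate(expr)]}
print(plan)
```

Each node type maps to exactly one kind of fragment, so supporting a new operation means adding one more branch to the walker.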

We expect that most major languages that support interactive analytics (such as Python and R) will have producer implementations. In fact, we’ve started building out the Python producer here. Take a look!

What’s a Consumer?

A Substrait consumer, on the other hand, is a piece of software that consumes Substrait plans and does something with them. In the banking analogy, the consumer is the bank receiving the money, which knows where to deposit it based on the SWIFT messages. Typically, the consumer executes those plans, but that is by no means a requirement: a consumer might instead transform a plan and pass it on to the next phase of an execution pipeline.

At Voltron Data, we are developing a Substrait consumer for Arrow’s compute facilities. Typically, a consumer turns the protobuf messages into its own internal representation; in this case, the transformation is from Substrait messages to ExecPlans in the Arrow C++ implementation.
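As a rough illustration of that translation step, the sketch below turns a dictionary-based plan into a toy engine's own operator objects and then runs them. The operator classes and the plan layout are hypothetical; this is not Arrow's ExecPlan API, only the general shape of what a consumer does.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]
Plan = Dict[str, Any]  # dictionaries stand in for Substrait protobuf messages


class ScanNode:
    """Engine-internal operator: yields rows from an in-memory table."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def run(self) -> Iterator[Row]:
        return iter(self.rows)


class FilterNode:
    """Engine-internal operator: keeps rows where field > value."""

    def __init__(self, child: Any, field: str, value: Any) -> None:
        self.child, self.field, self.value = child, field, value

    def run(self) -> Iterator[Row]:
        return (r for r in self.child.run() if r[self.field] > self.value)


def build(plan: Plan, tables: Dict[str, List[Row]]) -> Any:
    """Translate a plan fragment into this engine's own operator objects."""
    if "read" in plan:
        return ScanNode(tables[plan["read"]["table"]])
    if "filter" in plan:
        f = plan["filter"]
        return FilterNode(build(f["input"], tables), f["field"], f["greater_than"])
    raise ValueError(f"unsupported relation: {sorted(plan)}")


tables = {"t": [{"key": "a", "x": 1}, {"key": "b", "x": 5}, {"key": "a", "x": 7}]}
plan = {"filter": {"input": {"read": {"table": "t"}}, "field": "x", "greater_than": 2}}

exec_plan = build(plan, tables)  # translation: plan -> engine operators
result = list(exec_plan.run())   # execution: run the operators
print(result)
```

Keeping translation and execution as separate steps also covers the case where a consumer only transforms a plan and hands it on without executing it.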

We are extremely excited to see what developers will build with Substrait!

The Internals

Substrait specifies a collection of objects that model abstract analytics compute primitives. The full specification can be found here.

Types

Substrait specifies a list of the most common types that occur in analytics computations, including typical primitives like 64-bit integers and floating-point types, as well as complex types like lists, maps, and structs.

Support for user-defined types exists as extension types. The Arrow SNE consumer leverages these to implement the large variety of types supported in Arrow.
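One way to picture this type system is as a small set of composable type objects. The sketch below is a hypothetical model, not the layout of the Substrait protobuf definitions; the class names, the text rendering, and the extension type name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "i64" or "fp64"


@dataclass(frozen=True)
class ListType:
    element: Any


@dataclass(frozen=True)
class MapType:
    key: Any
    value: Any


@dataclass(frozen=True)
class StructType:
    fields: Tuple[Any, ...]


@dataclass(frozen=True)
class ExtensionType:
    name: str  # user-defined type, identified by name


def render(t: Any) -> str:
    """Render a type as text (the syntax here is illustrative only)."""
    if isinstance(t, Primitive):
        return t.name
    if isinstance(t, ListType):
        return f"list<{render(t.element)}>"
    if isinstance(t, MapType):
        return f"map<{render(t.key)}, {render(t.value)}>"
    if isinstance(t, StructType):
        return "struct<" + ", ".join(render(f) for f in t.fields) + ">"
    if isinstance(t, ExtensionType):
        return f"ext:{t.name}"
    raise TypeError(f"unknown type: {t!r}")


schema = StructType((
    Primitive("i64"),
    ListType(Primitive("fp64")),
    MapType(Primitive("string"), Primitive("i64")),
    ExtensionType("my_engine.large_string"),  # hypothetical extension name
))
print(render(schema))
```

Complex types nest arbitrarily, and an extension type slots in anywhere a built-in type can appear.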

Expressions

Expressions form the domain-specific part of an analytics computation.

Substrait supports a small set of expression types that seek to capture the vast majority of use cases. Right now that includes field references like my_table.my_field, and a few different types of functions such as scalar, aggregate, and window functions. Support for slightly more exotic things like table functions and UDFs is planned.
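The difference between these expression kinds can be sketched with a toy evaluator. The classes and function tables below are hypothetical and cover only field references, scalar functions, and aggregate functions; window functions are left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class FieldRef:
    name: str  # refers to a column of the input


@dataclass(frozen=True)
class ScalarCall:
    func: str
    args: Tuple[Any, ...]  # evaluated once per row


@dataclass(frozen=True)
class AggregateCall:
    func: str
    arg: Any  # evaluated per row, then reduced across rows


# Hypothetical function registries for this sketch.
SCALARS = {"add": lambda a, b: a + b, "multiply": lambda a, b: a * b}
AGGREGATES = {"sum": sum, "max": max}


def eval_scalar(expr: Any, row: Row) -> Any:
    """Evaluate a row-level expression: one input row, one output value."""
    if isinstance(expr, FieldRef):
        return row[expr.name]
    if isinstance(expr, ScalarCall):
        return SCALARS[expr.func](*(eval_scalar(a, row) for a in expr.args))
    raise TypeError(f"not a scalar expression: {expr!r}")


def eval_aggregate(expr: AggregateCall, rows: List[Row]) -> Any:
    """Evaluate an aggregate: many input rows, one output value."""
    return AGGREGATES[expr.func]([eval_scalar(expr.arg, r) for r in rows])


rows = [{"price": 2, "qty": 3}, {"price": 5, "qty": 1}]
revenue = ScalarCall("multiply", (FieldRef("price"), FieldRef("qty")))
total = AggregateCall("sum", revenue)

per_row = [eval_scalar(revenue, r) for r in rows]
print(per_row)
print(eval_aggregate(total, rows))
```

Scalar functions map each row to a value, while aggregate functions collapse many rows into one; that is why the two are modeled as separate expression kinds.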

Relations

Relations form the backbone of analytics computations.

Substrait supports a small set of logical relations necessary for analytics workloads, including joining, filtering, grouped aggregation, and others. Notably, correlated subqueries are not yet implemented but are expected to be done soon.

Serialization

Substrait specifies a binary representation using protocol buffers, and there are plans for a text representation as well.

Right now, the focus is on stabilizing the binary representation.

Get Involved!

We are excited to work on Substrait, and we think it will become the primary way to describe and execute analytics compute workloads. If this excites you, join us at Voltron Data and be part of this journey!

You can also get involved by joining the Substrait community that is growing every day. Here are some ways to get connected:

  • Slack
  • GitHub Issues and Discussions
  • Follow @substrait_io on Twitter
  • Documentation

Check out the community page for additional details!


Photo source: https://www.pexels.com/photo/close-up-photography-of-spider-web-167259/
