Feb 22, 2023
Ibis and Substrait: Supercharging Portability
Kae Suarez
Multi-language tools allow users the freedom to work where they want, how they want, unleashing greater productivity. We design and build composable data systems that enable users to write code once, in any language, on any device, and at any scale to develop pipelines faster.
Recently, the folks over at the Ibis Project wrote a wonderful blog post about using Ibis and Substrait. In it, they lay out how Substrait provides a new, standardized route for the Ibis interface to connect to backends, potentially enabling Ibis development to target Substrait instead of targeting each backend individually. They demoed using Ibis to generate Substrait for DuckDB, and laid out why Substrait is worthwhile even with SQL around; the three big takeaways were that Substrait offers standardization, extensibility, and ease of parsing. Substrait is an intermediate representation (IR) for data analytics: a standard language that lets tools talk to tools without having to think about what the other needs. This is exactly what we’d hope to see from an IR in this field, but we wanted to test it ourselves. If Substrait is suitable for the task, then we can write Ibis code without thinking about a given backend, then simply send over the plan.
So, that’s what we’ll do today! We’ll take Ibis, and route it to DataFusion.
Ibis to Substrait
We’ll perform all the same tasks outlined in the Ibis blog post mentioned at the beginning, but with one key difference.
Much as Gil Forsyth says there, Substrait is in its early days, and support isn’t complete across all producers and consumers. The current release of DataFusion doesn’t support the casting operations issued by Substrait (support is already in progress: https://github.com/apache/arrow-datafusion/pull/5137). So, instead of casting the ratings data as part of the workflow, we modify `imdb_ratings.parquet` ahead of time to store the `averageRating` column as `float64` and the `numVotes` column as `int32`. You can do this in your tool of choice; we recommend Apache Arrow! This also means that instead of:
ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating.cast("float"),
    num_votes=ratings.numVotes.cast("int"),
)
We have:
ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating,
    num_votes=ratings.numVotes,
)
Otherwise, everything from Ibis to Substrait is identical! As Substrait support matures, there will be fewer and fewer cases where front-end code needs modification to accommodate a particular backend. In the not-too-distant future, we hope to see a world where every base Substrait operation is supported by all producers and consumers of the IR, allowing for frictionless usage.
Substrait to DataFusion
For the time being, to run a Substrait plan in DataFusion, we need to go to Rust. However, that will change in an upcoming release.
In the meantime, we’ll go ahead and use Rust. With Substrait, there’s no need to pass around Python objects or the like; all we need is the filesystem.
Setup
For reference, we used these imports:
use datafusion_substrait::consumer::from_substrait_plan;
use datafusion_substrait::serializer;
use datafusion::physical_plan::execute_stream;
use datafusion::prelude::*;
use datafusion::arrow::csv::writer::Writer;
use std::fs::File;
use futures::StreamExt;
Ingesting Data
First, we need to get the table in, with the same name used in the plan. Substrait has few requirements, but this is one — otherwise, there would be more syntax needed on all ends to identify tables.
let mut ctx = SessionContext::new();
ctx.register_parquet("basics", "/Users/kaesuarez/Data/imdb/imdb_basics.parquet", ParquetReadOptions::default())
    .await?;
ctx.register_parquet("ratings", "/Users/kaesuarez/Data/imdb/imdb_casted_ratings.parquet", ParquetReadOptions::default())
    .await?;
Ingesting Substrait
Next, we’re going to get the plan into DataFusion. Here, we’re going to focus on the aggregation plan, but replacing the input plan and output path would handle the filtering case:
// First, read the serialized bytes on disk into a Substrait plan.
let proto = serializer::deserialize("topfilms.proto").await?;
// Next, map it to a DataFusion logical plan.
let plan = from_substrait_plan(ctx, &proto).await?;
Execution
From this point on, this is all just DataFusion; the actual consumption of the plan is over. No scrambling to make the input work: the plan is simply ingested. The next block takes the plan, runs it, and writes the result to CSV:
let state = ctx.state();
let phys_plan = state.create_physical_plan(&plan).await?;
let mut output = execute_stream(phys_plan, ctx.task_ctx())?;

let file = File::create("agg_plan_out.csv").unwrap();
let mut writer = Writer::new(file);
while let Some(batch) = output.next().await {
    writer.write(&batch?)?;
}
For reference, here’s the table from the Ibis article:
tconst | avg_rating | num_votes | primaryTitle | startYear |
---|---|---|---|---|
tt0111161 | 9.3 | 2651547 | The Shawshank Redemption | 1994 |
tt0068646 | 9.2 | 1838044 | The Godfather | 1972 |
tt0468569 | 9.0 | 2623735 | The Dark Knight | 2008 |
tt0167260 | 9.0 | 1827464 | The Lord of the Rings: The Return of the King | 2003 |
tt0108052 | 9.0 | 1343647 | Schindler’s List | 1993 |
tt0071562 | 9.0 | 1259465 | The Godfather Part II | 1974 |
tt0050083 | 9.0 | 782903 | 12 Angry Men | 1957 |
tt0110912 | 8.9 | 2029684 | Pulp Fiction | 1994 |
tt1375666 | 8.8 | 2325417 | Inception | 2010 |
tt0137523 | 8.8 | 2096752 | Fight Club | 1999 |
And here is the output CSV from DataFusion (slight differences are due to data updating regularly):
tconst | avg_rating | num_votes | primaryTitle | startYear |
---|---|---|---|---|
tt0111161 | 9.3 | 2696853 | The Shawshank Redemption | 1994 |
tt0068646 | 9.2 | 1871404 | The Godfather | 1972 |
tt0468569 | 9 | 2670549 | The Dark Knight | 2008 |
tt0167260 | 9 | 1857367 | The Lord of the Rings: The Return of the King | 2003 |
tt0108052 | 9 | 1363482 | Schindler’s List | 1993 |
tt0071562 | 9 | 1279166 | The Godfather Part II | 1974 |
tt0050083 | 9 | 796371 | 12 Angry Men | 1957 |
tt0110912 | 8.9 | 2069981 | Pulp Fiction | 1994 |
tt1375666 | 8.8 | 2369147 | Inception | 2010 |
tt0137523 | 8.8 | 2141856 | Fight Club | 1999 |
To quote Gil, “that looks about right to me.” No need for bridge code between Ibis and DuckDB or Ibis and DataFusion in particular: Substrait alone provides the bridge. A nice bonus is that the plan sticks around as long as it’s saved to disk. If the data updates and the same compute needs to run again, whether on one of these backends or any other, the same plan file can simply be passed along with the fresh data; there’s no need to re-run the producer!
Ibis and Substrait individually offer opportunities for standardization in data analytics stacks, and together they’re even better: Ibis becomes the interface, Substrait the language in the middle, and any backend handles execution. Both have a lot of momentum, and this connection is only one of the things we’re excited about.
To learn more about Ibis, check out the project page. Also make sure to check out Substrait while you’re at it, especially if you develop frontends or backends; it may be the solution you need to cut down on the constant implementation work spent supporting other tools. If you’re working with either and want to accelerate your success, learn how a Voltron Data Enterprise Support subscription can help you.
Photo by Shiro Hatori