Feb 22, 2023

Ibis and Substrait: Supercharging Portability

Kae Suarez

timelapse of car tail lights around road bend at night
Work in Your Language of Choice
Multi-language tools allow users the freedom to work where they want, how they want, unleashing greater productivity. We design and build composable data systems that enable users to write code once, in any language, on any device, and at any scale to develop pipelines faster.

Recently, the folks over at the Ibis Project wrote this wonderful blog about using Ibis and Substrait. In it, they lay out how Substrait provides a new, standardized route for the Ibis interface to connect to backends, potentially enabling Ibis development to target Substrait instead of targeting each backend individually. They gave a demo of using Ibis to generate Substrait for DuckDB, and explained why Substrait is worthwhile even with SQL around; the three big takeaways were that Substrait offers standardization, extensibility, and ease of parsing. Substrait is an intermediate representation (IR) for data analytics: a standard language that lets tools talk to other tools without having to think about what the other needs. This is exactly what we'd hope to see from an IR in this field, but we wanted to test it ourselves. If Substrait is suitable for the task, then we can write Ibis code without thinking about a given backend at all, then simply send over the plan.

So, that’s what we’ll do today! We’ll take Ibis, and route it to DataFusion.

Ibis to Substrait

We’ll perform the same tasks outlined in the blog mentioned at the beginning, but with one key difference. You can refer to the original post here.

Much as Gil Forsyth says there, Substrait is in its early days: support isn’t complete across all producers and consumers. The current release of DataFusion doesn’t support the casting operations this plan expresses in Substrait (though support is already in progress: https://github.com/apache/arrow-datafusion/pull/5137). So, instead of casting the ratings data as part of the workflow, we modify imdb_ratings.parquet ahead of time to store the averageRating column as float64 and the numVotes column as int32. You can do this in your tool of choice; we recommend Apache Arrow! This also means that instead of:

ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating.cast("float"),
    num_votes=ratings.numVotes.cast("int"),
)

We have:

ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating,
    num_votes=ratings.numVotes,
)

Otherwise, everything from Ibis to Substrait is identical! As Substrait support matures, there will be fewer and fewer cases where front-end code needs modification to accommodate the backend. In the not-too-distant future, we hope to see a world where every base Substrait operation is supported by all producers and consumers of the IR, allowing for frictionless usage.

Substrait to DataFusion

For the time being, to run a Substrait plan in DataFusion, we need to go to Rust. However, that will change in an upcoming release.

In the meantime, we’ll go ahead and use Rust. With Substrait, there’s no need to pass language-specific objects between tools; the plan file on the filesystem is all we need.
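For completeness, a dependency block along these lines is what such a project needs in its Cargo.toml (the version numbers here are illustrative; pin to whatever current releases you're on):

```toml
[dependencies]
datafusion = "18"
datafusion-substrait = "18"
futures = "0.3"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```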


Setup

For reference, we used these imports:

use datafusion_substrait::consumer::from_substrait_plan;
use datafusion_substrait::serializer;

use datafusion::physical_plan::execute_stream;
use datafusion::prelude::*;

use datafusion::arrow::csv::writer::Writer;

use std::fs::File;

use futures::StreamExt;


Ingesting Data

First, we need to register the tables, each under the same name used in the plan. Substrait has few requirements, but this is one; otherwise, more syntax would be needed on all ends to identify tables.

let mut ctx = SessionContext::new();
ctx.register_parquet(
    "basics",
    "/Users/kaesuarez/Data/imdb/imdb_basics.parquet",
    ParquetReadOptions::default(),
)
.await?;
ctx.register_parquet(
    "ratings",
    "/Users/kaesuarez/Data/imdb/imdb_casted_ratings.parquet",
    ParquetReadOptions::default(),
)
.await?;


Ingesting Substrait

Next, we’re going to get the plan into DataFusion. Here, we’re going to focus on the aggregation plan, but replacing the input plan and output path would handle the filtering case:

// First, read the serialized bytes off disk and into a Substrait plan.
let proto = serializer::deserialize("topfilms.proto").await?;

// Next, map it to a DataFusion logical plan.
let plan = from_substrait_plan(&mut ctx, &proto).await?;


Execution

From this point on, this is all just DataFusion; the actual consumption of the plan is over. No scrambling to make the input work: the plan is simply ingested. The next block takes the plan, runs it, and writes the results to CSV:

let state = ctx.state();
let phys_plan = state.create_physical_plan(&plan).await?;
let mut output = execute_stream(phys_plan, ctx.task_ctx())?;

let file = File::create("agg_plan_out.csv")?;
let mut writer = Writer::new(file);

// Drain the stream, writing each record batch as CSV rows.
while let Some(batch) = output.next().await {
    writer.write(&batch?)?;
}

For reference, here’s the table from the Ibis article:

tconst     avg_rating  num_votes  primaryTitle                                   startYear
tt0111161  9.3         2651547    The Shawshank Redemption                       1994
tt0068646  9.2         1838044    The Godfather                                  1972
tt0468569  9.0         2623735    The Dark Knight                                2008
tt0167260  9.0         1827464    The Lord of the Rings: The Return of the King  2003
tt0108052  9.0         1343647    Schindler’s List                               1993
tt0071562  9.0         1259465    The Godfather Part II                          1974
tt0050083  9.0         782903     12 Angry Men                                   1957
tt0110912  8.9         2029684    Pulp Fiction                                   1994
tt1375666  8.8         2325417    Inception                                      2010
tt0137523  8.8         2096752    Fight Club                                     1999

And here is the output CSV from DataFusion (slight differences from the table above are due to the IMDB data updating regularly):

tconst     avg_rating  num_votes  primaryTitle                                   startYear
tt0111161  9.3         2696853    The Shawshank Redemption                       1994
tt0068646  9.2         1871404    The Godfather                                  1972
tt0468569  9           2670549    The Dark Knight                                2008
tt0167260  9           1857367    The Lord of the Rings: The Return of the King  2003
tt0108052  9           1363482    Schindler’s List                               1993
tt0071562  9           1279166    The Godfather Part II                          1974
tt0050083  9           796371     12 Angry Men                                   1957
tt0110912  8.9         2069981    Pulp Fiction                                   1994
tt1375666  8.8         2369147    Inception                                      2010
tt0137523  8.8         2141856    Fight Club                                     1999

To quote Gil, “that looks about right to me.” No bridge code between Ibis and DuckDB, or between Ibis and DataFusion in particular, was needed; Substrait alone provides the bridge. A nice bonus: the plan sticks around as long as it’s saved to disk. If the data updates and the same compute needs to run again, be it on one of these backends or any other, the same plan file can simply be passed along with the fresh data. There’s no need to re-run the producer!

Ibis and Substrait individually offer opportunities for standardization in data analytics stacks, and together they’re even better: Ibis becomes the interface, Substrait the language in the middle, and any backend handles execution. Both have a lot of momentum, and this connection is only one of the things we’re excited about.

To learn more about Ibis, check out the project page. Make sure to check out Substrait while you’re at it, especially if you develop front-ends or backends; it may be the solution you need to cut down on the constant implementation work of supporting other tools. If you’re working with either and want to accelerate your success, learn how a Voltron Data Enterprise Support subscription can help you.

Photo by Shiro Hatori