Feb 22, 2023
Ibis and Substrait: Supercharging Portability
Kae Suarez
Multi-language tools allow users the freedom to work where they want, how they want, unleashing greater productivity. We design and build composable data systems that enable users to write code once, in any language, on any device, and at any scale to develop pipelines faster.
Recently, the folks over at the Ibis Project wrote a wonderful blog post about using Ibis and Substrait. In it, they lay out how Substrait provides a new, standardized route for the Ibis interface to connect to backends, potentially enabling Ibis development to target Substrait instead of targeting each backend individually. They demoed using Ibis to generate Substrait for DuckDB, and laid out why Substrait is worthwhile even with SQL around; the three big takeaways were that Substrait offers standardization, extensibility, and ease of parsing. Substrait is an intermediate representation (IR) for data analytics: a standard language that lets tools talk to tools without having to think about what the other needs. This is exactly what we’d hope to see from an IR in this field, but we wanted to test it ourselves. If Substrait is suitable for the task, then we can write Ibis code without thinking about a given backend, then simply send over the plan.
So, that’s what we’ll do today! We’ll take Ibis, and route it to DataFusion.
Ibis to Substrait
We’ll perform all the same tasks outlined in the Ibis blog post mentioned at the beginning, but with one key difference.
Much as Gil Forsyth says there, Substrait is in its early days, and support isn’t complete across all producers and consumers. The current release of DataFusion doesn’t support the casting operations issued by Substrait (support is already in progress: https://github.com/apache/arrow-datafusion/pull/5137). So, instead of casting the ratings data as part of the workflow, we modify `imdb_ratings.parquet` ahead of time to store the `averageRating` column as `float64` and the `numVotes` column as `int32`. You can do this in your tool of choice; we recommend Apache Arrow! This also means that instead of:
ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating.cast("float"),
    num_votes=ratings.numVotes.cast("int"),
)
We have:
ratings = ratings.select(
    ratings.tconst,
    avg_rating=ratings.averageRating,
    num_votes=ratings.numVotes,
)
Otherwise, everything from Ibis to Substrait is identical! As Substrait support matures, there will be fewer and fewer cases where front-end code needs modification to accommodate a particular backend. In the not-too-distant future, we hope to see a world where every base Substrait operation is supported by all producers and consumers of the IR, allowing for frictionless usage.
Substrait to DataFusion
For the time being, to run a Substrait plan in DataFusion, we need to go to Rust. However, that will change in an upcoming release.
In the meantime, we’ll go ahead and use Rust. With Substrait, there’s no need to pass around Python objects or the like; all we need is the filesystem.
Setup
For reference, we used these imports:
use datafusion_substrait::consumer::from_substrait_plan;
use datafusion_substrait::serializer;
use datafusion::physical_plan::execute_stream;
use datafusion::prelude::*;
use datafusion::arrow::csv::writer::Writer;
use std::fs::File;
use futures::StreamExt;
Ingesting Data
First, we need to get the table in, with the same name used in the plan. Substrait has few requirements, but this is one — otherwise, there would be more syntax needed on all ends to identify tables.
let mut ctx = SessionContext::new();
ctx.register_parquet("basics", "/Users/kaesuarez/Data/imdb/imdb_basics.parquet", ParquetReadOptions::default())
    .await?;
ctx.register_parquet("ratings", "/Users/kaesuarez/Data/imdb/imdb_casted_ratings.parquet", ParquetReadOptions::default())
    .await?;
Ingesting Substrait
Next, we’re going to get the plan into DataFusion. Here, we’re going to focus on the aggregation plan, but replacing the input plan and output path would handle the filtering case:
// First, read the serialized bytes on disk into a Substrait plan.
let proto = serializer::deserialize("topfilms.proto").await?;
// Next, map it to a DataFusion logical plan.
let plan = from_substrait_plan(ctx, &proto).await?;
Execution
From this point on, this is all just DataFusion; the actual consumption of the plan is over. No scrambling to make the input work: the plan is simply ingested. The next block takes the plan, runs it, and writes the result to CSV:
let state = ctx.state();
let phys_plan = state.create_physical_plan(&plan).await?;
let mut output = execute_stream(phys_plan, ctx.task_ctx())?;

let file = File::create("agg_plan_out.csv").unwrap();
let mut writer = Writer::new(file);
while let Some(batch) = output.next().await {
    writer.write(&batch?)?;
}
For reference, here’s the table from the Ibis article:
tconst | avg_rating | num_votes | primaryTitle | startYear |
---|---|---|---|---|
tt0111161 | 9.3 | 2651547 | The Shawshank Redemption | 1994 |
tt0068646 | 9.2 | 1838044 | The Godfather | 1972 |
tt0468569 | 9.0 | 2623735 | The Dark Knight | 2008 |
tt0167260 | 9.0 | 1827464 | The Lord of the Rings: The Return of the King | 2003 |
tt0108052 | 9.0 | 1343647 | Schindler’s List | 1993 |
tt0071562 | 9.0 | 1259465 | The Godfather Part II | 1974 |
tt0050083 | 9.0 | 782903 | 12 Angry Men | 1957 |
tt0110912 | 8.9 | 2029684 | Pulp Fiction | 1994 |
tt1375666 | 8.8 | 2325417 | Inception | 2010 |
tt0137523 | 8.8 | 2096752 | Fight Club | 1999 |
And here is the output CSV from DataFusion (slight differences are due to data updating regularly):
tconst | avg_rating | num_votes | primaryTitle | startYear |
---|---|---|---|---|
tt0111161 | 9.3 | 2696853 | The Shawshank Redemption | 1994 |
tt0068646 | 9.2 | 1871404 | The Godfather | 1972 |
tt0468569 | 9 | 2670549 | The Dark Knight | 2008 |
tt0167260 | 9 | 1857367 | The Lord of the Rings: The Return of the King | 2003 |
tt0108052 | 9 | 1363482 | Schindler’s List | 1993 |
tt0071562 | 9 | 1279166 | The Godfather Part II | 1974 |
tt0050083 | 9 | 796371 | 12 Angry Men | 1957 |
tt0110912 | 8.9 | 2069981 | Pulp Fiction | 1994 |
tt1375666 | 8.8 | 2369147 | Inception | 2010 |
tt0137523 | 8.8 | 2141856 | Fight Club | 1999 |
To quote Gil, “that looks about right to me.” No need for bridge code between Ibis and DuckDB or Ibis and DataFusion in particular: Substrait alone provides the bridge. A nice bonus is that the plan sticks around as long as it’s saved to disk. If the data updates and the same compute needs to run again, whether on one of these backends or any other, the same plan file can simply be passed along with the fresh data; there’s no need to re-run the producer!
Ibis and Substrait individually offer opportunities for standardization in data analytics stacks, and together they’re even better: Ibis becomes the interface, Substrait the language in the middle, and any backend handles execution. Both have a lot of momentum, and this connection is only one of the things we’re excited about.
To learn more about Ibis, check out the project page. Also make sure to check out Substrait while you’re at it, especially if you develop frontends or backends; it may be the solution you need to cut down on the constant implementation work spent supporting other tools. If you’re working with either and want to accelerate your success, learn how a Voltron Data Enterprise Support subscription can help you.
Photo by Shiro Hatori