Nov 17, 2022
Ibis Explained: Increasing Code Portability and Performance Gains
Ibis is an emerging open source framework with 280,000 downloads in the last month (October 2022). We talk about – and contribute to – Ibis a lot here at Voltron Data. A key part of our company culture is knowledge sharing, which is why we’re releasing this “Ibis Explained” series. It is inspired by the themes and questions that come up from Voltronauts and the community at large.
Our first post covered the ways Ibis and pandas compare. Today, we’re talking about the code portability and performance gains you can achieve with Ibis.
How is Ibis Portable?
Ibis is an engine-agnostic framework for querying data. This means that no matter which engine you’re using to wrangle your data, Ibis’s syntax and concepts are the same, whether your data is in a local pandas DataFrame or in a Parquet file stored on a Hadoop Distributed File System (HDFS) and queried through PySpark.
Of course, not all engines support all operations. There are certain nuances between how engines operate, and there can be some functionality that isn’t currently (or won’t ever be) supported. However, there is a large chunk of functionality that is supported across multiple engines. For that intersection, Ibis code will remain the same between engines, which helps with tech stack migrations and upscaling/downsizing a workflow.
One example of easily-portable Ibis code is the bfill and ffill function that we wrote about in September. We demonstrated that this code works for several engines, including pandas, Postgres, BigQuery, and DuckDB in a companion notebook. To get it working for a new backend that supports the function’s operations, all you need to do is swap out the connector.
What Performance Gains Can I Expect to See with Ibis?
The Ibis 3.2 release brought significant performance gains for users. Marlene Mhangami, Developer Advocate at Voltron Data, ran a test with a large aggregation on a pandas DataFrame and noted an 8x speed boost from first converting the DataFrame to an Ibis MemTable and then performing the aggregations using Ibis expressions.
(Note: Ibis MemTables have all sorts of uses – particularly when joining in-memory DataFrames to tables in your backend. You can read a quick how to guide on the Ibis Project website.)
Users performing large aggregations on massive datasets in-memory through DataFrame APIs, like pandas, might run into performance issues. These aggregations are often expressed using SQL statements against the backend to trim down the data before pulling it into memory for further analysis. These SQL statements, however, are prone to dialect issues or, if parameterized, a whole plethora of Python and SQL issues.
Ibis addresses both problems. It works with Python objects directly and error-checks expressions before execution. It also uses deferred execution: you build up a recipe, and Ibis pushes the hard work of that recipe to the backend rather than doing it locally, in memory.
What Features Can We Look Forward to in the Next Ibis Release?
Version 4.0 is slated to be released early next year, and with it come many new features that pandas users will be excited about.
One feature is the first alternative in-memory expression result: pyarrow objects and RecordBatches. The default in-memory expression result has been pandas DataFrames, but with Ibis 4.0 users will be able to output directly to pyarrow objects or RecordBatches, which opens up many new possibilities for large datasets.
Version 4.0 will also bring the read function, allowing users to read files using the default backend (currently DuckDB) without spinning up a connection. Users can pass in multiple file formats including CSV and Parquet, and can also include mixed compressed/uncompressed files. This new functionality makes it easier to quickly perform exploratory analysis without boilerplate setup.
Give Ibis a Try
Downloading Ibis is easy. For standard Ibis, which includes the pandas and SQLite backends, you can install via PyPI through pip:
pip install ibis-framework
For specific backends, include [<backend>] at the end of ibis-framework and wrap it in quotes. You can even install multiple backends in one command by comma-delimiting them. For example, if you want to install both the DuckDB and Postgres backends through pip:
pip install 'ibis-framework[duckdb,postgres]'
You can also download the base package through the conda-forge channel using conda or mamba:
conda install -c conda-forge ibis-framework
And for specific backends, install ibis-<backend>. For example, for the DuckDB backend:
conda install -c conda-forge ibis-duckdb
Try out Ibis’s deferred execution by integrating it into one of your heavier workloads and let us know how it works out.
Next Up: Modernizing Your Tech Stack with Ibis
Next in this series, we will discuss what Ibis might look like in a modern tech stack and how we can use Ibis to improve workflows for data science and machine learning – big and small.
Until then, let us know if you start to use Ibis. You can send us a Tweet or DM on Twitter @VoltronData. We’d love to hear how it fits into your workflows.
Earlier this year, Voltron Data added Ibis support to our Enterprise Subscription services. If your company is interested in developing tools and workflows built on top of Ibis, please take a look at our subscription tiers and get in touch.
Stay Up to Date with Ibis
If you want to learn more or stay up to date with the Ibis project, tune into these channels:
Photo credit: Ahsan Avi