Oct 20, 2022
Ibis Explained: Making DataFrames, Big and Small, More Delightful
Patrick Clarke, Alison Hill
At Voltron Data, we talk a lot about building bridges, not walls. In this post, we’re providing a peek into what this looks like internally as we build a culture of knowledge sharing and documentation. It isn’t just an empty tagline! We are constantly challenging ourselves to create opportunities for our employees, including folks who might not identify themselves as highly technical, to learn about our projects and the tools being developed.
Speaking of bridges: we’re investing in and contributing to projects like Ibis that function as a bridge between users and their data. Internally, we just wrapped up our week of learning about “Ibis 101”, which ended with an “ask me anything” session.
In case you need a primer: Ibis helps Python users explore and transform data of any size, stored anywhere–from CSVs read as in-memory DataFrames to tables stored in distributed data warehouses. It provides users with a DataFrame API to manipulate rows and columns, utilities to connect to 10+ query engines, and a mode for pushing code execution to those query engines.
Today, we’ll share the discussion centered around pandas, another bridge between users and their data, and how Ibis and pandas are related.
How Does Ibis Differ From pandas?
First, why was there so much discussion about pandas? The pandas Python project is a sprawling behemoth of a DataFrame and analytics package. It has close to three million daily downloads and is one of the most popular DataFrame APIs for Python analytics workflows (if not the most popular).
So, what does Ibis have to do with pandas? For starters, they have one thing in common: both were created by Wes McKinney, Co-Founder and CTO of Voltron Data, to facilitate and streamline data analytics in Python.
The difference between the two is how they go about accomplishing this.
- pandas focuses on streamlining transformation and analytics on data locally.
- Ibis focuses on deferred execution and pushing the heavy lifting of transformations to wherever the data lives–reducing the compute effort locally so resources can be better allocated in shared spaces. This means users can execute at the speed of their backend, not their local computer (h/t to Marlene Mhangami for this quote!).
When Would I Use Ibis Over pandas?
Modern datasets can contain millions – sometimes billions – of rows of data, and local execution requires much, much more memory than it did 10 years ago. We recommend reading McKinney’s article, “10 things I hate about pandas”, where he noted, “pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset.” Managing large datasets locally can get prohibitively expensive for growing teams of data scientists, data engineers, and students.
Developers can use Ibis as a consistent Python API to pare down large datasets before pulling what they need into memory. By using Ibis, developers can avoid writing f-strings in varying SQL dialects, reference Python objects directly in their pre-processing, and push all of the hard work onto the backend.
Once all of the memory-intensive transforms are complete, developers can use Ibis to pull the final set into memory as a pandas DataFrame and utilize all of the functionality that the pandas project has built up over time (or, upcoming in 4.0: alternative formats like pyarrow Tables and RecordBatches and their corresponding functionality).
They’re Similar, but How Similar?
Conceptually, Ibis and pandas are very similar: they both manipulate tabular data. We can take a DataFrame, and transform it by operating on columns or by creating new ones.
For users interested in how Ibis compares to pandas for their particular workflow, we can look at the similarities between the two APIs to see how easy it is to learn Ibis coming from pandas.
First up is the return type. The default return type from Ibis’s execute is a pandas DataFrame. With this, all pandas DataFrame methods and operations are available. You can use Ibis to filter down your data so that it’s small enough to pull into memory, and then still use the functionality of a pandas DataFrame to complete the task at hand.
Next is column referencing. Referencing a single column in an Ibis expression works exactly the way you reference a column as a Series in pandas–simply use single brackets and a string column name or ColumnExpressions.
There is a slight difference between Ibis and pandas, though. In pandas, if you want to select multiple columns (to return a DataFrame containing a subset of columns), you would use single brackets enclosing a list of string column names. In Ibis, you can use single or double brackets and comma-delimited string column names or ColumnExpressions.
The more Ibis-y way of selecting columns is to use the select method on a TableExpression, though, so we recommend using that instead (the select method also accepts string names or ColumnExpressions).
The last similarity that we’ll discuss is groupby. Ibis supports groupby and aggregations, just like pandas. You can group a TableExpression just as you would a pandas DataFrame and then aggregate.
Next up: Code Portability and Performance Gains with Ibis
Next in this series, we will discuss what we find interesting and exciting about switching certain workflows from pandas to Ibis, particularly code portability and performance gains.
In the meantime, download and try Ibis today. You might find performance boosts by switching some pandas loads and transforms for Ibis selects, filters, and mutates.
Earlier this year, Voltron Data added Ibis support to our Enterprise Subscription services. If your company is interested in developing tools and workflows built on top of Ibis, please take a look at our subscription tiers and get in touch.
Stay up to Date with Ibis
If you want to learn more or stay up to date with the Ibis project, tune into these channels:
Photo by Heiko E. Janssen