Jul 07, 2022

Engine Agnostic Analytics with Ibis

Patrick Clarke


Suppose you have a large, A-tier team of data analysts, data scientists, data engineers, platform engineers, and ETL (extract, transform, and load) developers. All of them will eventually need to manipulate data, and they are all trying to accomplish different things using, presumably, the same collection of datasets.

These datasets might be in one central database. They might also be decentralized, spanning multiple databases of different kinds and varying sizes.

These datasets might be small, or maybe they are incredibly large–so large that it’s ill-advised to yank them all down locally for transformation and reloading.

This army of data wranglers might have different approaches to accessing, manipulating, and storing their data. They might use a set of database connectors to pull in iterables, or they might read a set of query strings to pull data into pandas dataframes, or maybe they’re using an internal tool that is an amalgamation of many external libraries.

No matter the location, size, or manipulation of the data, keeping your analytics codebase organized and efficient can be a house of cards–small changes in the back end or resource constriction in the front end can incur massive amounts of technical debt and bring your A-tier team to an F-tier sluggish crawl. Your team becomes more focused on cleaning up code and fixing errors, when you’d rather they spent more time on innovation and taking advantage of cutting-edge platform changes.

There is some good news for those seeking to mitigate these problems: the Ibis framework can help solve some of the technical debt and organizational issues that arise from maintaining a Python-centric, SQL-heavy, memory-intensive analytics codebase.

Ibis is a free and open source Python project that was created by Wes McKinney in late 2014, and is actively maintained by a community of open source developers. Ibis aims to be a robust, scalable, engine-agnostic query expression Python library.

That last sentence sounds like a mouthful of complicated nonsense. Before we break it down, it might help to walk through a very simple example of the code in action.

Example

Let us walk through each line in a simple use case for the Ibis framework. For the purpose of demonstration, we are going to use a publicly-available dataset of countries and country GDP to find the total GDP for each continent in the year 2017. The dataset can be downloaded here.

This problem spans two tables: countries (for a country’s continent) and gdp (for a country’s GDP). Both tables can be joined on a country code. The steps are straightforward: translate a continent code to the continent name, join the tables, group by continent name, and aggregate (sum).

Let’s first take a look at the query that produces this result when written in SQL:

SELECT
    continent
    ,SUM(value) AS continent_gdp
FROM (
    SELECT
        CASE WHEN continent = 'NA' THEN 'North America'
            WHEN continent = 'SA' THEN 'South America'
            WHEN continent = 'EU' THEN 'Europe'
            WHEN continent = 'AF' THEN 'Africa'
            WHEN continent = 'AS' THEN 'Asia'
            WHEN continent = 'OC' THEN 'Oceania'
            WHEN continent = 'AN' THEN 'Antarctica'
            ELSE 'Unknown continent' END AS continent
        ,iso_alpha3
    FROM
        countries
) AS c
INNER JOIN (
    SELECT
        country_code
        ,value
    FROM
        gdp
    WHERE
        year = 2017
) AS g
ON g.country_code = c.iso_alpha3
GROUP BY
    continent

It’s a pretty simple query, but it’s simply not pretty. SQL may be close to written English, but some might find it cumbersome or hard to read in large batches. I wouldn’t want to read this again, much less dig through it to fix syntax errors that sprout up from a backend change. I definitely wouldn’t want to do this tens or hundreds of times for more complicated queries.

When there are multiple data users writing their own queries or maintaining old ones, sleuthing through many queries for refactors and fixes can be a nightmare and can quickly lead to developer burnout, and it gets worse the more complicated the queries are.

Enter the same result, written using the Ibis framework:

import ibis

ibis.options.interactive = True

conn = ibis.sqlite.connect('geography.db')

countries = conn.table('countries')
gdp = conn.table('gdp')

gdp_2017 = gdp.filter(gdp['year'] == 2017)

join = countries.inner_join(gdp_2017, predicates=(countries['iso_alpha3'] == gdp_2017['country_code']))

continent_name = (join['continent'].case()
    .when('NA', 'North America')
    .when('SA', 'South America')
    .when('EU', 'Europe')
    .when('AF', 'Africa')
    .when('AS', 'Asia')
    .when('OC', 'Oceania')
    .when('AN', 'Antarctica')
    .else_('Unknown continent')
    .end()
    .name('continent'))

sum_value = join['value'].sum().name('continent_gdp')
result = join.group_by(continent_name).aggregate(sum_value)['continent', 'continent_gdp']

result.head()

Let’s walk through the example now.

First, import ibis:

import ibis

To view our results in the terminal, IPython, or Jupyter notebooks, we’ll turn on interactive mode:

ibis.options.interactive = True

Typically, automated pipelines won’t need interactive mode–by leaving it off we can save ourselves some processing power. With it off, all we would need to do is call the .execute() method on whatever expression we want to see.
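
For example, with interactive mode left off, the full example above would end with an explicit execution step (a minimal sketch; for SQL backends like SQLite, .execute() typically returns a pandas DataFrame):

ibis.options.interactive = False  # the default

# Nothing has run against the database yet; execute() sends the query
# to the backend and pulls only the small result set into memory
df = result.execute()
print(df)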

Next, connect to our database. In this case, we are connecting to a SQLite database, geography.db. It has a few tables, but in this example we are using countries and gdp:

conn = ibis.sqlite.connect('geography.db')

We’ll connect to those two tables–this doesn’t do anything yet, but does establish some shortcuts for accessing and manipulating these tables. We use the conn.table method:

countries = conn.table('countries')
gdp = conn.table('gdp')

Note that we can use these tables for other tasks within the scope of our script–we’re not limited to one-query-per-table-call. If we need to access these tables, we don’t need to establish another FROM clause. We can just reference the table expression’s variable.
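
For example, the same countries table expression could feed several unrelated expressions in one script (a hypothetical pair of extra queries, assuming the countries table has a population column as in the tutorial dataset):

# Reuse the same table expression without writing another FROM clause
big_countries = countries.filter(countries['population'] > 100_000_000)
country_count = countries.count()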

Next, we filter on the gdp table to only show rows for the year 2017:

gdp_2017 = gdp.filter(gdp['year'] == 2017)

Next, we join our countries table to our filtered gdp resultset to match country names to 2017 gdp values:

join = countries.inner_join(gdp_2017, predicates=(countries['iso_alpha3'] == gdp_2017['country_code']))

To get continent names, we need to parse continent codes (two-letter codes for the seven continents). We can do this using a CASE/WHEN statement:

continent_name = (join['continent'].case()
    .when('NA', 'North America')
    .when('SA', 'South America')
    .when('EU', 'Europe')
    .when('AF', 'Africa')
    .when('AS', 'Asia')
    .when('OC', 'Oceania')
    .when('AN', 'Antarctica')
    .else_('Unknown continent')
    .end()
    .name('continent'))

To aggregate our groupings, we’ll use a sum expression and rename the column to “continent_gdp”:

sum_value = join['value'].sum().name('continent_gdp')

We can then obtain our final result by grouping our countries by continent and then aggregating using our sum aggregate expression:

result = join.group_by(continent_name).aggregate(sum_value)['continent', 'continent_gdp']

And finally, we can preview the result set by displaying the first few rows using .head():

result.head()

Ibis as a Robust, Scalable, Engine-Agnostic Query Expression Python Library

Now that we’ve seen Ibis in action, let’s pick apart what Ibis aims to be: a robust, scalable, engine-agnostic query expression Python library.

Robust

Ibis aims to be a robust framework. As an open source project, users are free to contribute according to their needs. If users want another backend implemented, they can write one up and set up a PR for implementation.

The more contributions, the more robust the framework.

Scalable

Ibis doesn’t pull data into memory until the user tells it to. The package is named after a bird, the ibis, that is commonly known to stand on top of elephants–a metaphor for lightweight Python code that tells the backend to do the heavy lifting.

Write your pipeline directions in your script, have the backend execute as much of it as possible, and then pull in the pared-down result set. From lightweight CSVs read in through pandas to heavy-duty datasets on HDFS read through PySpark, the code behaves exactly the same way: make the backend do the lifting, and pull in data only when you need it.

Engine-agnostic

Ibis supports many backends. Earlier in our example, a keen observer might have noticed that we connected to our database using a backend connection, conn = ibis.sqlite.connect('geography.db').

Supported backends will return connection objects that all behave the same way–therefore, a user can write code against one supported backend and then swap their connection to another supported backend and (theoretically) execute the exact same code that was written for and used against the first one.
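
As a sketch, moving the earlier example from SQLite to DuckDB could be as small as swapping the connection line (assuming the same tables live in a hypothetical DuckDB database file named geography.ddb):

# Before: SQLite
# conn = ibis.sqlite.connect('geography.db')

# After: DuckDB -- every expression below this line stays the same
conn = ibis.duckdb.connect('geography.ddb')

countries = conn.table('countries')
gdp = conn.table('gdp')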

Query expressions

Ibis allows users to write queries in Python–by using Ibis expressions (Ibis objects like classes, functions, and methods), users can minimize the curation of formatted strings and clunky, difficult-to-deal-with SQL statements.
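
You can also ask Ibis what it plans to send to a SQL backend without executing anything (a small sketch; depending on the backend and Ibis version, the compiled object may be a raw SQL string or a SQLAlchemy expression):

# Inspect the query generated for the expression built earlier
compiled = result.compile()
print(compiled)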

Python library

Ibis is a Python library. Python users can install ibis through common channels like pip or conda and use it in their environments to interact with their databases.
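
Installation is the usual one-liner (the distribution on PyPI and conda-forge is named ibis-framework; backend-specific extras vary by version, so check the install docs for your backend):

pip install ibis-framework
# or
conda install -c conda-forge ibis-framework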

Why Use Ibis

Typically, fast-paced analytics teams use formatted strings to manage their queries against backend databases. These queries might be large blocks of multi-line strings, or they might be a collection of strings that can be cobbled together to make a query.

For organizational purposes, variables within queries will be grouped up front in their scripts (or in their scheduled job variables) and queries will be edited and executed using those variables.

Ibis can be organized the same way–define your variables up front and any changes applied to those variables can run downstream to the rest of the query code (which would be a collection of expressions, as outlined in the example above).
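
For instance, the year in the earlier example could be lifted to the top of the script, so changing the reporting period touches a single line (REPORT_YEAR is a hypothetical variable, not part of the original example):

REPORT_YEAR = 2017

gdp_filtered = gdp.filter(gdp['year'] == REPORT_YEAR)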

Ibis has an advantage though: if a user wants to use a new backend, they can simply change the connection object to point to that new backend, and any code supported by the new backend doesn’t need to be updated.

Where users managing query strings might have to comb through every query to make sure the old code is compatible with the new backend, Ibis users will just need to swap the connector.

This feature can be used in a multitude of ways.

One case is prototyping a new workflow. Users can mock a database using the pandas backend, write query-like code against it, and once that data is in production, just change the connector to the production backend. Minimal code change to go from prototype to production:

import ibis
import pandas as pd

fpath = 'path/to/file.xlsx'

xlf = pd.ExcelFile(fpath)

# One in-memory table per worksheet, keyed by sheet name
conn = ibis.pandas.connect({
    name: xlf.parse(sheet_name=name)
    for name in xlf.sheet_names
})

table1 = conn.table('table1')
table2 = conn.table('table2')
<...>
result = tablen.group_by(exprn1)['col1', 'col2']

This prototyping process works well in IPython or Jupyter notebooks–once you’re finished testing your prototype, all you need to do is remove the pandas bits, swap out the connector, and execute.

Ibis can also easily integrate with scheduled task platforms like Apache Airflow. DAGs that use Ibis to manage database calls will use fewer resources than they would pulling the data down directly with pandas–and you don’t need to manage a plethora of formatted strings that might fail across multiple backends. Just standardized calls in Python code!
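
Here is a minimal sketch of what that might look like with Airflow's PythonOperator (the DAG name, schedule, and summary logic are hypothetical, and error handling is omitted):

from datetime import datetime

import ibis
from airflow import DAG
from airflow.operators.python import PythonOperator

def summarize_gdp():
    # The backend does the filtering and aggregating;
    # only the tiny result comes back to the Airflow worker.
    conn = ibis.sqlite.connect('geography.db')
    gdp = conn.table('gdp')
    total_2017 = gdp.filter(gdp['year'] == 2017)['value'].sum()
    print(total_2017.execute())

with DAG(
    dag_id='ibis_gdp_summary',
    start_date=datetime(2022, 7, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    PythonOperator(task_id='summarize_gdp', python_callable=summarize_gdp)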

Conclusion

Using Ibis as a lightweight interface to data can be a boon for any analytics team. It future-proofs possible migrations, unifies SQL calls for diverse backends, reduces compute load on the frontend, and provides Python users a powerful expression language that is familiar and cohesive to read.

Ibis is free and open source and supports ten backends, including Postgres, MySQL, DuckDB, and PySpark. An operational support matrix can be found on the project website, and the project’s contribution guide can get you started on bolstering your favorite backend’s support.

Ibis was updated to version 3.0.0 in April 2022–notable changes include DuckDB support, arbitrary SQL support for some backends, and several bug fixes. The project is active and is being supported by the OSS community. Read about some of the changes in the official blog post, or read the full release notes here.

If you’re looking for a more in-depth view on how people are using Ibis, Ibis was a topic at Voltron Data’s digital conference, The Data Thread, which took place on June 23, 2022. Here’s an Ibis spotlight blog post I wrote that summarizes some of the content you can find. You can also listen to these talks via a curated YouTube playlist from the event. Presenters include Phillip Cloud, Marlene Mhangami, and myself.

You can also learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Ibis.