Sep 20, 2022

One Function. No Rewrites. Explore Ibis for Filling Null Values.

Patrick Clarke

Highway during the golden hour

Have you ever wanted your query and transform code to work against your tabular data, no matter where it lived?

Well, it’s possible, and I’m going to demonstrate how you can actually achieve this with Ibis using a function that forward- or backward- fills null values in a dataframe or table.

That’s right: you can use the exact same directional fill code for your data in Postgres tables, BigQuery datasets, and even Pandas dataframes. One function – no rewrites.

Before I get to that, I’ll review what Ibis is for those who aren’t familiar with the project.

Ibis is an open source project that allows for engine-agnostic querying using python expressions. Or, put simply: it’s a python package that you can use to connect to and query data, and the API works the same way for multiple backends–like DuckDB or Postgres.

Backends are extensions to the framework. There are several first-party (Ibis Project) maintained backends like Pandas, DuckDB and Postgres, and some third party backends like Google’s ibis-BigQuery. There’s even a Substrait backend in development for more sophisticated users that are on the cutting-edge of interoperability in data tech stacks.

Ibis is a powerful tool. It significantly reduces tech debt when performing data migrations and allows for a unified, pythonic way for users to access and transform their data. I wrote about this in a walkthrough on engine-agnostic analytics, but the bottom line is:

  • No more fstrings littering your console, IDE, or notebook
  • No more extensive code refactors slowing you down from migrating to the latest and greatest hardware or software
  • No more cumbersome vendor-locking portal-based APIs

It is an incredible framework for people invested in staying up-to-date with the fastest and smartest tech available: the less time your team spends migrating your code, the more time your team can spend playing with your new toys.

Let’s go through an example of this in action by implementing bfill and ffill and then executing the same function against multiple backends.

bfill and ffill with Ibis

Fill functions fill in missing data based on present data. This is typically used in time series data with holes in values between dates, where values aren’t subject to change over time (or, if they do, it’s okay to assume that they don’t).

For example, the forward fill function works like this:

  1. If this is the first value, then skip to (3), even if this value is null
  2. If this value is null, then replace it with the previous value (even if the previous value was null)
  3. Go to the next value

Basically, you’re taking all of the null values after a non-null value and replacing them with that non-null value. I explain this in more detail in the strategy section on the original blog post.

Taken from the Ibis Project How To guide I contributed, this can be written as follows:

# Group your data by your event partition and then order your series (default ascending)
win = ibis.window(group_by=data.event_id, order_by=data.measured_on, following=0)
# Create a grouping that is a rolling count of non-null values within each event
# This creates a partition where each set has no more than one non-null value
grouped = data.mutate(grouper=data.measurement.count().over(win))
# Group by your newly-created grouping and, in each set,
# set all values to the one non-null value in that set (if it exists)
result = (
    grouped
    .group_by([grouped.event_id, grouped.grouper])
    .mutate(ffill=grouped.measurement.max())
)
# execute to get a pandas dataframe, sort values in case your backend shuffles
result.execute().sort_values(by=['event_id', 'measured_on'])

I decided to put this in a function and execute it against the multiple backends using the same data I used in the blog post. Here is the function (with some small modifications):

import ibis

def dirfill_na(data, order_by='measured_on', group_by=None, value='measurement', fill_dir='f'):
    dirs = {
        'f': ibis.asc
        ,'b': ibis.desc
    }
    if fill_dir not in dirs:
        raise ValueError(f"method must be in {dirs.keys()}")
    else:
        strat_order = dirs[fill_dir](data[order_by])

    # Create a window that orders your series, default ascending
    win = ibis.window(group_by=None if group_by is None else data[group_by], order_by=strat_order, following=0)
    # Create a grouping that is a rolling count of non-null values
    # This creates a partition where each set has no more than one non-null value
    grouped = data.mutate(grouper=data[value].count().over(win))
    # Group by your newly-created grouping and, in each set,
    # set all values to the one non-null value in that set (if it exists)
    result = (
    grouped
        .group_by([grouped.grouper] if not group_by else [grouped[group_by], grouped.grouper])
        .mutate(filled=grouped[value].max())
        .relabel({'filled': f'{fill_dir}fill'})
    )
    # execute to get a pandas dataframe, sort values in case your backend shuffles
    return result.sort_by( (order_by, True) )

I can use this function to pick the order column I care about, pick a strategy without thinking about the mechanics of the fill, and decide if I even want to group before filling (if group_by is None, then treat the entire set as one event).

To test interoperability, I tested this on a few backends. I kept the DataFrame around to test against the Pandas backend, I saved it to parquet to test against the DuckDB backend, uploaded it to a .db file to test against SQLite, uploaded it to a table in Postgres to test against the Postgres backend, and uploaded it to a BigQuery Dataset to test against BigQuery.

And it worked against all of them! You can find the code rendered in a notebook on my GitHub page.

To keep up with the latest Ibis news, follow @IbisData on Twitter, subscribe to Phillip Cloud’s live streams and videos on YouTube, star the repository on GitHub, and browse the Ibis Project website.

The team at Voltron Data is dedicated to continuously improving Ibis and contributed to the recent release of version 3.2. Ibis is free and open source — install it today through pip: pip install ibis-framework or conda: conda install -c conda-forge ibis-framework.

Learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Ibis.

Photo by Frankie Lopez