Mar 20, 2023

SQL and Data Frames Unite with Ibis

Marlene Mhangami

aerial shot of highway system
Work in Your Language of Choice
Multi-language tools allow users the freedom to work where they want, how they want, unleashing greater productivity. We design and build composable data systems that enable users to write code once, in any language, on any device, and at any scale to develop pipelines faster.

SQL or Python?

Choosing the right programming language for data warehousing pipelines can be challenging. SQL and Python are two of the most popular options, each with its own strengths and weaknesses. SQL is known for its performance and ability to handle structured data, while Python is flexible, easy-to-learn, and has a vast library of tools for data analysis, machine learning, and more. In the most recent Stack Overflow Developer Survey, a nearly equal percentage of developers reported using SQL or Python extensively in their work. Teams often choose to use one or both of these languages based on specific use cases or preferences.

Stack Overflow Developer Survey 2022 Top Technologies Section

Stack Overflow Developer Survey 2022 Top Technologies Section

However, when using different languages, collaborating can become challenging. To work together effectively, teams need to bridge the language gap and ensure that everyone is on the same page. In this blog post, we’ll explore how Ibis provides a uniform Python API for data warehousing tools and generates SQL that can be shared with team members or friends. By using Ibis, teams can work more efficiently, regardless of the programming languages they use.

Bridging the Gap with Ibis

Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It helps Python users explore and transform data of any size, stored anywhere. To do this, under the hood Ibis compiles and generates SQL for many backends, including MySQL, PostgresSQLDuckDBPySpark, and BigQuery, amongst others. Although this happens behind the scenes, the generated SQL is easily accessible using the ibis.show_sql feature! Let’s look at a practical example of this.

Python Analytics with Ibis

We’ll start by using the new Ibis examples (ex) feature to grab the ‘palmer penguins’ dataset. We can import the ex module from ibis.interactive and access the penguins dataset by running ex.penguins.fetch(). Here’s the code we need to do this:

In [8]: from ibis.interactive import ex

In [11]: t = ex.penguins.fetch()

In [12]: t
Out[12]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
 species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
 string   string     float64         float64        int64              int64        string  int64 
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
 Adelie   Torgersen            39.1           18.7                181         3750  male     2007 
 Adelie   Torgersen            39.5           17.4                186         3800  female   2007 
 Adelie   Torgersen            40.3           18.0                195         3250  female   2007 
 Adelie   Torgersen             nan            nan                                        2007 
 Adelie   Torgersen            36.7           19.3                193         3450  female   2007 
 Adelie   Torgersen            39.3           20.6                190         3650  male     2007 
 Adelie   Torgersen            38.9           17.8                181         3625  female   2007 
 Adelie   Torgersen            39.2           19.6                195         4675  male     2007 
 Adelie   Torgersen            34.1           18.1                193         3475          2007 
 Adelie   Torgersen            42.0           20.2                190         4250          2007 
                                                                                          
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Next, we’ll use a single Ibis expression to do some fairly complex analytics work. Our code will:

  • compute the z-score of every numeric column and replace the existing data with that normalized value
  • group the data by species and min/max normalization
  • filter the resulting table so we keep rows where all the bill size z-score related columns’ absolute values are greater than 2

This is what the filter expression looks like:


In [78]: t.drop("year").group_by("species").mutate(
    ...:     s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std()))
    ...: ).filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
Out[78]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
 species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     bill_length_mm_zscore  bill_depth_mm_zscore  flipper_length_mm_zscore  body_mass_g_zscore 
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
 string   string     float64         float64        int64              int64        string  float64                float64               float64                   float64            
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
 Gentoo   Biscoe               59.6           17.0                230         6050  male                 3.924621              2.056508                  1.975799            1.932062 
 Gentoo   Biscoe               55.9           17.0                228         5600  male                 2.724046              2.056508                  1.667394            1.039411 
 Adelie   Torgersen            46.0           21.5                194         4200  male                 2.706539              2.592071                  0.618760            1.088911 
 Adelie   Dream                32.1           15.5                188         3050  female              -2.512345             -2.339505                 -0.298747           -1.418906 
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘```

With Ibis we are able to express complex filters more concisely, in this case in just one line of code!

SQL Generation with Ibis

As mentioned earlier, if we’re working with another team that needs to access the same information but would prefer using SQL, Ibis can generate this code for us too. This is the code we’d need to run in order to do this:

In [79]: ibis.show_sql(
    ...:     t.drop("year")
    ...:     .group_by("species")
    ...:     .mutate(s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std())))
    ...:     .filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
    ...: )

Out [80]:
WITH t0 AS (
  SELECT
    t2.species AS species,
    t2.island AS island,
    t2.bill_length_mm AS bill_length_mm,
    t2.bill_depth_mm AS bill_depth_mm,
    t2.flipper_length_mm AS flipper_length_mm,
    t2.body_mass_g AS body_mass_g,
    t2.sex AS sex
  FROM ibis_read_csv_3 AS t2
), t1 AS (
  SELECT
    t0.species AS species,
    t0.island AS island,
    t0.bill_length_mm AS bill_length_mm,
    t0.bill_depth_mm AS bill_depth_mm,
    t0.flipper_length_mm AS flipper_length_mm,
    t0.body_mass_g AS body_mass_g,
    t0.sex AS sex,
    (
      t0.bill_length_mm - AVG(t0.bill_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.bill_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS bill_length_mm_zscore,
    (
      t0.bill_depth_mm - AVG(t0.bill_depth_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.bill_depth_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS bill_depth_mm_zscore,
    (
      t0.flipper_length_mm - AVG(t0.flipper_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.flipper_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS flipper_length_mm_zscore,
    (
      t0.body_mass_g - AVG(t0.body_mass_g) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.body_mass_g) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS body_mass_g_zscore
  FROM t0
)
SELECT
  t1.species,
  t1.island,
  t1.bill_length_mm,
  t1.bill_depth_mm,
  t1.flipper_length_mm,
  t1.body_mass_g,
  t1.sex,
  t1.bill_length_mm_zscore,
  t1.bill_depth_mm_zscore,
  t1.flipper_length_mm_zscore,
  t1.body_mass_g_zscore
FROM t1
WHERE
  ABS(t1.bill_length_mm_zscore) > CAST(2 AS SMALLINT)
  AND ABS(t1.bill_depth_mm_zscore) > CAST(2 AS SMALLINT)

The generated SQL query is clearly much longer than the Ibis expression, but is fairly trivial to generate using ibis.show_sql.

Bonus: From SQL Back to Ibis

As an added bonus, Ibis can take in any SQL query and turn it into an Ibis expression. To show how this works, let’s take the expression we used above, turn it into SQL, and then turn it straight back into the original expression. Here’s the code:

In [16]: expr = (
    ...:     t.drop("year")
    ...:     .group_by("species")
    ...:     .mutate(s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std())))
    ...:     .filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
    ...: )

#Lets convert expression to SQL and store this in the SQL variable
In [17]: sql = str(ibis.to_sql(expr))

#Lets see what is in the sql variable
In [18]: sql[:30]
#Looks like we have SQL
Out[18]: 'WITH t0 AS (\n  SELECT\n    t2.s'

#Now lets ingest this SQL string back into Ibis and use it
In [19]: ibis.get_backend().sql(sql)
Out[19]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
 species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     bill_length_mm_zscore  bill_depth_mm_zscore  flipper_length_mm_zscore  body_mass_g_zscore 
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
 string   string     float64         float64        int64              int64        string  float64                float64               float64                   float64            
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
 Gentoo   Biscoe               55.9           17.0                228         5600  male                 2.724046              2.056508                  1.667394            1.039411 
 Gentoo   Biscoe               59.6           17.0                230         6050  male                 3.924621              2.056508                  1.975799            1.932062 
 Adelie   Torgersen            46.0           21.5                194         4200  male                 2.706539              2.592071                  0.618760            1.088911 
 Adelie   Dream                32.1           15.5                188         3050  female              -2.512345             -2.339505                 -0.298747           -1.418906 
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

#And lets compare this with the original expression generated
In [20]: expr
Out[20]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
 species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     bill_length_mm_zscore  bill_depth_mm_zscore  flipper_length_mm_zscore  body_mass_g_zscore 
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
 string   string     float64         float64        int64              int64        string  float64                float64               float64                   float64            
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
 Gentoo   Biscoe               55.9           17.0                228         5600  male                 2.724046              2.056508                  1.667394            1.039411 
 Gentoo   Biscoe               59.6           17.0                230         6050  male                 3.924621              2.056508                  1.975799            1.932062 
 Adelie   Torgersen            46.0           21.5                194         4200  male                 2.706539              2.592071                  0.618760            1.088911 
 Adelie   Dream                32.1           15.5                188         3050  female              -2.512345             -2.339505                 -0.298747           -1.418906 
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

Of course, we are not surprised to see the results match, but it is important to note that Ibis can ingest any SQL expression you give it, not just the one it generated. Imagine the collaboration opportunities this could bring!

Conclusion

As the demand for data analysis and management continues to grow, it’s important for teams to work collaboratively and effectively across different programming languages. SQL and Python are both powerful tools for handling data, and teams can leverage their strengths to extract valuable insights. Ibis is my library of choice to unlock the power of both languages, maximizing productivity. At Voltron Data we promote collaboration and data transparency, helping teams unlock the full potential of their data and drive better business outcomes.

If you’re working with Ibis and want to accelerate your success with the project, contact us to learn about our enterprise support options.

Photo by Deva Darshan