Voltron Data Logo
About

Learn more about our company

Contact Us

Get in touch with our team

Theseus

  • How it Works

    Core concepts and architecture overview

  • Control Plane

    Kubernetes deployment guide and best practices

  • Query Profiler

    Analyze and optimize query performance

Arrow
Loading...

In-memory columnar data processing

Ibis
Loading...

Python dataframe API for multiple backends

RAPIDS
Loading...

GPU-accelerated data science and analytics

Dev Blog

Latest updates and technical insights

Benchmarks Report

Read about our 2024 benchmarks for our data engine, Theseus.

The Composable Codex

A 5-part guide to understanding composable

Try Theseus

Product

  • How it Works
  • Control Plane
  • Query Profiler

Resources

  • Blog
  • Composable Codex
  • Benchmarks

Getting Started

  • Test Drive

Theseus

Built for AI workloads, Theseus is a high-performance SQL engine with GPU acceleration.

© 2021-2025 Voltron Data, Inc. All rights reserved.

Terms of ServicePrivacy PolicyCookie Policy
Voltron Data Logo
About

Learn more about our company

Contact Us

Get in touch with our team

Theseus

  • How it Works

    Core concepts and architecture overview

  • Control Plane

    Kubernetes deployment guide and best practices

  • Query Profiler

    Analyze and optimize query performance

Arrow
Loading...

In-memory columnar data processing

Ibis
Loading...

Python dataframe API for multiple backends

RAPIDS
Loading...

GPU-accelerated data science and analytics

Dev Blog

Latest updates and technical insights

Benchmarks Report

Read about our 2024 benchmarks for our data engine, Theseus.

The Composable Codex

A 5-part guide to understanding composable

Try Theseus

SQL and Data Frames Unite with Ibis

M

Marlene Mhangami

March 20, 2023
SQL and Data Frames Unite with Ibis

Work in Your Language of Choice
Multi-language tools allow users the freedom to work where they want, how they want, unleashing greater productivity. We design and build composable data systems that enable users to write code once, in any language, on any device, and at any scale to develop pipelines faster.

SQL or Python?

Choosing the right programming language for data warehousing pipelines can be challenging. SQL and Python are two of the most popular options, each with its own strengths and weaknesses. SQL is known for its performance and ability to handle structured data, while Python is flexible, easy-to-learn, and has a vast library of tools for data analysis, machine learning, and more. In the most recent Stack Overflow Developer Survey, a nearly equal percentage of developers reported using SQL or Python extensively in their work. Teams often choose to use one or both of these languages based on specific use cases or preferences.

Stack Overflow Developer Survey 2022 Top Technologies Section

However, when using different languages, collaborating can become challenging. To work together effectively, teams need to bridge the language gap and ensure that everyone is on the same page. In this blog post, we’ll explore how Ibis provides a uniform Python API for data warehousing tools and generates SQL that can be shared with team members or friends. By using Ibis, teams can work more efficiently, regardless of the programming languages they use.

Bridging the Gap with Ibis

Ibis is a Python library that provides a lightweight, universal interface for data wrangling. It helps Python users explore and transform data of any size, stored anywhere. To do this, under the hood Ibis compiles and generates SQL for many backends, including MySQL, PostgresSQL, DuckDB, PySpark, and BigQuery, amongst others. Although this happens behind the scenes, the generated SQL is easily accessible using the ibis.show_sql feature! Let’s look at a practical example of this.

Python Analytics with Ibis

We’ll start by using the new Ibis examples (ex) feature to grab the ‘palmer penguins’ dataset. We can import the ex module from ibis.interactive and access the penguins dataset by running ex.penguins.fetch(). Here’s the code we need to do this:

python
In [8]: from ibis.interactive import ex

In [11]: t = ex.penguins.fetch()

In [12]: t
Out[12]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
│ Adelie  │ Torgersen │            nan │           nan │                 ∅ │           ∅ │ ∅      │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male   │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male   │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ ∅      │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ ∅      │  2007 │
│ …       │ …         │              … │             … │                 … │           … │ …      │     … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

Next, we’ll use a single Ibis expression to do some fairly complex analytics work. Our code will:

  • compute the z-score of every numeric column and replace the existing data with that normalized value
  • group the data by species and min/max normalization
  • filter the resulting table so we keep rows where all the bill size z-score related columns’ absolute values are greater than 2

This is what the filter expression looks like:

python

In [78]: t.drop("year").group_by("species").mutate(
    ...:     s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std()))
    ...: ).filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
Out[78]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ bill_length_mm_zscore ┃ bill_depth_mm_zscore ┃ flipper_length_mm_zscore ┃ body_mass_g_zscore ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ float64               │ float64              │ float64                  │ float64            │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
│ Gentoo  │ Biscoe    │           59.6 │          17.0 │               230 │        6050 │ male   │              3.924621 │             2.056508 │                 1.975799 │           1.932062 │
│ Gentoo  │ Biscoe    │           55.9 │          17.0 │               228 │        5600 │ male   │              2.724046 │             2.056508 │                 1.667394 │           1.039411 │
│ Adelie  │ Torgersen │           46.0 │          21.5 │               194 │        4200 │ male   │              2.706539 │             2.592071 │                 0.618760 │           1.088911 │
│ Adelie  │ Dream     │           32.1 │          15.5 │               188 │        3050 │ female │             -2.512345 │            -2.339505 │                -0.298747 │          -1.418906 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘```

With Ibis we are able to express complex filters more concisely, in this case in just one line of code!

SQL Generation with Ibis

As mentioned earlier, if we’re working with another team that needs to access the same information but would prefer using SQL, Ibis can generate this code for us too. This is the code we’d need to run in order to do this:

python
In [79]: ibis.show_sql(
    ...:     t.drop("year")
    ...:     .group_by("species")
    ...:     .mutate(s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std())))
    ...:     .filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
    ...: )

Out [80]:
WITH t0 AS (
  SELECT
    t2.species AS species,
    t2.island AS island,
    t2.bill_length_mm AS bill_length_mm,
    t2.bill_depth_mm AS bill_depth_mm,
    t2.flipper_length_mm AS flipper_length_mm,
    t2.body_mass_g AS body_mass_g,
    t2.sex AS sex
  FROM ibis_read_csv_3 AS t2
), t1 AS (
  SELECT
    t0.species AS species,
    t0.island AS island,
    t0.bill_length_mm AS bill_length_mm,
    t0.bill_depth_mm AS bill_depth_mm,
    t0.flipper_length_mm AS flipper_length_mm,
    t0.body_mass_g AS body_mass_g,
    t0.sex AS sex,
    (
      t0.bill_length_mm - AVG(t0.bill_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.bill_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS bill_length_mm_zscore,
    (
      t0.bill_depth_mm - AVG(t0.bill_depth_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.bill_depth_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS bill_depth_mm_zscore,
    (
      t0.flipper_length_mm - AVG(t0.flipper_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.flipper_length_mm) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS flipper_length_mm_zscore,
    (
      t0.body_mass_g - AVG(t0.body_mass_g) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ) / STDDEV_SAMP(t0.body_mass_g) OVER (PARTITION BY t0.species ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS body_mass_g_zscore
  FROM t0
)
SELECT
  t1.species,
  t1.island,
  t1.bill_length_mm,
  t1.bill_depth_mm,
  t1.flipper_length_mm,
  t1.body_mass_g,
  t1.sex,
  t1.bill_length_mm_zscore,
  t1.bill_depth_mm_zscore,
  t1.flipper_length_mm_zscore,
  t1.body_mass_g_zscore
FROM t1
WHERE
  ABS(t1.bill_length_mm_zscore) > CAST(2 AS SMALLINT)
  AND ABS(t1.bill_depth_mm_zscore) > CAST(2 AS SMALLINT)

The generated SQL query is clearly much longer than the Ibis expression, but is fairly trivial to generate using ibis.show_sql.

Bonus: From SQL Back to Ibis

As an added bonus, Ibis can take in any SQL query and turn it into an Ibis expression. To show how this works, let’s take the expression we used above, turn it into SQL, and then turn it straight back into the original expression. Here’s the code:

python
In [16]: expr = (
    ...:     t.drop("year")
    ...:     .group_by("species")
    ...:     .mutate(s.across(s.numeric(), dict(zscore=(_ - _.mean()) / _.std())))
    ...:     .filter(s.if_all(s.startswith("bill") & s.endswith("_zscore"), _.abs() > 2))
    ...: )

#Lets convert expression to SQL and store this in the SQL variable
In [17]: sql = str(ibis.to_sql(expr))

#Lets see what is in the sql variable
In [18]: sql[:30]
#Looks like we have SQL
Out[18]: 'WITH t0 AS (\n  SELECT\n    t2.s'

#Now lets ingest this SQL string back into Ibis and use it
In [19]: ibis.get_backend().sql(sql)
Out[19]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ bill_length_mm_zscore ┃ bill_depth_mm_zscore ┃ flipper_length_mm_zscore ┃ body_mass_g_zscore ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ float64               │ float64              │ float64                  │ float64            │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
│ Gentoo  │ Biscoe    │           55.9 │          17.0 │               228 │        5600 │ male   │              2.724046 │             2.056508 │                 1.667394 │           1.039411 │
│ Gentoo  │ Biscoe    │           59.6 │          17.0 │               230 │        6050 │ male   │              3.924621 │             2.056508 │                 1.975799 │           1.932062 │
│ Adelie  │ Torgersen │           46.0 │          21.5 │               194 │        4200 │ male   │              2.706539 │             2.592071 │                 0.618760 │           1.088911 │
│ Adelie  │ Dream     │           32.1 │          15.5 │               188 │        3050 │ female │             -2.512345 │            -2.339505 │                -0.298747 │          -1.418906 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

#And lets compare this with the original expression generated
In [20]: expr
Out[20]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ bill_length_mm_zscore ┃ bill_depth_mm_zscore ┃ flipper_length_mm_zscore ┃ body_mass_g_zscore ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ float64               │ float64              │ float64                  │ float64            │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────────────────────┼──────────────────────┼──────────────────────────┼────────────────────┤
│ Gentoo  │ Biscoe    │           55.9 │          17.0 │               228 │        5600 │ male   │              2.724046 │             2.056508 │                 1.667394 │           1.039411 │
│ Gentoo  │ Biscoe    │           59.6 │          17.0 │               230 │        6050 │ male   │              3.924621 │             2.056508 │                 1.975799 │           1.932062 │
│ Adelie  │ Torgersen │           46.0 │          21.5 │               194 │        4200 │ male   │              2.706539 │             2.592071 │                 0.618760 │           1.088911 │
│ Adelie  │ Dream     │           32.1 │          15.5 │               188 │        3050 │ female │             -2.512345 │            -2.339505 │                -0.298747 │          -1.418906 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────────────────────┴──────────────────────┴──────────────────────────┴────────────────────┘

Of course, we are not surprised to see the results match, but it is important to note that Ibis can ingest any SQL expression you give it, not just the one it generated. Imagine the collaboration opportunities this could bring!

Conclusion

As the demand for data analysis and management continues to grow, it’s important for teams to work collaboratively and effectively across different programming languages. SQL and Python are both powerful tools for handling data, and teams can leverage their strengths to extract valuable insights. Ibis is my library of choice to unlock the power of both languages, maximizing productivity. At Voltron Data we promote collaboration and data transparency, helping teams unlock the full potential of their data and drive better business outcomes.

If you’re working with Ibis and want to accelerate your success with the project, contact us to learn about our enterprise support options.

Photo by Deva Darshan

Product

  • How it Works
  • Control Plane
  • Query Profiler

Resources

  • Blog
  • Composable Codex
  • Benchmarks

Getting Started

  • Test Drive

Theseus

Built for AI workloads, Theseus is a high-performance SQL engine with GPU acceleration.

© 2021-2025 Voltron Data, Inc. All rights reserved.

Terms of ServicePrivacy PolicyCookie Policy