May 11, 2023

Make pandas Faster with DuckDB

Phillip Cloud and Kae Suarez

train platform timeplase of train going by
note icon
TL;DR Accelerating analytics gets you more insights faster, saving money on resources and unlocking profitable new directions. Accelerating without sacrificing your current resources saves even more, and Ibis can get you there.

pandas & DuckDB

pandas is a powerful, reliable tool. Many keep it in the front of their toolbox for its ease with file ingestion and data wrangling. It has become the de facto standard for how dataframes should look in Python, inspiring other libraries to have similar interfaces. However, in some intensive workflows, pandas can slow or crash, especially with high data volumes.

DuckDB, on the other hand, is an efficient and flexible in-memory database system. It is very fast in local execution and can handle datasets and queries far beyond pandas’ capability. Without Ibis, though, DuckDB doesn’t have a Pythonic Dataframe API, which is a large part of the appeal of using pandas in the first place. Furthermore, if you already have a workflow that relies on generating pandas Dataframes as your unit of data, refactoring to rely on DuckDB instead takes engineer hours — even with Ibis! Ibis is not a drop-in replacement for pandas, and refactoring can be a costly task.

What if we could marry these? Use pandas to its limits, then cross the finish line using DuckDB? Without ever refactoring the underlying code, only the last mile?

Finishing the Last Mile with Ibis

How do you use it?

The memtable API in Ibis implements a way to load pandas Dataframes quickly and easily into supported database systems, enabling users to upgrade performance the exact moment they need it. Here, we’ll use DuckDB — it’s the default, but we’ll intentionally set it anyways.

What does quickly and easily mean, though? Well, first, let’s set up.

import pandas as pd
import numpy as np
import ibis

ibis.set_backend("duckdb")

ibis.options.interactive = True

df = pd.read_parquet("nyc_taxi/year=2020/month=1/part0.parquet")
df

Out:
vendor_name	pickup_datetime	dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	rate_code	store_and_fwd	dropoff_longitude	...	fare_amount	extra	mta_tax	tip_amount	tolls_amount	total_amount	improvement_surcharge	congestion_surcharge	pickup_location_id	dropoff_location_id
0	VTS	2020-01-01 09:40:30	2020-01-01 09:48:29	1.0	1.74	NaN	NaN	Standard rate	No	NaN	...	8.0	0.0	0.5	2.26	0.0	13.56	0.3	2.5	163	186
1	CMT	2020-01-01 09:15:31	2020-01-01 09:33:08	1.0	4.30	NaN	NaN	Standard rate	No	NaN	...	17.0	2.5	0.5	0.00	0.0	20.30	0.3	2.5	151	68
2	CMT	2020-01-01 09:53:53	2020-01-01 10:01:22	1.0	1.10	NaN	NaN	Standard rate	Yes	NaN	...	6.5	2.5	0.5	0.00	0.0	9.80	0.3	2.5	229	230
3	CMT	2020-01-01 09:11:33	2020-01-01 09:18:23	2.0	1.10	NaN	NaN	Standard rate	No	NaN	...	7.0	2.5	0.5	0.00	0.0	10.30	0.3	2.5	48	164
4	CMT	2020-01-01 09:30:43	2020-01-01 09:37:22	2.0	0.90	NaN	NaN	Standard rate	No	NaN	...	6.5	2.5	0.5	1.00	0.0	10.80	0.3	2.5	48	161
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6404791	VTS	2020-01-31 21:24:34	2020-01-31 21:35:48	2.0	1.55	NaN	NaN	Standard rate	No	NaN	...	9.0	0.5	0.5	0.00	0.0	12.80	0.3	2.5	239	24
6404792	VTS	2020-01-31 21:52:22	2020-01-31 22:50:49	1.0	17.11	NaN	NaN	Standard rate	No	NaN	...	53.0	0.5	0.5	4.00	0.0	60.80	0.3	2.5	161	22
6404793	VTS	2020-01-31 21:12:02	2020-01-31 21:20:07	1.0	0.97	NaN	NaN	Standard rate	No	NaN	...	7.0	0.5	0.5	1.08	0.0	11.88	0.3	2.5	170	170
6404794	VTS	2020-01-31 21:11:17	2020-01-31 21:20:10	1.0	1.73	NaN	NaN	Standard rate	No	NaN	...	8.0	0.5	0.5	1.00	0.0	12.80	0.3	2.5	229	263
6404795	VTS	2020-01-31 21:32:44	2020-01-31 22:08:54	1.0	5.77	NaN	NaN	Standard rate	No	NaN	...	26.5	0.5	0.5	4.00	0.0	34.30	0.3	2.5	237	223
6404796 rows × 22 columns

There are our imports, configuration, and Dataframe read.

Now, how hard is it to get this into DuckDB with Ibis?


t = ibis.memtable(df)
t

Out:

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
 vendor_name  pickup_datetime      dropoff_datetime     passenger_count  trip_distance  pickup_longitude  pickup_latitude  rate_code      store_and_fwd  dropoff_longitude  dropoff_latitude  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  total_amount  improvement_surcharge  congestion_surcharge  pickup_location_id  dropoff_location_id 
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
 string       timestamp            timestamp            float64          float64        float64           float64          string         string         float64            float64           string        float64      float64  float64  float64     float64       float64       float64                float64               int64               int64               
├─────────────┼─────────────────────┼─────────────────────┼─────────────────┼───────────────┼──────────────────┼─────────────────┼───────────────┼───────────────┼───────────────────┼──────────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼──────────────┼───────────────────────┼──────────────────────┼────────────────────┼─────────────────────┤
 VTS          2020-01-01 09:40:30  2020-01-01 09:48:29              1.0           1.74               nan              nan  Standard rate  No                           nan               nan  Credit card           8.0      0.0      0.5        2.26          0.00         13.56                    0.3                   2.5                 163                  186 
 CMT          2020-01-01 09:15:31  2020-01-01 09:33:08              1.0           4.30               nan              nan  Standard rate  No                           nan               nan  Cash                 17.0      2.5      0.5        0.00          0.00         20.30                    0.3                   2.5                 151                   68 
 CMT          2020-01-01 09:53:53  2020-01-01 10:01:22              1.0           1.10               nan              nan  Standard rate  Yes                          nan               nan  Cash                  6.5      2.5      0.5        0.00          0.00          9.80                    0.3                   2.5                 229                  230 
 CMT          2020-01-01 09:11:33  2020-01-01 09:18:23              2.0           1.10               nan              nan  Standard rate  No                           nan               nan  Cash                  7.0      2.5      0.5        0.00          0.00         10.30                    0.3                   2.5                  48                  164 
 CMT          2020-01-01 09:30:43  2020-01-01 09:37:22              2.0           0.90               nan              nan  Standard rate  No                           nan               nan  Credit card           6.5      2.5      0.5        1.00          0.00         10.80                    0.3                   2.5                  48                  161 
 VTS          2020-01-01 08:56:11  2020-01-01 09:18:25              1.0          15.35               nan              nan  Standard rate  No                           nan               nan  Cash                 42.0      0.0      0.5        0.00          0.00         42.80                    0.3                   0.0                 132                  150 
 VTS          2020-01-01 09:49:08  2020-01-01 09:53:57              1.0           1.44               nan              nan  Standard rate  No                           nan               nan  Credit card           6.0      0.0      0.5        1.40          0.00         10.70                    0.3                   2.5                  50                   50 
 CMT          2020-01-01 09:00:44  2020-01-01 09:05:28              2.0           0.30               nan              nan  Standard rate  No                           nan               nan  Cash                  3.5      2.5      0.5        0.00          0.00          6.80                    0.3                   2.5                  50                   48 
 CMT          2020-01-01 09:06:31  2020-01-01 09:28:44              1.0           7.70               nan              nan  Standard rate  No                           nan               nan  Cash                 24.5      2.5      0.5        0.00          6.12         33.92                    0.3                   2.5                  48                   80 
 CMT          2020-01-01 09:46:49  2020-01-01 09:55:19              1.0           2.10               nan              nan  Standard rate  No                           nan               nan  Credit card           9.0      2.5      0.5        1.00          0.00         13.30                    0.3                   2.5                  68                  114 
                                                                                                                                                                                                                                                                                                                                                  
└─────────────┴─────────────────────┴─────────────────────┴─────────────────┴───────────────┴──────────────────┴─────────────────┴───────────────┴───────────────┴───────────────────┴──────────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴──────────────┴───────────────────────┴──────────────────────┴────────────────────┴─────────────────────┘

Just as easy as passing the Dataframe, and now you have an in-memory DuckDB database holding the data from the pandas Dataframe.

A nice trick — but what does it get you?

Just for the sake of scale, we’ll generate data instead of gathering large files. We’ll generate in pandas with NumPy:

df = pd.DataFrame({"a": np.random.randn(int(1e8)), 'b': ['a', 'b'] * int(1e8 // 2)}).astype({'b': 'string'})

With data on hand to the tune of 8 zeroes of scale, we can perform an operation. We’ll use value_counts(), since this data is rather long, and the API is similar across pandas and Ibis.

First, with pandas:

%%time 
df.b.value_counts()

Out:
CPU times: user 1.73 s, sys: 29.3 ms, total: 1.76 s
Wall time: 1.75 s
b
a    50000000
b    50000000
Name: count, dtype: Int64

1.75 seconds — not a bad start, but you can feel it. Can DuckDB make it better?

%%time
t.b.value_counts().execute()

Out:
CPU times: user 2.26 s, sys: 114 ms, total: 2.37 s
Wall time: 1 s
b	b_count
0	a	50000000
1	b	50000000

Almost a 2x speedup, and using pandas datatypes! If you’re willing to take the extra step, you could just use the default object datatype, and DuckDB can offer even more. The pandas datatypes are useful if you stay in pandas, but it’s worth considering the alternative if your workflow often includes switches to other systems.

First, let’s regenerate, without the pandas string:

df = pd.DataFrame({"a": np.random.randn(int(1e8)), 'b': ['a', 'b'] * int(1e8 // 2)})

For pandas, performance stays effectively the same:

%%time 
df.b.value_counts()

Out:
CPU times: user 1.76 s, sys: 12.4 ms, total: 1.77 s
Wall time: 1.77 s
b
a    50000000
b    50000000
Name: count, dtype: int64

DuckDB, on the other hand, benefits massively, dropping to only 341 ms for execution:

%%time
t.b.value_counts().execute()

Out:
CPU times: user 1.64 s, sys: 20.6 ms, total: 1.66 s
Wall time: 341 ms
b	b_count
0	a	50000000
1	b	50000000

Either way, moving pandas Dataframes into DuckDB with Ibis is a quick and easy way to get more performant analytics. Though, one may ask: Why?

Why do this?

By leaving most of your code as is, but using Ibis memtable for analytics when pandas isn’t enough, you can reap the benefits of stronger database technologies without sacrificing your existing codebase. Making your analytics faster and more robust reduces the costs of execution while enabling new opportunities to find important insights. Keeping your current data workflow and extending it while staying in Python removes the cost of training for vastly different technologies, and enables continued use of existing code and resources. Having it all compounds into more insights for fewer hours and less money, and Ibis helps you get there.

Conclusion

With Ibis, you can bring your existing pandas-based code and keep using it — but never be locked in. You can save time and money by using your existing resources, and explore new horizons by piping your outputs into powerful database systems, from DuckDB to Snowflake.

Our very own Phillip Cloud showcased this very concept with the release of Ibis 3.2 in September 2022. If you prefer to learn by video, make sure to watch him at: https://www.youtube.com/watch?v=ECGUBW-Px6o.

We think constantly refactoring and rewriting hundreds of lines of code isn’t always the right approach. Targeting your efforts to get the most benefit for the least investment is the optimal way forward. Using standards like Apache Arrow and Substrait, and powerful tools like Ibis and ADBC, we work with enterprises to identify and overcome pain points by determining exactly what’s needed to make your stack excel. Learn how Voltron Data Enterprise Support can help you move forward, faster.

Photo by Jan Vasek