May 11, 2023
Make pandas Faster with DuckDB
Phillip Cloud and Kae Suarez

pandas & DuckDB
pandas is a powerful, reliable tool. Many keep it at the front of their toolbox for its ease with file ingestion and data wrangling. It has become the de facto standard for how dataframes should look in Python, inspiring other libraries to adopt similar interfaces. However, in some intensive workflows, pandas can slow to a crawl or crash outright, especially at high data volumes.
DuckDB, on the other hand, is an efficient and flexible in-process database system. It is very fast in local execution and can handle datasets and queries far beyond pandas' capability. Without Ibis, though, DuckDB doesn't have a Pythonic DataFrame API, which is a large part of the appeal of using pandas in the first place. Furthermore, if you already have a workflow that relies on generating pandas DataFrames as your unit of data, refactoring to rely on DuckDB instead takes engineering hours, even with Ibis! Ibis is not a drop-in replacement for pandas, and refactoring can be a costly task.
What if we could marry these? Use pandas to its limits, then cross the finish line using DuckDB? Without ever refactoring the underlying code, only the last mile?
Finishing the Last Mile with Ibis
How do you use it?
The memtable API in Ibis provides a way to load pandas DataFrames quickly and easily into supported database systems, enabling users to upgrade performance the exact moment they need it. Here, we'll use DuckDB; it's the default, but we'll set it explicitly anyway.
What does quickly and easily mean, though? Well, first, let’s set up.
import pandas as pd
import numpy as np
import ibis

ibis.set_backend("duckdb")  # DuckDB is the default backend, but we set it explicitly
ibis.options.interactive = True  # evaluate and print expressions eagerly

df = pd.read_parquet("nyc_taxi/year=2020/month=1/part0.parquet")
df
Out:
vendor_name pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd dropoff_longitude ... fare_amount extra mta_tax tip_amount tolls_amount total_amount improvement_surcharge congestion_surcharge pickup_location_id dropoff_location_id
0 VTS 2020-01-01 09:40:30 2020-01-01 09:48:29 1.0 1.74 NaN NaN Standard rate No NaN ... 8.0 0.0 0.5 2.26 0.0 13.56 0.3 2.5 163 186
1 CMT 2020-01-01 09:15:31 2020-01-01 09:33:08 1.0 4.30 NaN NaN Standard rate No NaN ... 17.0 2.5 0.5 0.00 0.0 20.30 0.3 2.5 151 68
2 CMT 2020-01-01 09:53:53 2020-01-01 10:01:22 1.0 1.10 NaN NaN Standard rate Yes NaN ... 6.5 2.5 0.5 0.00 0.0 9.80 0.3 2.5 229 230
3 CMT 2020-01-01 09:11:33 2020-01-01 09:18:23 2.0 1.10 NaN NaN Standard rate No NaN ... 7.0 2.5 0.5 0.00 0.0 10.30 0.3 2.5 48 164
4 CMT 2020-01-01 09:30:43 2020-01-01 09:37:22 2.0 0.90 NaN NaN Standard rate No NaN ... 6.5 2.5 0.5 1.00 0.0 10.80 0.3 2.5 48 161
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6404791 VTS 2020-01-31 21:24:34 2020-01-31 21:35:48 2.0 1.55 NaN NaN Standard rate No NaN ... 9.0 0.5 0.5 0.00 0.0 12.80 0.3 2.5 239 24
6404792 VTS 2020-01-31 21:52:22 2020-01-31 22:50:49 1.0 17.11 NaN NaN Standard rate No NaN ... 53.0 0.5 0.5 4.00 0.0 60.80 0.3 2.5 161 22
6404793 VTS 2020-01-31 21:12:02 2020-01-31 21:20:07 1.0 0.97 NaN NaN Standard rate No NaN ... 7.0 0.5 0.5 1.08 0.0 11.88 0.3 2.5 170 170
6404794 VTS 2020-01-31 21:11:17 2020-01-31 21:20:10 1.0 1.73 NaN NaN Standard rate No NaN ... 8.0 0.5 0.5 1.00 0.0 12.80 0.3 2.5 229 263
6404795 VTS 2020-01-31 21:32:44 2020-01-31 22:08:54 1.0 5.77 NaN NaN Standard rate No NaN ... 26.5 0.5 0.5 4.00 0.0 34.30 0.3 2.5 237 223
6404796 rows × 22 columns
Those are our imports, configuration, and DataFrame read.
Now, how hard is it to get this into DuckDB with Ibis?
t = ibis.memtable(df)  # wrap the DataFrame as an Ibis table expression
t
Out:
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ vendor_name ┃ pickup_datetime ┃ dropoff_datetime ┃ passenger_count ┃ trip_distance ┃ pickup_longitude ┃ pickup_latitude ┃ rate_code ┃ store_and_fwd ┃ dropoff_longitude ┃ dropoff_latitude ┃ payment_type ┃ fare_amount ┃ extra ┃ mta_tax ┃ tip_amount ┃ tolls_amount ┃ total_amount ┃ improvement_surcharge ┃ congestion_surcharge ┃ pickup_location_id ┃ dropoff_location_id ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ string │ timestamp │ timestamp │ float64 │ float64 │ float64 │ float64 │ string │ string │ float64 │ float64 │ string │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ int64 │ int64 │
├─────────────┼─────────────────────┼─────────────────────┼─────────────────┼───────────────┼──────────────────┼─────────────────┼───────────────┼───────────────┼───────────────────┼──────────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼──────────────┼───────────────────────┼──────────────────────┼────────────────────┼─────────────────────┤
│ VTS │ 2020-01-01 09:40:30 │ 2020-01-01 09:48:29 │ 1.0 │ 1.74 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Credit card │ 8.0 │ 0.0 │ 0.5 │ 2.26 │ 0.00 │ 13.56 │ 0.3 │ 2.5 │ 163 │ 186 │
│ CMT │ 2020-01-01 09:15:31 │ 2020-01-01 09:33:08 │ 1.0 │ 4.30 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Cash │ 17.0 │ 2.5 │ 0.5 │ 0.00 │ 0.00 │ 20.30 │ 0.3 │ 2.5 │ 151 │ 68 │
│ CMT │ 2020-01-01 09:53:53 │ 2020-01-01 10:01:22 │ 1.0 │ 1.10 │ nan │ nan │ Standard rate │ Yes │ nan │ nan │ Cash │ 6.5 │ 2.5 │ 0.5 │ 0.00 │ 0.00 │ 9.80 │ 0.3 │ 2.5 │ 229 │ 230 │
│ CMT │ 2020-01-01 09:11:33 │ 2020-01-01 09:18:23 │ 2.0 │ 1.10 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Cash │ 7.0 │ 2.5 │ 0.5 │ 0.00 │ 0.00 │ 10.30 │ 0.3 │ 2.5 │ 48 │ 164 │
│ CMT │ 2020-01-01 09:30:43 │ 2020-01-01 09:37:22 │ 2.0 │ 0.90 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Credit card │ 6.5 │ 2.5 │ 0.5 │ 1.00 │ 0.00 │ 10.80 │ 0.3 │ 2.5 │ 48 │ 161 │
│ VTS │ 2020-01-01 08:56:11 │ 2020-01-01 09:18:25 │ 1.0 │ 15.35 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Cash │ 42.0 │ 0.0 │ 0.5 │ 0.00 │ 0.00 │ 42.80 │ 0.3 │ 0.0 │ 132 │ 150 │
│ VTS │ 2020-01-01 09:49:08 │ 2020-01-01 09:53:57 │ 1.0 │ 1.44 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Credit card │ 6.0 │ 0.0 │ 0.5 │ 1.40 │ 0.00 │ 10.70 │ 0.3 │ 2.5 │ 50 │ 50 │
│ CMT │ 2020-01-01 09:00:44 │ 2020-01-01 09:05:28 │ 2.0 │ 0.30 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Cash │ 3.5 │ 2.5 │ 0.5 │ 0.00 │ 0.00 │ 6.80 │ 0.3 │ 2.5 │ 50 │ 48 │
│ CMT │ 2020-01-01 09:06:31 │ 2020-01-01 09:28:44 │ 1.0 │ 7.70 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Cash │ 24.5 │ 2.5 │ 0.5 │ 0.00 │ 6.12 │ 33.92 │ 0.3 │ 2.5 │ 48 │ 80 │
│ CMT │ 2020-01-01 09:46:49 │ 2020-01-01 09:55:19 │ 1.0 │ 2.10 │ nan │ nan │ Standard rate │ No │ nan │ nan │ Credit card │ 9.0 │ 2.5 │ 0.5 │ 1.00 │ 0.00 │ 13.30 │ 0.3 │ 2.5 │ 68 │ 114 │
│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │
└─────────────┴─────────────────────┴─────────────────────┴─────────────────┴───────────────┴──────────────────┴─────────────────┴───────────────┴───────────────┴───────────────────┴──────────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴──────────────┴───────────────────────┴──────────────────────┴────────────────────┴─────────────────────┘
It's just as easy as passing in the DataFrame, and now you have an in-memory DuckDB database holding the data from the pandas DataFrame.
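Once the data is in a memtable, any Ibis expression can run against it on DuckDB. As a minimal sketch, here's a hypothetical aggregation over the taxi table above (the columns come from the data; avg_fare and trips are names we made up for this example):

expr = (
    t.group_by("payment_type")
    .aggregate(avg_fare=t.fare_amount.mean(), trips=t.count())
    .order_by(ibis.desc("trips"))
)
expr  # with interactive mode on, this executes on DuckDB and prints the result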
A nice trick — but what does it get you?
Just for the sake of scale, we'll generate data instead of gathering large files, using pandas with NumPy:
df = pd.DataFrame({"a": np.random.randn(int(1e8)), 'b': ['a', 'b'] * int(1e8 // 2)}).astype({'b': 'string'})
With data on hand to the tune of eight zeroes of scale, we can perform an operation. We'll use value_counts(), since this data is rather long and the API is similar across pandas and Ibis.
First, with pandas:
%%time
df.b.value_counts()
Out:
CPU times: user 1.73 s, sys: 29.3 ms, total: 1.76 s
Wall time: 1.75 s
b
a 50000000
b 50000000
Name: count, dtype: Int64
1.75 seconds — not a bad start, but you can feel it. Can DuckDB make it better?
%%time
t.b.value_counts().execute()
Out:
CPU times: user 2.26 s, sys: 114 ms, total: 2.37 s
Wall time: 1 s
b b_count
0 a 50000000
1 b 50000000
Almost a 2x speedup in wall time, and that's while using pandas extension datatypes! If you're willing to take the extra step of dropping back to the default object datatype, DuckDB can offer even more. The pandas datatypes are useful if you stay in pandas, but the alternative is worth considering if your workflow often hands data off to other systems.
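As an aside, you don't have to regenerate the data to make the switch; a sketch of converting an existing DataFrame in place, assuming the same column b as above:

df = df.astype({"b": object})  # drop the pandas string dtype back to object
t = ibis.memtable(df)          # rebuild the memtable from the converted DataFrame

For clarity, though, we'll regenerate from scratch here.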
Let's regenerate the data without the pandas string dtype:
df = pd.DataFrame({"a": np.random.randn(int(1e8)), 'b': ['a', 'b'] * int(1e8 // 2)})
For pandas, performance stays effectively the same:
%%time
df.b.value_counts()
Out:
CPU times: user 1.76 s, sys: 12.4 ms, total: 1.77 s
Wall time: 1.77 s
b
a 50000000
b 50000000
Name: count, dtype: int64
DuckDB, on the other hand, benefits massively, dropping to only 341 ms for execution:
%%time
t.b.value_counts().execute()
Out:
CPU times: user 1.64 s, sys: 20.6 ms, total: 1.66 s
Wall time: 341 ms
b b_count
0 a 50000000
1 b 50000000
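To close the loop, execute() returns a pandas DataFrame, so results drop straight back into downstream pandas code. A minimal sketch of the full round trip:

t = ibis.memtable(df)                    # hand the DataFrame to DuckDB
result = t.b.value_counts().execute()    # the query runs on DuckDB
assert isinstance(result, pd.DataFrame)  # and you're back in pandas land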
Either way, moving pandas DataFrames into DuckDB with Ibis is a quick and easy way to get more performant analytics. Though, one may ask: why?
Why do this?
By leaving most of your code as-is and reaching for an Ibis memtable when pandas isn't enough, you can reap the benefits of stronger database technologies without sacrificing your existing codebase. Making your analytics faster and more robust reduces the cost of execution while opening new opportunities to find important insights. Extending your current data workflow while staying in Python avoids the cost of training on vastly different technologies and keeps existing code and resources in use. All of this compounds into more insights for fewer hours and less money, and Ibis helps you get there.
Conclusion
With Ibis, you can bring your existing pandas-based code and keep using it — but never be locked in. You can save time and money by using your existing resources, and explore new horizons by piping your outputs into powerful database systems, from DuckDB to Snowflake.
Our very own Phillip Cloud showcased this concept with the release of Ibis 3.2 in September 2022. If you prefer to learn by video, watch him at: https://www.youtube.com/watch?v=ECGUBW-Px6o.
We think constantly refactoring and rewriting hundreds of lines of code isn’t always the right approach. Targeting your efforts to get the most benefit for the least investment is the optimal way forward. Using standards like Apache Arrow and Substrait, and powerful tools like Ibis and ADBC, we work with enterprises to identify and overcome pain points by determining exactly what’s needed to make your stack excel. Learn how Voltron Data Enterprise Support can help you move forward, faster.
Photo by Jan Vasek