Apr 19, 2023
When Scale Matters, Don't Wait on pandas...
pandas: The Python Dataframe Library
pandas has been a critical part of a Pythonic life with data for a long time. With a popular API that truly sets a Python data-processing application apart from, say, C++, it’s understandable that pandas has been, and will remain, a good friend to many.
However, at scale, data outgrows pandas. Certainly, there are solutions that claim to scale pandas, but some of these are closed-source and come with a vendor lock-in risk. Open source software is a key component of a modern stack, and this is why we see Ibis as the premier dataframe API option for scale.
It’s tempting to stay with your current toolset and wait on adopting new technology. After all, your existing tools will probably get there eventually, and they’ve carried many programmers and data scientists far. But the world of Big Data moves quickly, and there may be a significant opportunity cost in waiting for tools to catch up to modern needs.
Don’t Wait on pandas…
…to Decouple API from Compute.
With most tools, the API and the compute engine come as a package: you get whatever execution options that tool ships with. Many mimic the pandas dataframe API but implement their own specific compute, which leaves you locked in wherever you go. Ibis is the only truly portable Python dataframe interface: it never ties you down, and it lets you move between several popular engines (ever heard of Trino?).
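To make the idea of decoupling concrete, here is a deliberately tiny sketch using only the Python standard library. It is not Ibis code and not how Ibis is implemented. The names `GroupSum`, `run_in_python`, `to_sql`, and `run_in_sqlite` are hypothetical, invented for this illustration. The point is only that one engine-neutral description of a query can be handed to two different engines.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSum:
    """Engine-neutral description of: group `table` by `key`, sum `value`."""
    table: str
    key: str
    value: str


def run_in_python(expr, tables):
    """Engine 1: evaluate the expression over in-memory rows (dicts)."""
    totals = {}
    for row in tables[expr.table]:
        group = row[expr.key]
        totals[group] = totals.get(group, 0) + row[expr.value]
    return sorted(totals.items())


def to_sql(expr):
    """Compile the same expression to SQL text for a database engine."""
    return (
        f'SELECT "{expr.key}", SUM("{expr.value}") FROM "{expr.table}" '
        f'GROUP BY "{expr.key}" ORDER BY "{expr.key}"'
    )


def run_in_sqlite(expr, con):
    """Engine 2: let SQLite do the computation."""
    return con.execute(to_sql(expr)).fetchall()


# Made-up sample data, loaded into both "engines".
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (:region, :amount)", rows)

# The expression is written once and knows nothing about either engine.
expr = GroupSum(table="sales", key="region", value="amount")

python_result = run_in_python(expr, {"sales": rows})
sqlite_result = run_in_sqlite(expr, con)
print(python_result == sqlite_result)
```

In Ibis, the table expression plays the role of `GroupSum` here, and the backend you connect to plays the role of the runner. Switching engines means changing the connection, not rewriting the query logic.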
…to be the Fastest Engine.
The classics are improving with time and getting better at handling larger data, especially with efforts to rely more heavily on Apache Arrow. But they aren’t there yet. Ibis already targets fast columnar engines, such as Snowflake. It even supports pandas as a backend.
…to Scale to Many Nodes.
pandas is a one-node tool, and very good at that! However, it’s specialized for local use, not for scaling. The efforts that try to make it scale, even Dask, are limited by design decisions in the pandas API. There are already columnar engines that scale right now (such as, again, Snowflake), and Ibis provides access to them!
Grow Right Now with Ibis
Ibis is ready today, which means you don’t need to wait to take advantage of the modern features you need. Whether your data is gigabyte-big, terabyte-big, or petabyte-big, scaling out across nodes and up to larger data isn’t the future, it’s the present. Ibis is flexible, and it is the front door to fast, scalable compute, without giving up on intuitive dataframe APIs.
All of this is achieved with open source, and behind Ibis is an active team that is driven by innovation and pushing new features forward. The progress happening daily on the Ibis project is evident and shows how powerful open source is when it comes to transparency, innovation, and collaboration.
Photo by Sam Poullain