Oct 06, 2022
Quick Wins: Accelerating pandas CSV Reading with Apache Arrow
Apache Arrow offers large performance gains and extends what you can solve on a given system. However, it can be intimidating to start using. In this series, we explore low-difficulty, high-impact routes to leverage Apache Arrow in your data analysis workflow– often with only a couple extra lines of code!
Previously, we explored how a couple lines of code can significantly increase CSV file intake performance in R. This time, we’ll do the same in Python with pandas and show the gains possible from including Apache Arrow in your software stack.
The Status Quo: Reading CSVs in Python with pandas
If you work with data in Python, there’s a good chance that pandas may come into the mix. Be it for exploration or full-on processing, pandas is a popular way to look at data – and data often enters pandas in the form of a CSV being read with its
read_csv() utility. In small cases, much like any file I/O, this has a negligible impact on overall runtime. However, interesting data is often large, and this means that reading with pandas can take a noticeable amount of time.
Instead of just saying that, though, let’s prove it. We’ll use the charmingly-named “100 million data” dataset, which includes trade statistics from Japan from 1988 to 2019. This data is mostly of interest here due to being 4.54 GB of CSV. For execution, I’m using a Mac M2 laptop with 16 GB of RAM. The following cell reads this CSV and times the process:
Of course, pandas is good at offering power in small commands – the code is quick and to the point. The time spent running, on the other hand? Thirty seconds isn’t long for a simulation, but when working interactively, or iteratively modifying and re-running a script, this can add up quickly. Certainly, in some cases, it is possible to shrink the problem by splitting up the CSV, but let’s say that we want all these records at once. Could we do better, without giving up simplicity?
The Supercharged Flow: Reading CSVs in pandas with Arrow
The standard pandas
read_csv() is row-major in nature and single-threaded regardless of system configuration. Arrow reads in a column-major form, which better leverages SIMD processors than row-major reading can and has a heavy focus on multi-threading.
But how do weget Apache Arrow to our pandas code?
Turns out, that work is already done! While experimental,pandas offers the ability to use Pyarrow to read a CSV, by specifying one argument. This lets us leverage the benefits of column-major reads, as well as multi-threading. The code looks like the following:
That certainly is just one string argument. What do we get for it?
A 5X speedup. While still noticeable, six seconds is much more tolerable than half a minute, and this difference is only more noticeable when moving to larger data – or running more than once, which happens often in data analysis.
With only 18 characters, and not even a manual import of Arrow, you can now read CSVs much faster. This much gain is nice, especially so when you can stay in your existing pandas-based process, and without even attempting to leverage the rest of what Arrow can do. Extending further can bring even larger gains, such as using Pyarrow directly to process larger-than-memory data!
However, fully leveraging Arrow isn’t the point of this blog – this is no deep dive. If you’d like to explore further, make sure to check out the Apache Arrow project. For more resources or hands-on support with Apache Arrow, learn more about Voltron Data Enterprise Support subscription offerings.
Photo by Brett Sayles