Quick Wins: Accelerating pandas CSV Reading with Apache Arrow

Kae Suarez

October 6, 2022

Apache Arrow offers large performance gains and extends what you can solve on a given system. However, it can be intimidating to start using. In this series, we explore low-difficulty, high-impact routes to leverage Apache Arrow in your data analysis workflow, often with only a couple of extra lines of code!

Previously, we explored how a couple of lines of code can significantly increase CSV file intake performance in R. This time, we'll do the same in Python with pandas and show the gains possible from including Apache Arrow in your software stack.

The Status Quo: Reading CSVs in Python with pandas

If you work with data in Python, there's a good chance pandas comes into the mix. Be it for exploration or full-on processing, pandas is a popular way to look at data, and data often enters pandas as a CSV read with its read_csv() utility. In small cases, much like any file I/O, this has a negligible impact on overall runtime. However, interesting data is often large, which means that reading with pandas can take a noticeable amount of time.

Instead of just saying that, though, let's prove it. We'll use the charmingly named "100 million data" dataset, which includes trade statistics from Japan from 1988 to 2019. The data is of interest here mostly because it is 4.54 GB of CSV. For execution, I'm using an M2 Mac laptop with 16 GB of RAM. The following cell reads this CSV and times the process:

python
%%time
# assumes pandas has been imported beforehand: import pandas
pdf = pandas.read_csv("custom_1988_2020.csv", engine="c")

out:
CPU times: user 21.2 s, sys: 5.01 s, total: 26.2 s
Wall time: 29.1 s

Of course, pandas is good at offering power in small commands – the code is quick and to the point. The time spent running, on the other hand? Thirty seconds isn’t long for a simulation, but when working interactively, or iteratively modifying and re-running a script, this can add up quickly. Certainly, in some cases, it is possible to shrink the problem by splitting up the CSV, but let’s say that we want all these records at once. Could we do better, without giving up simplicity?

The Supercharged Flow: Reading CSVs in pandas with Arrow

The standard pandas read_csv() parser works row by row and is single-threaded regardless of system configuration. Arrow, by contrast, reads into a column-major form, which leverages SIMD hardware better than row-major reading can, and its CSV reader makes heavy use of multi-threading.
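
To make that concrete, here is a minimal sketch of taking the Arrow path directly, assuming the pyarrow package is installed and the same CSV file is on disk. It reads the file with PyArrow's own multi-threaded CSV reader and converts the resulting Arrow Table to a pandas DataFrame.

python
import pyarrow.csv as pacsv

# Arrow's CSV reader parses the file into columnar chunks using multiple threads.
table = pacsv.read_csv("custom_1988_2020.csv")

# The Arrow Table can then be handed to pandas when a DataFrame is needed.
df = table.to_pandas()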

But how do we get Apache Arrow into our pandas code?

Turns out, that work is already done! While still experimental, pandas offers the ability to use PyArrow to read a CSV by specifying one argument. This lets us leverage the benefits of column-major reads, as well as multi-threading. The code looks like the following:

python
%%time
# engine="pyarrow" requires the pyarrow package to be installed
adf = pandas.read_csv("custom_1988_2020.csv", engine="pyarrow")

That certainly is just one string argument. What do we get for it?

out:
CPU times: user 20.1 s, sys: 7.3 s, total: 27.4 s
Wall time: 6.13 s

A 5X speedup. Six seconds is still noticeable, but it is far more tolerable than half a minute, and the difference only grows when moving to larger data, or when running the read more than once, which happens often in data analysis.
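
As a side note, the %%time magic above is notebook-specific. For anyone who wants to reproduce the comparison from a plain Python script, here is a rough sketch using time.perf_counter(); the file name and engine values are the ones used above, and both pandas and pyarrow are assumed to be installed.

python
import time
import pandas

def timed_read(engine):
    # Time a single read of the CSV with the given pandas engine.
    start = time.perf_counter()
    pandas.read_csv("custom_1988_2020.csv", engine=engine)
    return time.perf_counter() - start

for engine in ("c", "pyarrow"):
    print(f"{engine}: {timed_read(engine):.2f} s")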

Conclusion

With only 18 characters, and not even a manual import of Arrow, you can now read CSVs much faster. That much gain is nice on its own, especially when you can stay in your existing pandas-based process without even attempting to leverage the rest of what Arrow can do. Extending further can bring even larger gains, such as using PyArrow directly to process larger-than-memory data, as sketched below.
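
As one illustration of that direction, here is a small sketch of streaming a CSV in batches with PyArrow rather than loading it all at once; it assumes pyarrow is installed and reuses the file from above. open_csv() returns a streaming reader that yields RecordBatches incrementally, so the whole file never has to be materialized in memory at once.

python
import pyarrow.csv as pacsv

# Open the CSV as an incremental, batch-at-a-time reader.
reader = pacsv.open_csv("custom_1988_2020.csv")

total_rows = 0
for batch in reader:
    # Each batch is a pyarrow.RecordBatch; process it and move on.
    total_rows += batch.num_rows

print(f"rows processed: {total_rows}")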

However, fully leveraging Arrow isn’t the point of this blog – this is no deep dive. If you’d like to explore further, make sure to check out the Apache Arrow project. For more resources or hands-on support with Apache Arrow, learn more about Voltron Data Enterprise Support subscription offerings.

Photo by Brett Sayles
