Sep 08, 2022

Quick Wins: Reading CSVs in R with Apache Arrow

Keith Britt

[Image: highway time-lapse]

Apache Arrow is an admittedly complex software ecosystem that can take some time to wrap your mind around. But that learning and labor is worth it: once you understand Apache Arrow, you can use it to have a transformative impact on your data transfer and analysis. The performance gains can even make some problems that are currently intractable to you suddenly… well… tractable.

However, Apache Arrow also offers some extremely beneficial features that require no background knowledge and very little effort. These features can make your life as a data analyst much easier and less frustrating, and let you get to the data you need with far less overhead. Below we’ll detail one such example in R that you can replicate over and over for very large publicly available datasets.

Reading CSVs in R

For this blog, we’ll be using the Google Mobility Global Data Report, which was developed in response to the COVID-19 pandemic. This dataset is a comma-separated values (CSV) file. The version we’ll use in these examples is 1.1 GB, was produced on August 11, 2022, and contains 10,925,814 rows of data spanning 15 columns, for a total of 163,887,210 data elements.

Opening a CSV in R is easy and straightforward: all it takes is one line of code, and no non-native libraries are needed. Below is an example.

StandardMobilityDF <- read.csv("Global_Mobility_Report.csv")

But this process makes a row-major assumption, meaning each record/observation is read sequentially with no regard for the data types or structure of the fields in each record. Below is the output of an Rprof analysis of this operation running on a Linux-based laptop using a CPU with 6 cores and a GPU with 20 cores. While these aren’t catastrophic times, this is a relatively small dataset compared to some other publicly available datasets, so a 19+ second runtime could quickly become onerous and even impossible as the number of rows scales into the billions.
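For reference, a profile like the one below can be captured with R’s built-in Rprof facility. This is just a minimal sketch: the output filename is arbitrary, and the 0.02-second sampling interval matches the $sample.interval value reported in the output.

# Start sampling, run the read, stop sampling, then summarize
Rprof("read_csv_base.out", interval = 0.02)
StandardMobilityDF <- read.csv("Global_Mobility_Report.csv")
Rprof(NULL)
summaryRprof("read_csv_base.out")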

$by.self
             self.time self.pct total.time total.pct
"scan"           17.08    89.14      17.08     89.14
".External2"      2.08    10.86       2.08     10.86

$by.total
                       total.time total.pct self.time self.pct
"read.csv"                  19.16    100.00      0.00     0.00
"read.table"                19.16    100.00      0.00     0.00
"scan"                      17.08     89.14     17.08    89.14
".External2"                 2.08     10.86      2.08    10.86
"type.convert.default"       2.08     10.86      0.00     0.00
"type.convert"               2.08     10.86      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 19.16
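As a quick sanity check, the loaded dataframe should match the dimensions quoted above for this version of the file:

dim(StandardMobilityDF)
# [1] 10925814       15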

Reading CSVs in R with Apache Arrow

Now, if we leverage Arrow to read the CSV column by column instead of row by row (allowing us to take advantage of modern processors’ SIMD capabilities), there are three simple steps. First, install the arrow package (sample code below). This can take a few minutes, but it only needs to be done once, so its cost amortizes toward zero over time. Second, load the library before using it (sample code below). This must be repeated in every R session, but the processing cost is nominal once the install is complete.

install.packages("arrow")
library(arrow, warn.conflicts = FALSE)

Having installed and loaded the Arrow R library, our final step is to use the read_csv_arrow function in place of the native read.csv function as demonstrated below.

ArrowMobilityDF <- read_csv_arrow("Global_Mobility_Report.csv")
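(If you just want a rough wall-clock comparison before diving into profiler output, base R’s system.time works with either reader. This is only a sketch; your timings will vary with hardware.)

system.time(read.csv("Global_Mobility_Report.csv"))
system.time(read_csv_arrow("Global_Mobility_Report.csv"))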

So what do you get in return for these two extra lines of code and using a different read function? Performance-wise, Rprof shows the entire process takes 2.16 seconds. That’s a mere 11.27% of the time it took to load the exact same CSV file into a dataframe using the native CSV read function.

$by.self
                          self.time self.pct total.time total.pct
"csv___TableReader__Read"      1.94    89.81       1.94     89.81
"Table__to_dataframe"          0.20     9.26       0.20      9.26
"io___InputStream__Close"      0.02     0.93       0.02      0.93

$by.total
                              total.time total.pct self.time self.pct
""                                  2.16    100.00      0.00     0.00
"eval.parent"                       2.16    100.00      0.00     0.00
"eval"                              2.16    100.00      0.00     0.00
"read_csv_arrow"                    2.16    100.00      0.00     0.00
"csv___TableReader__Read"           1.94     89.81      1.94    89.81
"doTryCatch"                        1.94     89.81      0.00     0.00
"reader$Read"                       1.94     89.81      0.00     0.00
"tryCatch"                          1.94     89.81      0.00     0.00
"tryCatchList"                      1.94     89.81      0.00     0.00
"tryCatchOne"                       1.94     89.81      0.00     0.00
"Table__to_dataframe"               0.20      9.26      0.20     9.26
"as.data.frame.ArrowTabular"        0.20      9.26      0.00     0.00
"as.data.frame"                     0.20      9.26      0.00     0.00
"x$to_data_frame"                   0.20      9.26      0.00     0.00
"io___InputStream__Close"           0.02      0.93      0.02     0.93
"file$close"                        0.02      0.93      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.16
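As an aside, the Table__to_dataframe entry in the profile above is the cost of converting Arrow’s columnar Table into an R dataframe. If your workflow can operate on the Arrow Table directly, read_csv_arrow’s as_data_frame argument lets you skip that conversion entirely. A minimal sketch:

# Keep the data as an Arrow Table instead of converting to a dataframe
ArrowMobilityTable <- read_csv_arrow("Global_Mobility_Report.csv",
                                     as_data_frame = FALSE)
ArrowMobilityTable$num_rows  # row count, without materializing an R dataframe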

Yes, It Really Was That Easy

Two extra lines of code and a function-name change, and now you’re able to read moderately sized CSVs much faster. In some cases, CSVs that may have been impossible to put in a dataframe before (because they contain billions and billions of rows) are now accessible to you. Notice that we didn’t even have to learn what Arrow is or how it creates these wonderful performance upgrades. We also didn’t come close to maximizing the power of Arrow (for instance, we didn’t take advantage of the superior GPU processing efficiency that Arrow can provide).

Showing how advantageous Arrow can be, even with a very low barrier to use, was the aim of this article. But Apache Arrow is so much more: it will make ETL processes more efficient, and it opens up a world of previously inaccessible data analysis opportunities. If you’d like to dig in further, check out the open-source Apache Arrow project or explore what a Voltron Data Enterprise Support subscription can do for you.



Photo credit: Jonas Von Werne