Sep 13, 2022
Creating an Arrow Dataset
François Michonneau

This article originally appeared on François Michonneau’s website.
Getting Started with Apache Arrow
While getting started with Apache Arrow, I was intrigued by the variety of formats Arrow supports. Arrow tutorials tend to start with prepared datasets ready to be ingested by open_dataset(). I wanted to explore what it takes to create a dataset for analysis with Arrow and understand the respective benefits of the different file formats it supports.
Arrow can read in a variety of formats: parquet, arrow (also known as ipc and feather) [1], and text-based formats like csv (as well as tsv). Additionally, Arrow provides tools to convert between these formats.
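For single files, the {arrow} package provides a reader and a writer for each of these formats. A minimal sketch (the file names below are hypothetical):
library(arrow)
## Read a CSV file (gzip-compressed .csv.gz files work too)
df <- read_csv_arrow("downloads.csv")
## Write (and read back) the same data as Parquet and as Arrow/IPC
write_parquet(df, "downloads.parquet")
write_feather(df, "downloads.arrow")  ## Feather v2 is the Arrow IPC format
df_parquet <- read_parquet("downloads.parquet")
df_arrow <- read_feather("downloads.arrow")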
Being able to import datasets in a variety of formats is helpful because you are less constrained by the format your data comes in. However, if you are building a dataset from scratch, which format should you choose?
To try to answer this question, we will use the {arrow} R package to compare the amount of hard drive space these file formats use and the performance of a query on a multi-file dataset in each of these formats. This is not a formal evaluation of the performance of Arrow or of how best to optimize the partitioning of a dataset. Rather, it is a brief exploration of the tradeoffs that come with using the different formats Arrow supports. I also don't explain the differences in the data structures behind these formats.
The Dataset
We will use data from http://cran-logs.rstudio.com/. This site gives you access to the log files for all hits to the CRAN [2] mirror hosted by RStudio. For each day since October 1st, 2012, there is a compressed CSV file (a file with the extension .csv.gz) that records the downloaded packages. Each row contains the date, the time, the name of the R package downloaded, the R version used, the architecture (32-bit or 64-bit), the operating system, the country inferred from the IP address, and a daily unique identifier assigned to each IP address. This website also has similar data for the daily downloads of R itself, but I will not use this data here.
For this exploration, we will limit ourselves to a couple of months of data, which is enough for our purpose. We will download the data for the period from June 1st, 2022 to August 15th, 2022.
Arrow can read data that is split across multiple files: you can point open_dataset() to a directory that contains all the files that make up your dataset, with no need to loop over each file to build the dataset in memory. Splitting your dataset across multiple files can even make queries faster, as only some of the files might need to be accessed to get the results you need. Depending on the type of queries you perform most often, it is worth considering how best to partition your files to accelerate your analyses (though this is beyond the scope of this post). Here, the files are provided by date, and we will keep a time-based file organization.
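To make the benefit of partitioning concrete: a query that filters on a partition variable only has to read the files in the matching directories. A minimal sketch, using the Parquet dataset we build later in this post:
## Only the files under month=7 need to be scanned here,
## because `month` is one of the partition variables
cran_logs_parquet %>%
  filter(month == 7) %>%
  count(package, sort = TRUE) %>%
  collect()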
We will use a Hive-style partitioning by year and month. We will have a directory for each year (there is only one year in our example) and, within it, a directory for each month. Each directory is named according to the convention <variable_name>=<value>. So we will want to organize the files as illustrated below:
└── year=2022
├── month=6
│ └── <data files>
├── month=7
│ └── <data files>
└── month=8
└── <data files>
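As an aside, when a dataset follows this <variable_name>=<value> convention, open_dataset() detects the partition variables by default; you can also declare them explicitly with hive_partition() to control their types. A sketch, assuming the layout above:
ds <- open_dataset(
  "~/datasets/cran-logs-csv",
  format = "csv",
  partitioning = hive_partition(year = int32(), month = int32())
)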
Import the Data as It Is Provided
library(arrow)
library(tidyverse)
library(fs)
library(bench)
The open_dataset() function in the {arrow} package can directly read compressed CSV files [3] (with the extension .csv.gz) as they are provided on the RStudio CRAN logs website.
As a first step, we can download the files from the site and organize them using the Hive-style directory structure as shown above.
## Check that the date is really a date,
## and extract the year and month from it
parse_date <- function(date) {
  stopifnot(
    "`date` must be a date" = inherits(date, "Date"),
    "provide only one date" = identical(length(date), 1L),
    "date must be in the past" = date < Sys.Date()
  )
  list(
    date_chr = as.character(date),
    year = as.POSIXlt(date)$year + 1900L,
    month = as.POSIXlt(date)$mon + 1L
  )
}
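For instance, calling this helper on a single date returns the pieces we need to build the download URL and the destination path (output shown as comments):
parse_date(as.Date("2022-06-01"))
## $date_chr
## [1] "2022-06-01"
##
## $year
## [1] 2022
##
## $month
## [1] 6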
## Download the data set for a given date from the RStudio CRAN log website.
## `date` is a single date for which we want the data
## `path` is where we want the data to live
download_daily_package_logs_csv <- function(date,
                                            path = "~/datasets/cran-logs-csv") {
  ## build the URL for the download
  date <- parse_date(date)
  url <- paste0(
    'http://cran-logs.rstudio.com/', date$year, '/', date$date_chr, '.csv.gz'
  )
  ## build the path for the destination of the download
  file <- file.path(
    path,
    paste0("year=", date$year),
    paste0("month=", date$month),
    paste0(date$date_chr, ".csv.gz")
  )
  ## create the folder if it doesn't exist
  if (!dir.exists(dirname(file))) {
    dir.create(dirname(file), recursive = TRUE)
  }
  ## download the file
  message("Downloading data for ", date$date_chr, " ... ", appendLF = FALSE)
  download.file(
    url = url,
    destfile = file,
    method = "libcurl",
    quiet = TRUE,
    mode = "wb"
  )
  message("done.")
  ## quick check to make sure that the file was created
  if (!file.exists(file)) {
    stop("Download failed for ", date$date_chr, call. = FALSE)
  }
  ## return the path
  file
}
## build sequence of dates for which we want the data
dates_to_get <- seq(
  as.Date("2022-06-01"),
  as.Date("2022-08-15"),
  by = "day"
)
## download the data
walk(dates_to_get, download_daily_package_logs_csv)
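Downloads over HTTP can fail intermittently. One way to make this step more robust, sketched below with purrr's possibly() (the safe_download and missing_days names are hypothetical), is to let a failed day return NA instead of stopping the loop, and then re-try only the missing days:
## Wrap the helper so a failed download yields NA instead of an error
safe_download <- possibly(download_daily_package_logs_csv,
                          otherwise = NA_character_)
downloaded <- map_chr(dates_to_get, safe_download)
## Re-try only the days that failed, if any
missing_days <- dates_to_get[is.na(downloaded)]
walk(missing_days, download_daily_package_logs_csv)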
Let’s check the content of the folder that holds the data we downloaded:
~/datasets/cran-logs-csv/
└── year=2022
├── month=6
│ ├── 2022-06-01.csv.gz
│ ├── 2022-06-02.csv.gz
│ ├── 2022-06-03.csv.gz
│ ├── ...
│ └── 2022-06-30.csv.gz
├── month=7
│ ├── 2022-07-01.csv.gz
│ ├── 2022-07-02.csv.gz
│ ├── 2022-07-03.csv.gz
│ ├── ...
│ └── 2022-07-31.csv.gz
└── month=8
├── 2022-08-01.csv.gz
├── 2022-08-02.csv.gz
├── 2022-08-03.csv.gz
├── ...
└── 2022-08-15.csv.gz
We have one file for each day, placed in a folder corresponding to its month. We can now read this data using {arrow}'s open_dataset() function:
cran_logs_csv <- open_dataset(
  "~/datasets/cran-logs-csv/",
  format = "csv",
  partitioning = c("year", "month")
)
cran_logs_csv
date: date32[day]
time: time32[s]
size: int64
r_version: string
r_arch: string
r_os: string
package: string
version: string
country: string
ip_id: int64
year: int32
month: int32
The partitioning is taken into account: the output shows that the dataset contains the variables year and month, which are not part of the data we downloaded. They come from the way we organized the files on disk.
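At this point we can already query the dataset with dplyr verbs; nothing is read into memory until collect() is called. A quick sketch to peek at a few rows:
cran_logs_csv %>%
  head(5) %>%
  collect()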
Convert to Arrow and Parquet Files
Now that we have the compressed CSV files on disk and have opened the dataset with open_dataset(), we can convert it to the other file formats supported by Arrow using {arrow}'s write_dataset() function. We are going to convert our collection of .csv.gz files into the Arrow and Parquet formats.
## Convert the dataset into the Arrow format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-arrow",
  format = "arrow",
  partitioning = c("year", "month")
)
## Convert the dataset into the Parquet format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet",
  format = "parquet",
  partitioning = c("year", "month")
)
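write_dataset() also accepts additional format-specific arguments. For example, with Parquet you can choose the compression codec; a sketch (the output path is hypothetical, and codec availability depends on how your Arrow build was compiled):
## Same conversion, with an explicit Parquet compression codec
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet-zstd",
  format = "parquet",
  partitioning = c("year", "month"),
  compression = "zstd"
)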
Let’s inspect the content of the directories that contain these datasets.
fs::dir_tree("~/datasets/cran-logs-arrow/")
~/datasets/cran-logs-arrow/
└── year=2022
├── month=6
│ └── part-0.arrow
├── month=7
│ └── part-0.arrow
└── month=8
└── part-0.arrow
fs::dir_tree("~/datasets/cran-logs-parquet/")
~/datasets/cran-logs-parquet/
└── year=2022
├── month=6
│ └── part-0.parquet
├── month=7
│ └── part-0.parquet
└── month=8
└── part-0.parquet
These two directories have the same layout, organized by year and month, as our CSV files, since we kept the same partitioning. The files within the directories have an extension that matches their format. One difference is that there is a single file for each month: we used the default values for write_dataset(), and the number of rows for each month is smaller than the threshold this function uses to split the dataset into multiple files.
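If you did want several smaller files per partition, recent versions of write_dataset() expose this threshold directly; a sketch using the max_rows_per_file argument (the path and the cap are arbitrary choices):
## Cap each output file at roughly 5 million rows
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet-split",
  format = "parquet",
  partitioning = c("year", "month"),
  max_rows_per_file = 5e6
)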
Comparison of the Different Formats
Let’s compare how much space these different file formats take on disk:
dataset_size <- function(path) {
  fs::dir_info(path, recurse = TRUE) %>%
    filter(type == "file") %>%
    pull(size) %>%
    sum()
}
tribble(
  ~ Format, ~ size,
  "Compressed CSV", dataset_size("~/datasets/cran-logs-csv/"),
  "Arrow", dataset_size("~/datasets/cran-logs-arrow/"),
  "Parquet", dataset_size("~/datasets/cran-logs-parquet/")
)
# A tibble: 3 × 2
Format size
<chr> <fs::bytes>
1 Compressed CSV 5.01G
2 Arrow 29.67G
3 Parquet 5.06G
The Arrow format takes the most space, at almost 30 GB, while both the compressed CSV and the Parquet files use about 5 GB of disk space.
We are now set up to compare the performance of running a query against these different dataset formats.
Let’s open these datasets with the different formats:
cran_logs_csv <- open_dataset("~/datasets/cran-logs-csv/", format = "csv")
cran_logs_arrow <- open_dataset("~/datasets/cran-logs-arrow/", format = "arrow")
cran_logs_parquet <- open_dataset("~/datasets/cran-logs-parquet/", format = "parquet")
We will compare how long it takes Apache Arrow to compute the 10 most downloaded packages over the time period our dataset covers, using each file format.
top_10_packages <- function(data) {
  data %>%
    count(package, sort = TRUE) %>%
    head(10) %>%
    mutate(n_million_downloads = n / 1e6) %>%
    select(-n) %>%
    collect()
}
bench::mark(
  top_10_packages(cran_logs_csv),
  top_10_packages(cran_logs_arrow),
  top_10_packages(cran_logs_parquet)
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 3 × 6
expression min median itr/se…¹ mem_al…² gc/se…³
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by> <dbl>
1 top_10_packages(cran_logs_csv) 29.57s 29.57s 0.0338 8.19MB 0
2 top_10_packages(cran_logs_arrow) 2.1s 2.1s 0.475 165.39KB 0.475
3 top_10_packages(cran_logs_parquet) 3.32s 3.32s 0.301 137.11KB 0
# ... with abbreviated variable names ¹`itr/sec`, ²mem_alloc, ³`gc/sec`
While it takes only a few seconds (roughly 2 and 3 seconds, respectively) to perform this task on the Arrow and Parquet files, it takes nearly 30 seconds on the compressed CSV files.
Conclusion
Having Arrow point directly at a folder of compressed CSV files might be the most convenient option, but it comes with a high performance cost. The Arrow and Parquet formats have similar query performance, but the Parquet files take less space on disk and are better suited for long-term storage. This is why large datasets like the NYC taxi data are distributed as a series of Parquet files.
If you would like to learn more about the different formats, check out the Arrow workshop (especially Part 3: Data Storage) that Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at useR!2022.
For more support, learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Apache Arrow.
Post Scriptum
sessioninfo::session_info()
- Session info ───────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 22.04.1 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Paris
date 2022-08-19
pandoc NA (via rmarkdown)
- Packages ─────────────────────────────────────
package * version date (UTC) lib source
arrow * 9.0.0 2022-08-10 [1] CRAN (R 4.2.1)
assertthat 0.2.1 2019-03-21 [1] RSPM
backports 1.4.1 2021-12-13 [1] RSPM
bench * 1.1.2 2021-11-30 [1] RSPM
bit 4.0.4 2020-08-04 [1] RSPM
bit64 4.0.5 2020-08-30 [1] RSPM
broom 1.0.0 2022-07-01 [1] RSPM
cellranger 1.1.0 2016-07-27 [1] RSPM
cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] RSPM
crayon 1.5.1 2022-03-26 [1] RSPM
DBI 1.1.3 2022-06-18 [1] RSPM
dbplyr 2.2.1 2022-06-27 [1] RSPM
digest 0.6.29 2021-12-01 [1] RSPM
dplyr * 1.0.9 2022-04-28 [1] RSPM
ellipsis 0.3.2 2021-04-29 [1] RSPM
evaluate 0.15 2022-02-18 [1] RSPM
fansi 1.0.3 2022-03-24 [1] RSPM
fastmap 1.1.0 2021-01-25 [1] RSPM
forcats * 0.5.1 2021-01-27 [1] RSPM
fs * 1.5.2 2021-12-08 [1] RSPM
gargle 1.2.0 2021-07-02 [1] RSPM
generics 0.1.3 2022-07-05 [1] RSPM
ggplot2 * 3.3.6 2022-05-03 [1] RSPM
glue 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
googledrive 2.0.0 2021-07-08 [1] RSPM
googlesheets4 1.0.0 2021-07-21 [1] RSPM
gtable 0.3.0 2019-03-25 [1] RSPM
haven 2.5.0 2022-04-15 [1] RSPM
hms 1.1.1 2021-09-26 [1] RSPM
htmltools 0.5.3 2022-07-18 [1] RSPM
httr 1.4.3 2022-05-04 [1] RSPM
jsonlite 1.8.0 2022-02-22 [1] RSPM
knitr 1.39 2022-04-26 [1] RSPM
lifecycle 1.0.1 2021-09-24 [1] RSPM
lubridate 1.8.0 2021-10-07 [1] RSPM
magrittr 2.0.3 2022-03-30 [1] RSPM
modelr 0.1.8 2020-05-19 [1] RSPM
munsell 0.5.0 2018-06-12 [1] RSPM
pillar 1.8.0 2022-07-18 [1] RSPM
pkgconfig 2.0.3 2019-09-22 [1] RSPM
purrr * 0.3.4 2020-04-17 [1] RSPM
R6 2.5.1 2021-08-19 [1] RSPM
readr * 2.1.2 2022-01-30 [1] RSPM
readxl 1.4.0 2022-03-28 [1] RSPM
reprex 2.0.1 2021-08-05 [1] RSPM
rlang 1.0.4 2022-07-12 [1] RSPM (R 4.2.0)
rmarkdown 2.14 2022-04-25 [1] RSPM
rvest 1.0.2 2021-10-16 [1] RSPM
scales 1.2.0 2022-04-13 [1] RSPM
sessioninfo 1.2.2 2021-12-06 [1] RSPM
stringi 1.7.8 2022-07-11 [1] RSPM
stringr * 1.4.0 2019-02-10 [1] RSPM
tibble * 3.1.8 2022-07-22 [1] RSPM
tidyr * 1.2.0 2022-02-01 [1] RSPM
tidyselect 1.1.2 2022-02-21 [1] RSPM
tidyverse * 1.3.2 2022-07-18 [1] RSPM
tzdb 0.3.0 2022-03-28 [1] RSPM
utf8 1.2.2 2021-07-24 [1] RSPM
vctrs 0.4.1 2022-04-13 [1] RSPM
withr 2.5.0 2022-03-03 [1] RSPM
xfun 0.31 2022-05-10 [1] RSPM
xml2 1.3.3 2021-11-30 [1] RSPM
yaml 2.3.5 2022-02-21 [1] RSPM
[1] /home/francois/.R-library
[2] /usr/lib/R/library
[1] Feather was the first iteration of the file format (v1); the Arrow Interprocess Communication (IPC) file format is the newer version (v2) and has many new features.
[2] Comprehensive R Archive Network, the repository for R packages.
[3] Since Arrow 9.0.0.
Photo credit: Christopher Burns