Creating an Apache Arrow Dataset

Françios Michonneau · Sep 13, 2022

This article originally appeared on François Michonneau’s website.

Getting Started with Apache Arrow

While getting started with Apache Arrow, I was intrigued by the variety of formats Arrow supports. Arrow tutorials tend to start with prepared datasets ready to be ingested by open_dataset(). I wanted to explore what it takes to create a dataset for analysis with Arrow and understand the respective benefits of the different file formats it supports.

Arrow can read in a variety of formats: parquet, arrow (also known as ipc and feather)1, and text-based formats like csv (as well as tsv). Additionally, Arrow provides tools to convert between these formats.

The possibility to import datasets in a variety of formats is helpful because you are less constrained by the type of data you can start your analysis on. However, if you are building a dataset from scratch, which one should you choose?

To try to answer this question, we will use the {arrow} R package to compare the amount of hard drive space these file formats use and the performance of a query in a multi-file dataset with these different formats. This is not a formal evaluation of the performance of Arrow or how best to optimize the partitioning of a dataset. Rather, it is a brief exploration of the tradeoffs that come with using the different datasets supported by Arrow. I also don’t explain the differences in the data structure of these different formats.

The Dataset

We will use data from http://cran-logs.rstudio.com/. This site gives you access to the log files for all hits to the CRAN2 mirror hosted by RStudio. For each day since October 1st, 2012, there is a compressed CSV file (file with the extension .csv.gz) that records the downloaded packages. Each row contains the date, the time, the name of the R package downloaded, the R version used, the architecture (32-bit or 64-bit), the operating system, the country inferred from the IP address, and a daily unique identifier assigned to each IP address. This website also has similar data for the daily downloads of R itself, but I will not use this data here.

For this exploration, we will limit ourselves to a couple of months of data which will provide enough data for our purpose. We will download the data for the period from June 1st, 2022 to August 15th, 2022.

Arrow reads data that is split across multiple files. So, you can point open_dataset() to a directory that contains all the files that make up your dataset. There is no need to loop over each file to build your dataset in memory. Splitting your datasets across multiple files can even make queries on your dataset faster, as only some of the files might need to be accessed to get the results needed. Depending on the type of queries you perform most often on your dataset, it is worth considering how best to partition your files to accelerate your analyses (but this is beyond the scope of this post). Here, the files are provided by date, and we will keep a time-based file organization.

We will use a Hive-style partitioning by year and month. We will have a directory for each year (there is only one year in our example) and, within it, a directory for each month. The directory is named according to the convention <variable_name>=<value>. So we will want to organize the files as illustrated below:

└── year=2022
    ├── month=6
    │   └── <data files>
    ├── month=7
    │   └── <data files>
    └── month=8
        └── <data files>

Import the Data as it is Provided

library(arrow)
library(tidyverse)
library(fs)
library(bench)

The open_dataset() function in the {arrow} package can directly read compressed CSV files3 (with the extension .csv.gz) as they are provided on the RStudio CRAN logs website.

As a first step, we can download the files from the site and organize them using the Hive-style directory structure as shown above.

## Check that the date is really a date,
## and extract the year and month from it
parse_date <- function(date) {
  stopifnot(
    "`date` must be a date" = inherits(date, "Date"),
    "provide only one date" = identical(length(date), 1L),
    "date must be in the past" = date < Sys.Date()
  )
  list(
    date_chr = as.character(date),
    year = as.POSIXlt(date)$year + 1900L, 
    month = as.POSIXlt(date)$mon + 1L
  )
}

## Download the data set for a given date from the RStudio CRAN log website.
## `date` is a single date for which we want the data
## `path` is where we want the data to live
download_daily_package_logs_csv <- function(date,
                                            path = "~/datasets/cran-logs-csv") {

  ## build the URL for the download
  date <- parse_date(date)
  url <- paste0(
    'http://cran-logs.rstudio.com/', date$year, '/', date$date_chr, '.csv.gz'
  )

  ## build the path for the destination of the download
  file <- file.path(
    path,
    paste0("year=", date$year),
    paste0("month=", date$month),
    paste0(date$date_chr, ".csv.gz")
  )

  ## create the folder if it doesn't exist
  if (!dir.exists(dirname(file))) {
    dir.create(dirname(file), recursive = TRUE)
  }

  ## download the file
  message("Downloading data for ", date$date_chr, " ... ", appendLF = FALSE)
  download.file(
    url = url,
    destfile = file,
    method = "libcurl",
    quiet = TRUE,
    mode = "wb"
  )
  message("done.")

  ## quick check to make sure that the file was created
  if (!file.exists(file)) {
    stop("Download failed for ", date$date_chr, call. = FALSE)
  }

  ## return the path
  file
}
## build sequence of dates for which we want the data
dates_to_get <- seq(
  as.Date("2022-06-01"),
  as.Date("2022-08-15"),
  by = "day"
)

## download the data
walk(dates_to_get, download_daily_package_logs_csv)

Let’s check the content of the folder that holds the data we downloaded:

~/datasets/cran-logs-csv/
└── year=2022
    ├── month=6
    │   ├── 2022-06-01.csv.gz
    │   ├── 2022-06-02.csv.gz
    │   ├── 2022-06-03.csv.gz
    │   ├── ...
    │   └── 2022-06-30.csv.gz
    ├── month=7
    │   ├── 2022-07-01.csv.gz
    │   ├── 2022-07-02.csv.gz
    │   ├── 2022-07-03.csv.gz
    │   ├── ...
    │   └── 2022-07-31.csv.gz
    └── month=8
        ├── 2022-08-01.csv.gz
        ├── 2022-08-02.csv.gz
        ├── 2022-08-03.csv.gz
        ├── ...
        └── 2022-08-15.csv.gz

We have one file for each day, placed in a folder corresponding to its month. We can now read this data using {arrow}’s open_dataset() function:

cran_logs_csv <- open_dataset(
  "~/datasets/cran-logs-csv/",
  format = "csv",
  partitioning = c("year", "month")
)
cran_logs_csv
date: date32[day]
time: time32[s]
size: int64
r_version: string
r_arch: string
r_os: string
package: string
version: string
country: string
ip_id: int64
year: int32
month: int32

The partitioning is taken into consideration as the output shows that the dataset contains the variables year and month, which are not part of the data we downloaded. They are coming from the way we organized the downloaded files.

Convert to Arrow and Parquet Files

Now that we have the compressed CSV files on disk, and that we opened the dataset with open_dataset(), we can convert it to the other file formats supported by Arrow using {arrow}’s write_dataset() function. We are going to convert our collection of .csv.gz files into the Arrow and Parquet formats.

## Convert the dataset into the Arrow format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-arrow",
  format = "arrow",
  partitioning = c("year", "month")
)

## Convert the dataset into the Parquet format
write_dataset(
  cran_logs_csv,
  path = "~/datasets/cran-logs-parquet",
  format = "parquet",
  partitioning = c("year", "month")
)

Let’s inspect the content of the directories that contain these datasets.

fs::dir_tree("~/datasets/cran-logs-arrow/")
~/datasets/cran-logs-arrow/
└── year=2022
    ├── month=6
    │   └── part-0.arrow
    ├── month=7
    │   └── part-0.arrow
    └── month=8
        └── part-0.arrow
fs::dir_tree("~/datasets/cran-logs-parquet/")
~/datasets/cran-logs-parquet/
└── year=2022
    ├── month=6
    │   └── part-0.parquet
    ├── month=7
    │   └── part-0.parquet
    └── month=8
        └── part-0.parquet

These two directories have the same layout organized by year and month as with our CSV files given that we kept the same partitioning. The files within the directories have an extension that matches their file format. One difference is that there is a single file for each month. We used the default values for write_dataset() and the number of rows for each month is smaller than the threshold this function uses to split the dataset into multiple files.

Comparison of the Different Formats

Let’s compare how much space these different file formats take on disk:

dataset_size <- function(path) {
  fs::dir_info(path, recurse = TRUE) %>%
    filter(type == "file") %>%
    pull(size) %>%
    sum()
}

tribble(
  ~ Format, ~ size,
  "Compressed CSV", dataset_size("~/datasets/cran-logs-csv/"),
  "Arrow", dataset_size("~/datasets/cran-logs-arrow/"),
  "Parquet", dataset_size("~/datasets/cran-logs-parquet/")
)
# A tibble: 3 × 2
  Format                size
  <chr>          <fs::bytes>
1 Compressed CSV       5.01G
2 Arrow               29.67G
3 Parquet              5.06G

The Arrow format takes the most space with almost 30GB while both the compressed CSV and the Parquet files use about 5GB of hard drive.

We are now set up to compare the performance of doing computation of these different dataset formats.

Let’s open these datasets with the different formats:

cran_logs_csv <- open_dataset("~/datasets/cran-logs-csv/", format = "csv")
cran_logs_arrow <- open_dataset("~/datasets/cran-logs-arrow/", format = "arrow")
cran_logs_parquet <- open_dataset("~/datasets/cran-logs-parquet/", format = "parquet")

We will compare how long it takes for Apache Arrow to compute the 10 most downloaded packages in the time period our dataset covers using each file format.

top_10_packages <- function(data) {
  data %>%
    count(package, sort = TRUE) %>%
    head(10) %>%
    mutate(n_million_downloads = n/1e6) %>%
    select( - n) %>% 
    collect()
}

bench::mark(
  top_10_packages(cran_logs_csv),
  top_10_packages(cran_logs_arrow),
  top_10_packages(cran_logs_parquet)
)
Warning: Some expressions had a GC in every iteration; so filtering is disabled.

# A tibble: 3 × 6
  expression                              min   median itr/se…¹ mem_al…² gc/se…³
  <bch:expr>                         <bch:tm> <bch:tm>    <dbl> <bch:by>   <dbl>
1 top_10_packages(cran_logs_csv)       29.57s   29.57s   0.0338   8.19MB   0    
2 top_10_packages(cran_logs_arrow)       2.1s     2.1s   0.475  165.39KB   0.475
3 top_10_packages(cran_logs_parquet)    3.32s    3.32s   0.301  137.11KB   0    
# ... with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`

While it takes about 4 seconds to perform this task on the Arrow or Parquet files, it takes more than 30 seconds to do it on the CSV files.

Conclusion

Having Arrow point directly to the folder of a compressed CSV file might be the most convenient, but it comes with a high-performance cost. Arrow and Parquet have similar performance, but the Parquet files take less space on disk and are more suitable for long-term storage. This is why large datasets like the NYC taxi data is distributed as a series of Parquet files.

If you would like to learn more about the different formats, check out the Arrow workshop (especially Part 3: Data Storage) that Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at useR!2022.

For more support, learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Apache Arrow.

Post Scriptum

Details
##Expand for Session Info

sessioninfo::session_info()
- Session info ───────────────────────────────────
  setting  value
  version  R version 4.2.1 (2022-06-23)
  os       Ubuntu 22.04.1 LTS
  system   x86_64, linux-gnu
  ui       X11
  language en_US
  collate  en_US.UTF-8
  ctype    en_US.UTF-8
  tz       Europe/Paris
  date     2022-08-19
  pandoc   NA (via rmarkdown)

- Packages ─────────────────────────────────────
  package       * version date (UTC) lib source
  arrow         * 9.0.0   2022-08-10 [1] CRAN (R 4.2.1)
  assertthat      0.2.1   2019-03-21 [1] RSPM
  backports       1.4.1   2021-12-13 [1] RSPM
  bench         * 1.1.2   2021-11-30 [1] RSPM
  bit             4.0.4   2020-08-04 [1] RSPM
  bit64           4.0.5   2020-08-30 [1] RSPM
  broom           1.0.0   2022-07-01 [1] RSPM
  cellranger      1.1.0   2016-07-27 [1] RSPM
  cli             3.3.0   2022-04-25 [1] RSPM (R 4.2.0)
  colorspace      2.0-3   2022-02-21 [1] RSPM
  crayon          1.5.1   2022-03-26 [1] RSPM
  DBI             1.1.3   2022-06-18 [1] RSPM
  dbplyr          2.2.1   2022-06-27 [1] RSPM
  digest          0.6.29  2021-12-01 [1] RSPM
  dplyr         * 1.0.9   2022-04-28 [1] RSPM
  ellipsis        0.3.2   2021-04-29 [1] RSPM
  evaluate        0.15    2022-02-18 [1] RSPM
  fansi           1.0.3   2022-03-24 [1] RSPM
  fastmap         1.1.0   2021-01-25 [1] RSPM
  forcats       * 0.5.1   2021-01-27 [1] RSPM
  fs            * 1.5.2   2021-12-08 [1] RSPM
  gargle          1.2.0   2021-07-02 [1] RSPM
  generics        0.1.3   2022-07-05 [1] RSPM
  ggplot2       * 3.3.6   2022-05-03 [1] RSPM
  glue            1.6.2   2022-02-24 [1] RSPM (R 4.2.0)
  googledrive     2.0.0   2021-07-08 [1] RSPM
  googlesheets4   1.0.0   2021-07-21 [1] RSPM
  gtable          0.3.0   2019-03-25 [1] RSPM
  haven           2.5.0   2022-04-15 [1] RSPM
  hms             1.1.1   2021-09-26 [1] RSPM
  htmltools       0.5.3   2022-07-18 [1] RSPM
  httr            1.4.3   2022-05-04 [1] RSPM
  jsonlite        1.8.0   2022-02-22 [1] RSPM
  knitr           1.39    2022-04-26 [1] RSPM
  lifecycle       1.0.1   2021-09-24 [1] RSPM
  lubridate       1.8.0   2021-10-07 [1] RSPM
  magrittr        2.0.3   2022-03-30 [1] RSPM
  modelr          0.1.8   2020-05-19 [1] RSPM
  munsell         0.5.0   2018-06-12 [1] RSPM
  pillar          1.8.0   2022-07-18 [1] RSPM
  pkgconfig       2.0.3   2019-09-22 [1] RSPM
  purrr         * 0.3.4   2020-04-17 [1] RSPM
  R6              2.5.1   2021-08-19 [1] RSPM
  readr         * 2.1.2   2022-01-30 [1] RSPM
  readxl          1.4.0   2022-03-28 [1] RSPM
  reprex          2.0.1   2021-08-05 [1] RSPM
  rlang           1.0.4   2022-07-12 [1] RSPM (R 4.2.0)
  rmarkdown       2.14    2022-04-25 [1] RSPM
  rvest           1.0.2   2021-10-16 [1] RSPM
  scales          1.2.0   2022-04-13 [1] RSPM
  sessioninfo     1.2.2   2021-12-06 [1] RSPM
  stringi         1.7.8   2022-07-11 [1] RSPM
  stringr       * 1.4.0   2019-02-10 [1] RSPM
  tibble        * 3.1.8   2022-07-22 [1] RSPM
  tidyr         * 1.2.0   2022-02-01 [1] RSPM
  tidyselect      1.1.2   2022-02-21 [1] RSPM
  tidyverse     * 1.3.2   2022-07-18 [1] RSPM
  tzdb            0.3.0   2022-03-28 [1] RSPM
  utf8            1.2.2   2021-07-24 [1] RSPM
  vctrs           0.4.1   2022-04-13 [1] RSPM
  withr           2.5.0   2022-03-03 [1] RSPM
  xfun            0.31    2022-05-10 [1] RSPM
  xml2            1.3.3   2021-11-30 [1] RSPM
  yaml            2.3.5   2022-02-21 [1] RSPM

 [1] /home/francois/.R-library
 [2] /usr/lib/R/library
 
 
  1. Feather was the first iteration of the file format (v1), the Arrow Interprocess Communication (IPC) file format is the newer version (v2) and has many new features.
  2. Comprehensive R Archive Network, the repository for the R packages
  3. since Arrow 9.0.0


Photo credit: Christopher Burns