While getting started with Apache Arrow, I was intrigued by the variety of formats Arrow supports. Arrow tutorials tend to start with prepared datasets ready to be ingested by open_dataset(). I wanted to explore what it takes to create a dataset for analysis with Arrow and understand the respective benefits of the different file formats it supports.
Arrow can read a variety of formats: Parquet, Arrow (also known as IPC and Feather)[1], and text-based formats such as CSV and TSV. Additionally, Arrow provides tools to convert between these formats.
Being able to import datasets in a variety of formats is helpful because you are less constrained by the form in which your data arrives. However, if you are building a dataset from scratch, which format should you choose?
To try to answer this question, we will use the {arrow} R package to compare the amount of hard drive space these file formats use and the performance of a query on a multi-file dataset in each format. This is not a formal evaluation of Arrow's performance or of how best to partition a dataset. Rather, it is a brief exploration of the tradeoffs that come with the different file formats Arrow supports. I also won't cover the differences in the internal data structures of these formats.
The Dataset
We will use data from http://cran-logs.rstudio.com/. This site gives you access to the log files for all hits to the CRAN[2] mirror hosted by RStudio. For each day since October 1st, 2012, there is a compressed CSV file (with the extension .csv.gz) that records the packages downloaded that day. Each row contains the date, the time, the name of the R package downloaded, the R version used, the architecture (32-bit or 64-bit), the operating system, the country inferred from the IP address, and a daily unique identifier assigned to each IP address. This website also has similar data for the daily downloads of R itself, but I will not use that data here.
For this exploration, we will limit ourselves to a couple of months of data, which is plenty for our purpose. We will download the data for the period from June 1st, 2022 to August 15th, 2022.
Arrow reads data that is split across multiple files. So, you can point open_dataset() to a directory that contains all the files that make up your dataset. There is no need to loop over each file to build your dataset in memory. Splitting your datasets across multiple files can even make queries on your dataset faster, as only some of the files might need to be accessed to get the results needed. Depending on the type of queries you perform most often on your dataset, it is worth considering how best to partition your files to accelerate your analyses (but this is beyond the scope of this post). Here, the files are provided by date, and we will keep a time-based file organization.
We will use Hive-style partitioning by year and month: a directory for each year (there is only one year in our example) and, within it, a directory for each month. Each directory is named following the convention <variable_name>=<value>. We will therefore organize the files as illustrated below:
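A sketch of the layout we are aiming for (the data/csv root is a name chosen for this post):

```
data/csv/
└── year=2022/
    ├── month=6/
    │   ├── 2022-06-01.csv.gz
    │   ├── 2022-06-02.csv.gz
    │   └── ...
    ├── month=7/
    │   └── ...
    └── month=8/
        └── ...
```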
Import the Data as it is Provided
The open_dataset() function in the {arrow} package can directly read compressed CSV files[3] (with the extension .csv.gz) as they are provided on the RStudio CRAN logs website.
As a first step, we can download the files from the site and organize them using the Hive-style directory structure as shown above.
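Here is a minimal sketch of one way to do this; the data/csv destination is the hypothetical root chosen above, and the URL pattern follows how cran-logs.rstudio.com names its daily files:

```r
library(fs)

# One log file per day for our period of interest
dates <- seq(as.Date("2022-06-01"), as.Date("2022-08-15"), by = "day")

# cran-logs.rstudio.com serves one .csv.gz file per day, grouped by year
urls <- sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz",
                format(dates, "%Y"), dates)

# Hive-style destination: data/csv/year=<year>/month=<month>/<date>.csv.gz
dest <- sprintf("data/csv/year=%s/month=%s/%s.csv.gz",
                format(dates, "%Y"),
                as.integer(format(dates, "%m")),
                dates)

dir_create(unique(path_dir(dest)))
Map(function(url, file) download.file(url, file, mode = "wb"), urls, dest)
```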
Let’s check the content of the folder that holds the data we downloaded:
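For example, with the {fs} package (assuming the data/csv root used above):

```r
fs::dir_tree("data/csv")
```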
We have one file for each day, placed in a folder corresponding to its month. We can now read this data using {arrow}’s open_dataset() function:
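A sketch, assuming the layout above; open_dataset() detects the Hive-style partitions from the directory names and the gzip compression from the .csv.gz extension:

```r
library(arrow)

logs_csv <- open_dataset("data/csv", format = "csv")
logs_csv
```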
The partitioning is taken into account: the output shows that the dataset contains the variables year and month, which are not part of the files we downloaded. They come from the way we organized the files on disk.
Convert to Arrow and Parquet Files
Now that we have the compressed CSV files on disk and have opened them as a dataset with open_dataset(), we can convert them to the other file formats supported by Arrow using {arrow}'s write_dataset() function. We are going to convert our collection of .csv.gz files into the Arrow and Parquet formats.
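Something along these lines (the data/arrow and data/parquet output paths are my choice); passing partitioning explicitly preserves the year/month layout:

```r
write_dataset(logs_csv, "data/arrow",
              format = "arrow",
              partitioning = c("year", "month"))

write_dataset(logs_csv, "data/parquet",
              format = "parquet",
              partitioning = c("year", "month"))
```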
Let’s inspect the content of the directories that contain these datasets.
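Again with fs::dir_tree():

```r
fs::dir_tree("data/arrow")
fs::dir_tree("data/parquet")
```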
These two directories have the same year/month layout as our CSV files, since we kept the same partitioning, and the files within them have extensions matching their format. One difference is that there is a single file per month: we used the default settings of write_dataset(), and the number of rows for each month is below the threshold at which this function splits a partition into multiple files.
Comparison of the Different Formats
Let’s compare how much space these different file formats take on disk:
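One way to do this is to sum the file sizes reported by the {fs} package; a sketch, assuming the three datasets live under data/ as above:

```r
library(fs)
library(dplyr)

# Total size of all files under a dataset's root directory
dataset_size <- function(path) {
  dir_info(path, recurse = TRUE) |>
    filter(type == "file") |>
    pull(size) |>
    sum()
}

c(csv     = dataset_size("data/csv"),
  arrow   = dataset_size("data/arrow"),
  parquet = dataset_size("data/parquet"))
```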
The Arrow format takes the most space, at almost 30 GB, while the compressed CSV and Parquet files each use about 5 GB of disk space.
We are now set up to compare the performance of computations on these different dataset formats.
Let’s open these datasets with the different formats:
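One open_dataset() call per format, using the same paths as before:

```r
logs_csv     <- open_dataset("data/csv",     format = "csv")
logs_arrow   <- open_dataset("data/arrow",   format = "arrow")
logs_parquet <- open_dataset("data/parquet", format = "parquet")
```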
We will compare how long it takes for Apache Arrow to compute the 10 most downloaded packages in the time period our dataset covers using each file format.
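A minimal sketch of the query and timing (system.time() is one simple option; a benchmarking package would give more robust numbers). The dplyr pipeline is evaluated lazily by Arrow until collect() is called:

```r
library(dplyr)

# 10 most downloaded packages over the whole dataset
top_10 <- function(ds) {
  ds |>
    count(package) |>
    arrange(desc(n)) |>
    head(10) |>
    collect()
}

system.time(top_10(logs_csv))
system.time(top_10(logs_arrow))
system.time(top_10(logs_parquet))
```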
While it takes about 4 seconds to perform this task on the Arrow or Parquet files, it takes more than 30 seconds on the CSV files.
Conclusion
Having Arrow point directly at a folder of compressed CSV files might be the most convenient option, but it comes with a high performance cost. The Arrow and Parquet formats perform similarly, but the Parquet files take less space on disk and are more suitable for long-term storage. This is why large datasets like the NYC taxi data are distributed as a series of Parquet files.
If you would like to learn more about the different formats, check out the Arrow workshop (especially Part 3: Data Storage) that Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt taught at useR! 2022.
For more support, learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Apache Arrow.
Post Scriptum
Details
[1] Feather was the first iteration of the file format (v1); the Arrow Interprocess Communication (IPC) file format is the newer version (v2) and has many new features.

[2] Comprehensive R Archive Network, the repository for R packages.