Sep 06, 2022
Getting Started with Apache Arrow in R
Danielle Navarro, Jonathan Keane, Stephanie Hazlitt
Ever tried to work with data too large for your R session’s memory? Familiar with Error: R cannot allocate memory or R Session Aborted 💣? Or perhaps your R code runs, but takes longer and longer to complete as your data grows? Well, you are not alone. Working with larger-than-memory data is becoming commonplace for data scientists, and these issues only get more frequent as your data gets bigger.
How do we fix these problems? What tools are available for an R user who wants to solve them without necessarily learning a new data analysis syntax?
An Apache Arrow Tutorial
Meet Apache Arrow, a multi-language toolbox for working with larger-than-memory data. Started in 2016, the Arrow project is designed to allow data scientists to work efficiently with very large data sets in the programming language of their choice. To assist data analysts who work in R and want to learn more about Arrow, we’ve put together a tutorial on Larger-Than-Memory Data Workflows with Apache Arrow. On the website you’ll find a written tutorial, slides, and exercises designed to gently guide you along the learning journey. The content is aimed at learners who are new to Arrow but experienced with the R programming language. We first presented the material back in June to a group of 50 learners in a live online workshop that formed part of the useR! 2022 conference. However, the written content is designed to stand on its own. So if you’re having challenges working with larger-than-memory data in R, then the written tutorial and slide deck may be of interest to you.
What Does the Tutorial Cover?
The Larger-Than-Memory Data Workflows with Apache Arrow tutorial introduces the Arrow for R package, which provides R users with a mature interface to the Apache Arrow toolbox. The tutorial is organized into three main sections:
- First, it walks learners through interoperable data file formats like Parquet and Feather for more efficient data storage and access. But don’t worry, you can also use CSV files with Arrow.
- Second, it covers how to engineer and process larger-than-memory files and multi-file datasets with familiar R dplyr syntax.
- And last but not least, it covers how to exercise fine control over data types to avoid common data pipeline problems.

The goal of the tutorial is to help new-to-Arrow R users overcome some common struggles when working with large data in R.
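To give a taste of the first two sections, here is a minimal sketch using the arrow and dplyr packages. The file path and the use of mtcars as stand-in data are illustrative only; your own pipeline will depend on your data:

```r
library(arrow)
library(dplyr)

# Write a data frame to Parquet, an efficient columnar file format
# (mtcars stands in for your own, much larger data)
write_parquet(mtcars, "mtcars.parquet")

# Open the file as an Arrow Dataset: the data is scanned lazily
# rather than loaded into R's memory
ds <- open_dataset("mtcars.parquet")

# Familiar dplyr verbs build up a query; nothing is computed yet
ds |>
  filter(cyl >= 6) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp)) |>
  collect()  # collect() runs the query and returns a tibble
```

Because the query is only executed at `collect()`, Arrow can push the filtering and aggregation down to the data files and bring just the small result back into R.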
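The third section is about controlling data types. As one illustrative sketch (the file name and column names here are hypothetical), you can declare a schema when reading a CSV file so that Arrow does not have to guess the types:

```r
library(arrow)

# Declare column types up front instead of letting Arrow infer them,
# avoiding surprises such as an ID column being parsed as an integer
my_schema <- schema(
  id = string(),
  measurement = float64(),
  recorded_at = timestamp("s")
)

# When supplying a full schema, skip the header row in the file
tbl <- read_csv_arrow("measurements.csv", schema = my_schema, skip = 1)
```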
Where Should a New Arrow User Look Next?
Okay, so you’re done with the tutorial and are looking for more. Where should you go next? New resources for Apache Arrow are being released regularly. This list is by no means exhaustive, but here are some of our favorites:
- The Arrow for R cheatsheet is a short, handy guide to using Arrow in R.
- The Apache Arrow R Cookbook contains solutions to common data analysis problems.
- There are many articles included with the Arrow for R package documentation that provide deep dives into key topic areas.
If you like a video format, you can sit back and watch Doing More with Data: An Introduction to Arrow for R Users, a recent talk on the same topic from The Data Thread virtual conference.
If you’re working within the Apache Arrow ecosystem, we’re here to support you. Check out Voltron Data Enterprise Support subscription options today.
Photo credit: Hunter Harritt