Ever tried to work with data too large for your R session’s memory? Familiar with “Error: R cannot allocate memory” or “R Session Aborted 💣”? Or does your R code run, but take longer and longer to complete as your data gets larger? Well, you are not alone. It is becoming commonplace for data scientists to work with larger-than-memory data, and these issues become more frequent as your data grows.
How do we fix these problems? What tools are available for an R user who wants to solve them without necessarily learning a new data analysis syntax?
Meet Apache Arrow, a multi-language toolbox for working with larger-than-memory data. Started in 2016, the Arrow project is designed to allow data scientists to work efficiently with very large data sets in the programming language of their choice.
To assist data analysts who work in R and want to learn more about Arrow, we’ve put together a tutorial, Larger-Than-Memory Data Workflows with Apache Arrow. On the tutorial website you’ll find a written tutorial, slides, and exercises designed to gently guide you along the learning journey. The content is aimed at learners who are new to Arrow but experienced with the R programming language. We first presented the material in June to a group of 50 learners in a live online workshop that formed part of the useR! 2022 conference, but the written content is designed to stand on its own. So if you’re having challenges working with larger-than-memory data in R, the written tutorial and slide deck may be of interest to you.
The Larger-Than-Memory Data Workflows with Apache Arrow tutorial introduces the Arrow for R package, which provides R users with a mature interface to the Apache Arrow toolbox. The tutorial is organized in three main sections:
The goal of the tutorial is to help new-to-Arrow R users overcome some common struggles when working with large data in R.
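To give a flavor of what this looks like in practice, here is a minimal sketch that is not taken from the tutorial itself. It assumes the arrow and dplyr packages are installed, uses the small built-in mtcars data as a stand-in for a genuinely large data set, and writes to a hypothetical folder under tempdir(). The point is that the query is written with ordinary dplyr verbs, and only the small summarized result is pulled into R's memory.

```r
library(arrow)
library(dplyr)

# Write a small built-in data set to disk as Parquet files, one folder per
# value of `cyl`. mtcars is only a stand-in for data too large to load at once,
# and the folder name is made up for this example.
dataset_path <- file.path(tempdir(), "mtcars_parquet")
write_dataset(mtcars, dataset_path, format = "parquet", partitioning = "cyl")

# Open the folder as an Arrow Dataset. This discovers the files and their
# schema, but does not load the data into R.
cars <- open_dataset(dataset_path)

# Describe the query with familiar dplyr verbs. Nothing is computed yet.
query <- cars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n()) %>%
  arrange(cyl)

# collect() runs the query in Arrow and returns only the summarized rows
# to R as a data frame.
result <- collect(query)
print(result)
```

With a real larger-than-memory data set, the same pattern would apply: point open_dataset() at the folder of files, build the query, and call collect() only once the result is small enough to fit in memory.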
Okay, so you’re done with the tutorial and are looking for more. Where should you go next? New resources for Apache Arrow are being released regularly. The list below is not exhaustive, but here are some of our favorites:
If you prefer video, you can sit back and watch Doing More with Data: An Introduction to Arrow for R Users, a recent talk on the same topic from The Data Thread virtual conference.
We hope these resources help you make the most of Apache Arrow!
Photo credit: Hunter Harritt