Danielle Navarro
Jonathan Keane
Stephanie Hazlitt
Ever tried to work with data too large for your R session’s memory? Familiar with "Error: R cannot allocate memory" or "R Session Aborted 💣"? Or is your R code running, but taking longer and longer to complete as your data gets larger? Well, you are not alone. It is becoming commonplace for data scientists to work with larger-than-memory data, and these issues become more frequent as your data grows.
How do we fix these problems? What tools are available for an R user who wants to solve them without necessarily learning a new data analysis syntax?
Meet Apache Arrow, a multi-language toolbox for working with larger-than-memory data. Started in 2016, the Arrow project is designed to allow data scientists to work efficiently with very large data sets in the programming language of their choice.
To assist data analysts who work in R and want to learn more about Arrow, we’ve put together a tutorial on Larger-Than-Memory Data Workflows with Apache Arrow. On the website you’ll find a written tutorial, slides, and exercises designed to gently guide you along the learning journey. The content is aimed at learners who are new to Arrow but experienced with the R programming language.
We first presented the material back in June to a group of 50 learners in a live online workshop that formed part of the useR! 2022 conference. However, the written content is designed to stand on its own. So if you’re having challenges working with larger-than-memory data in R, then the written tutorial and slide deck may be of interest to you.
The Larger-Than-Memory Data Workflows with Apache Arrow tutorial introduces the Arrow for R package, which provides R users a mature interface to the Apache Arrow toolbox. The tutorial is organized in three main sections:
Okay, so you’re done with the tutorial and are looking for more. Where should you go next? New resources for Apache Arrow are being released regularly. The list is not exhaustive, but here are some of our favorites:
If you prefer video, you can sit back and watch Doing More with Data: An Introduction to Arrow for R Users, a recent talk on the same topic from The Data Thread virtual conference.
https://www.youtube.com/watch?v=O42LUmJZPx0
If you’re working within the Apache Arrow ecosystem, we’re here to support you. Check out Voltron Data Enterprise Support subscription options today.
Photo credit: Hunter Harritt