Oct 04, 2022
Data Transfer Between Python and R with rpy2 and Apache Arrow
In my last post, I showed how Apache Arrow makes it possible to hand over data sets from R to Python (and vice versa) without making wasteful copies of the data.
The solution I outlined there was to use the reticulate package to conduct the handover, and rely on Arrow tools on both sides to manage the data. In one sense it’s a perfectly good solution to the problem… but it’s a solution tailor made for R users who need access to Python. When viewed from the perspective of a Python user who needs access to R, it’s a little awkward to have an R package (reticulate) governing the handover. Perhaps we can find a more Pythonic way to approach this?
A solution to our problem is provided by the rpy2 library, which offers an interface to R from Python, and by the rpy2-arrow extension, which allows it to support Arrow objects. Let’s take a look, shall we?
Setting up the Python Environment
For the purposes of this post I’ll create a fresh conda environment that I’ll call “continuation”, partly because this post is a continuation of the previous one and partly because the data set I’ll use later is taken from a database of serialized fiction called To Be Continued….
I was able to install most of the packages I needed through conda-forge, but rpy2 and rpy2-arrow I could only install from PyPI, so I had to use pip for those. The code for setting up my Python environment, executed at the terminal, was as follows:
The purpose of the rpy2 library is to allow users to call R from Python, typically with the goal of allowing access to statistical packages distributed through CRAN. I’m currently using version 3.5.4, and while this blog post won’t even come close to documenting the full power of the library, the rpy2 documentation is quite extensive. To give you a bit of a flavor of it, let’s import the library:
This does not in itself give us access to R. That doesn’t happen until we explicitly import either the robjects module (a high level interface to R) or the rinterface module (a low level interface) and call rinterface.initr(). This post won’t cover rinterface at all; we can accomplish our goals using only the high level interface provided by robjects. So let’s import the module and, in doing so, start an embedded R session:
You’ll notice that this prints a little startup message. If you’re following along at home you’ll probably see something different on your own machine: most likely you’ll see the standard R startup message here. It’s shorter in this output because I modified my .Rprofile to make R less chatty on start up.
Anyway, our next step is to load some packages. In native R code we’d use the library() function for this, but rpy2 provides a more Pythonic approach. Importing the packages submodule gives us access to importr(), which allows us to load packages. The code below illustrates how you can expose the base R package and the utils R package (both of which come bundled with any minimal R installation) to Python:
Once we have access to utils we can call the R function install.packages() to install additional packages from CRAN. However, at this point we need to talk a little about how names are translated by rpy2. As every Python user would immediately notice, install.packages() is not a valid function name in Python: the dot is a special character and not permitted within the name of a function. In contrast, although not generally recommended in R except in special circumstances, function names containing dots are syntactically valid in R and there are functions that use them. So how do we resolve this?
In most cases, the solution is straightforward: rpy2 will automatically convert dots in R to underscores in Python, and so in this instance the function name becomes install_packages(). For example, if I want to install the fortunes package using rpy2, I would use the following command:
There are some subtleties around function name translation, however. I won’t talk about them in this post, other than to mention that the documentation discusses this in the section on calling functions.
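To make the renaming rule concrete, here is a toy sketch in plain Python. It only mimics the dot-to-underscore idea described above and is not rpy2’s actual implementation; the helper name `translate_r_names` is made up for illustration, and `as_numeric` is an invented R name included purely to provoke a clash. It hints at one kind of subtlety: two distinct R names can collapse to the same Python name.

```python
from collections import defaultdict


def translate_r_names(r_names):
    """Map R-style names to Python-style names by swapping dots for underscores.

    Hypothetical helper for illustration only; rpy2 has its own logic.
    Returns (mapping, conflicts), where conflicts holds the Python names
    that more than one R name would translate to.
    """
    grouped = defaultdict(list)
    for r_name in r_names:
        grouped[r_name.replace(".", "_")].append(r_name)

    # Keep only unambiguous translations in the mapping.
    mapping = {py: rs[0] for py, rs in grouped.items() if len(rs) == 1}
    conflicts = {py: rs for py, rs in grouped.items() if len(rs) > 1}
    return mapping, conflicts


mapping, conflicts = translate_r_names(
    ["install.packages", "read.csv", "as.numeric", "as_numeric"]
)
print(mapping)    # unambiguous translations
print(conflicts)  # Python names that would clash after translation
```

How rpy2 itself deals with clashes like this is covered in its documentation rather than in this sketch.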
In any case, now that I have successfully installed the fortunes package I can import it, allowing me to call its functions from Python and pull up one of its quotes.
I’m rather fond of this quote, and it seems very appropriate to the spirit of what polyglot data science is all about. Whatever language or tools we’re working in, we’ve usually chosen them for good reason. But there is no tool that works all the time, nor any language that is ideal for every situation. Sometimes we need something very different, and when we do it is very helpful if our tools are able to talk fluently to each other.
We’re now at the point that we can tackle the problem of transferring data from Python to R, but in order to do that we’ll need some data…
About the Data
I’ve given you so many teasers about the data set for this post that it almost feels a shame to spoil it by revealing the data, but all good things must come to an end I suppose. The data I’m using are taken from the To Be Continued… database of fiction published in Australian newspapers during the 19th and early 20th century. Originally collected using the incredibly cool Trove resource run by the National Library of Australia, the To Be Continued… data are released under a CC-BY-4.0 license and maintained by Katherine Bode and Carol Hetherington. I’m not using the full data set here, only the metadata. In the complete database you can find full text of published pieces, and in the Trove links you can find the digitized resources from which they were sourced, but I don’t need that level of detail here. All I need is an interesting data table that I can pass around between languages. For that, the metadata alone will suffice!
To give you a sense of what the data set (that is, the restricted version I’m using here) looks like, let’s fire up pandas and take a peek at the structure of the table. It’s stored as a CSV file, so I’ll call read_csv() to import the data:
| Trove ID | Common Title | Publication Title | Start Date | End Date | Additional Info | Length | Curated Dataset | Identified Sources | Publication Source | … | Other Names | Publication Author | Gender | Nationality | Nationality Details | Author Details | Inscribed Gender | Inscribed Nationality | Signature | Name Category |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | The Mystery of Edwin Drood | The Mystery of Edwin Drood | 1871-03-04 | 1871-06-03 | NaN | 0.0 | Y | LCVF | NaN | … | NaN | Dickens, Charles | Male | British | NaN | LCVF | Male | British | NaN | Attributed |
| 2 | The Mystery of Edwin Drood | The Mystery of Edwin Drood | 1871-03-07 | 1871-05-16 | NaN | 0.0 | Y | LCVF | NaN | … | NaN | Dickens, Charles | Male | British | NaN | LCVF | Male | British | NaN | Attributed |
| 3 | Sporting Recollections in Various Countries | Sporting Recollections in Various Countries | 1847-06-16 | 1847-07-07 | NaN | 0.0 | Y | WPEDIA | Sunday Times | … | NaN | Viardot, M. Louis | Male | French | NaN | WPEDIA | Male | British | NaN | Attributed |
| 4 | Brownie’s Triumph | The Jewels | 1880-05-08 | 1880-08-14 | NaN | 0.0 | Y | TJW | NaN | … | Sarah Elizabeth Forbush Downs; Downs, Mrs Geor… | Unattributed | Female | American | NaN | WPEDIA | Uninscribed | British | NaN | Unattributed |
| 5 | The Forsaken Bride | Abandoned | 1880-08-21 | 1880-12-18 | Fiction. From English, American and Other Peri… | 0.0 | Y | TJW | NaN | … | Sarah Elizabeth Forbush Downs; Downs, Mrs Geor… | Unattributed | Female | American | NaN | WPEDIA | Uninscribed | British | NaN | Unattributed |
5 rows × 28 columns
Okay, that’s helpful. We can see what all the columns are and what kind of data they contain. I’m still pretty new to data science workflows in Python, but it’s not too difficult to do a little bit of data wrangling with Pandas. For instance, we can take a look at the distribution of nationalities among published authors. The table shown below counts the number of distinct publications (Trove IDs) and authors for each nationality represented in the data:
| Nationality | Trove ID | Publication Author |
| --- | --- | --- |
| Unknown, not Australian | 882 | 88 |
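The aggregation described above can be sketched on a toy frame. The rows below are invented for illustration, and the column names simply mirror the ones in the real table; this is one reasonable way to get such counts in pandas, not necessarily the exact code behind the table.

```python
import pandas as pd

# Toy stand-in for the fiction metadata; the values are invented.
toy = pd.DataFrame(
    {
        "Trove ID": [1, 2, 3, 4, 5],
        "Publication Author": ["A", "A", "B", "C", "C"],
        "Nationality": ["British", "British", "British", "French", "French"],
    }
)

# Count distinct publications and distinct authors for each nationality.
summary = toy.groupby("Nationality")[["Trove ID", "Publication Author"]].nunique()
print(summary)
```

Using nunique() rather than count() matters here: an author with several publications should be counted once.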
It would not come as any surprise, at least not to anyone with a sense of Australian history, that there were far more British authors than Australian authors published in Australian newspapers during that period. I was mildly surprised to see so many American authors represented though, and I have nothing but love for the lone Italian author who published 12 pieces.
Now that we have a sense of the data, let’s add Arrow to the mix!
Pandas to Arrow Tables
To give ourselves access to Apache Arrow from Python we’ll use the PyArrow library. Our immediate goal is to convert the fiction data from a Pandas DataFrame to an Arrow Table. To that end, pyarrow supplies a Table object with a from_pandas() method that we can call:
The fiction2 object contains the same data as fiction but it is structured as an Arrow Table, and the data is stored in memory allocated by Arrow. Python itself only stores some metadata and the C++ pointer that refers to the Arrow Table. This isn’t exciting, but it will be important (and powerful!) when we transfer the data to R.
Speaking of which, we have arrived at the point where we get to do the fun part… seamlessly handing the reins back and forth between Python and R without needing to copy the Arrow Table itself.
Passing Tables from Python to R
To pass Arrow objects between Python and R, rpy2 needs a little help because it doesn’t know how to handle Arrow data structures. That’s where the rpy2-arrow module comes in. As the documentation states:
The package allows the sharing of Apache Arrow data structures (Array, ChunkedArray, Field, RecordBatch, RecordBatchReader, Table, Schema) between Python and R within the same process. The underlying C/C++ pointer is shared, meaning potentially large gain in performance compared to regular arrays or data frames shared between Python and R through the conversion rules included in rpy2.
I won’t attempt to give a full tutorial on rpy2-arrow in this post. Instead, I’ll just show you how to use it to solve the problem at hand. Our first step is to import the conversion tools from rpy2_arrow:
Having done that, the pyarrow_table_to_r_table() function allows us to pass an Arrow Table from Python to R:
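A rough sketch of that handover, wrapped in a small helper so that nothing touches R until the helper is called. The helper name `pass_table_to_r` is my own, and I am assuming the conversion tools live in the `rpy2_arrow.pyarrow_rarrow` module. Running it needs R, the R arrow package, rpy2 and rpy2-arrow to be installed.

```python
def pass_table_to_r(table):
    """Hand a pyarrow Table to the embedded R session.

    Hypothetical wrapper for illustration. The import is deferred so that
    merely defining this function does not start R.
    """
    # Assumption: rpy2-arrow exposes its converters in this module.
    import rpy2_arrow.pyarrow_rarrow as pyra

    # Returns an rpy2 object that refers to an Arrow Table on the R side.
    return pyra.pyarrow_table_to_r_table(table)
```

With the objects from this post, the call would look like `fiction3 = pass_table_to_r(fiction2)`.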
The printed output isn’t the prettiest thing in the world, but nevertheless it does represent the object of interest. On the Python side we have fiction2, a data structure that points to an Arrow Table and enables various compute operations supplied through pyarrow. On the R side we have now created fiction3, a data structure that points to the same Arrow Table and enables compute operations supplied by the R arrow package. In the same way that fiction2 only stores a small amount of metadata in Python, fiction3 stores a small amount of metadata in R. Only this metadata has been copied from Python to R: the data itself remains untouched in Arrow.
Accessing the Table from the R Side
We’re almost done, but the tour isn’t really complete until we’ve stepped out of Python entirely, manipulated the object on the R side, and then passed something back to Python. So let’s do that next.
In order to pull off that trick, it’s helpful to imagine that I’m writing this document using a Jupyter notebook. In that context we could employ a little notebook cell magic, again relying on rpy2 to supply all the sparkly bits. To help us out in this situation, the rpy2 library supplies an interface for interactive work that we can invoke in a notebook context like this:
Now that we’ve included this line, all I have to do is preface each cell with %%R and the subsequent “Python” code will be passed to R and interpreted there. To start with, I’ll load the dplyr and arrow packages, using the suppressMessages() function to prevent them being chatty:
Having loaded the relevant packages, I’ll use the dplyr/arrow toolkit to do a little data wrangling on the fiction3 Table. I’m not doing anything fancy, just a little cross-tabulation counting the joint distribution of genders and nationalities represented in the data using the count() function, and using arrange() to sort the results:
The output isn’t very informative, but don’t worry, by the end of the post there will be a gender reveal, I promise. Besides, the actual values of gender aren’t important right now. In truth, the part that we’re most interested in here is the first line of code. By using %%R -i fiction3 to specify the cell magic, we’re able to access the fiction3 object from R within this cell and perform the required computations.
Oh, and also we now have a new gender object in our R session that we probably want to pull back into Python!
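For readers more at home in pandas than dplyr, the same kind of count-and-sort can be sketched on the Python side. This is a comparison, not the code that produced the gender object in R: the rows are invented, and sorting by descending count is my assumption about how the results were arranged.

```python
import pandas as pd

# Invented rows; only the two columns being cross-tabulated are included.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Male", "Female"],
        "Nationality": ["British", "British", "British", "French", "British"],
    }
)

counts = (
    toy.groupby(["Gender", "Nationality"], dropna=False)  # keep missing values as a group
    .size()                                               # rows per combination
    .reset_index(name="n")                                # back to a flat table
    .sort_values("n", ascending=False, ignore_index=True)
)
print(counts)
```

The dropna=False argument is there because pandas drops missing group keys by default, which would hide any rows where gender or nationality is not recorded.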
A Journey Home: A Tale of Four Genders
Okay. So we now have an object in the embedded R session that we might wish to access from the Python session and convert to a Python object. First we’ll pass the Arrow Table from R to Python and then convert to a Pandas DataFrame. Here’s how that process works. If you recall from earlier in the post, we imported robjects to start the embedded R session. When we did so, we also exposed robjects.r, which provides access to all objects within that R session. To create a Python object gender2 that refers to the R data structure we created in the last section, here’s what we do:
Importantly, notice that this is the same object. The gender2 variable still refers to the Arrow Table in R: it’s not a pyarrow table. If we want to convert it to a data structure that pyarrow understands, we can again use the rpy2-arrow conversion tools, this time going in the other direction, to create a pyarrow Table that I’ll call gender3.
Just like that, we’ve handed over the Arrow Table from R back to Python. Again, it helps to remember that gender2 is an R object and gender3 is a Python object, but both of them point to the same underlying Arrow Table.
In any case, now that we have gender3 on the Python side, we can use the to_pandas() method from pyarrow.Table to convert it to a pandas data frame:
63 rows × 3 columns
And with that our transition home is complete!
This post has wandered over a few topics, which is perhaps to be expected given the nature of polyglot data science. To make it all work smoothly, I needed to think a little about how my Python and R environments are set up, and there are various small frictions that inevitably arose. The R and Python libraries implementing Apache Arrow make it look seamless when we hand over data from one language to another – and in some ways they actually do make it seamless in spite of the many little frictions that exist with Arrow, no less than any other powerful and rapidly-growing tool – but a lot of work has gone into making that transition smooth.
Whether you’re an R focused developer using reticulate or a Python focused developer who prefers rpy2, the toolkit is there. I’m obviously biased in this because so much of my work revolves around Arrow these days, but at some level I’m still actually shocked that it (and other polyglot tools) works as well as it does. Plus, I’m having a surprising amount of fun teaching myself “Pythonic” ways of thinking and coding, so that’s kind of cool too. Hopefully this post will help a few other folks get started in this area!
If you’d like Voltron Data’s help in getting started or optimizing your work with Arrow, head over to our products page and check out our subscription services (there’s even a free version). Examining historical manuscript metadata is a hoot, but the real fun begins when we can start helping you solve your real world data challenges.
In writing this post I am heavily indebted to Isabella Velásquez, whose fabulous post on calling R from Python with rpy2 helped me immensely. The documentation on integrating PyArrow with R was extremely helpful too! Thank you to Kae Suarez for reviewing the original post, and to Keith Britt and Maura Hollebeek for their help in adapting it for this blog.
- The dot is typically used to denote an S3 method in R, but for historical reasons this is not universally true.
- Depending on how fresh your R configuration is, you may need to specify which CRAN mirror you want to download the package from before attempting the installation. To do that, include a command like utils.chooseCRANmirror(ind=1) to select the first mirror on the list of known servers.