Jun 07, 2022

Arrow 8.0.0 Release Brings New Functionality for PyArrow, Arrow Flight, C++ Engine, and More

Alessandro Molina, Will Jones

[Photo: Lufthansa Boeing contrails]

Apache Arrow 8.0.0 was released, bringing bug fixes and improvements to the C++, C#, Go, Java, JavaScript, Python, R, Ruby, C GLib, and Rust implementations. In case you missed it, here’s the release blog post, which includes a full list of new features and enhancements.

Arrow 8.0.0 brings critical updates to the Arrow C++ compute engine and the Arrow Cookbook that we want to call attention to. In this post, you’ll learn about a few of the more prominent changes and why they matter.

Temporal Functions

Temporal functions are exposed through the R bindings to allow users to call many lubridate functions in dplyr code and have them run on the Arrow C++ engine. Arrow 8.0.0 added over 20 additional temporal functions, improving the R bindings’ API parity with lubridate. These new temporal compute kernels include:

  • addition, subtraction, and multiplication for dates and durations (timestamp arithmetic was added in 7.0.0);
  • casting integers to durations, exposed in R as constructors that create durations of a given number of days, hours, months, or years;
  • a new “is_dst” function to compute whether the input timestamps fall within daylight saving time (DST);
  • a new “is_leap_year” function to compute whether the input timestamps fall within a leap year.

While the R bindings guided the development of these features, these compute functions are part of the same kernel library that is callable from the C++ library and PyArrow.

For example, using the new functions in R:

library(arrow)
library(dplyr)
library(lubridate)

arrow_table(
  dates = rep(as.POSIXct("2021-01-01T00:00:00"), 3),
  hours = c(1, 2, 3)
) %>%
mutate(
  # New dhours() helper converts integers to durations of hours
  hours = dhours(hours),
  # durations can now be added to timestamps 
  result = dates + hours
) %>%
pull(result)
 
[1] "2021-01-01 01:00:00 PST" "2021-01-01 02:00:00 PST" "2021-01-01 03:00:00 PST"

And calling the equivalent functions in Python:

from datetime import datetime

import pyarrow as pa
import pyarrow.compute as pc

# can now cast integers to durations
hours = pc.multiply(60 * 60, pa.array([1, 2, 3])).cast(pa.duration('s'))

# can now add durations to timestamps
pc.add(pa.array([datetime(2022, 1, 1)] * 3), hours)

[
2022-01-01 01:00:00.000000,
2022-01-01 02:00:00.000000,
2022-01-01 03:00:00.000000
]
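
The duration arithmetic above covers the first two bullets; the new “is_dst” and “is_leap_year” predicates from the list are called the same way. A minimal sketch in Python (the input values here are our own, illustrative choices):

from datetime import datetime

import pyarrow as pa
import pyarrow.compute as pc

naive = pa.array([datetime(2020, 1, 1, 12), datetime(2022, 7, 1, 12)])

# is_dst needs timezone-aware timestamps, so attach a zone first
aware = pc.assume_timezone(naive, timezone="America/Los_Angeles")

pc.is_dst(aware)        # -> [false, true]  (PST in January, PDT in July)
pc.is_leap_year(aware)  # -> [true, false]  (2020 is a leap year, 2022 is not)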

Support for Joins in PyArrow

PyArrow gained support for joining tables and joining datasets, powered by the C++ compute engine. With an API that should be familiar to pandas users, the engine allows joining data based on key columns. In addition to the inner and outer joins supported by pandas, PyArrow also supports semi- and anti-joins, which keep only the rows that do (semi) or do not (anti) have a match on the other side of the join. This long-awaited feature, along with the existing compute kernels and group-by aggregation functionality, should allow many new workloads to be done entirely in PyArrow.

Given two simple tables:

import pyarrow as pa
import pandas as pd

tab1 = pa.table({
    "x": pa.array([1, 2, 3]),
    "y": pa.array(["a", "b", "c"]),
})

tab2 = pa.table({
    "x": pa.array([1, 1, 2, 4]),
    "z": pa.array(range(4)),
})

They can be left-joined together:

# Previously, using merge() in pandas
# tab1.to_pandas().merge(tab2.to_pandas(), on="x", how="left")

# Now, in PyArrow
tab1.join(tab2, keys="x", join_type="left outer")

pyarrow.Table
x: int64
y: string
z: int64
----
x: [[1,1,2,3]]
y: [["a","a","b","c"]]
z: [[0,1,2,null]]

And a left anti-join can be used to remove the rows of tab1 whose keys exist in tab2:

tab1.join(tab2, keys="x", join_type="left anti")

pyarrow.Table
x: int64
y: string
----
x: [[3]]
y: [["c"]]

Initial Work to Support Substrait

The C++ compute engine now has its first iteration of a Substrait consumer, giving it experimental support for running query plans generated by other libraries. If you haven’t heard of Substrait, we’d highly recommend reading our introductory blog post. In short, Substrait is a new open protocol for passing around query plans, much as Arrow is an open protocol for passing around data. Work is already underway in the Ibis Project and DuckDB to create more plan producers and consumers.
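
In 8.0.0 the consumer lives in the C++ engine only. As a rough sketch of the direction, here is what running a serialized plan looks like through the experimental pyarrow.substrait.run_query entry point that later PyArrow releases added (the module, the call, and the plan file name are assumptions here, not 8.0.0 APIs):

import pyarrow as pa
import pyarrow.substrait as substrait

# Read a serialized Substrait plan produced by another library
# ("plan.substrait" is a placeholder file name)
with open("plan.substrait", "rb") as f:
    plan = pa.py_buffer(f.read())

# Hand the plan to the Arrow compute engine and collect the result
reader = substrait.run_query(plan)
table = reader.read_all()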

While the Substrait consumer support is in its early days and evolving with the spec, we are excited about what this will unlock in the near future. Not only will this protocol soon replace our dplyr bindings in the R arrow package, it will also form the foundation for an Ibis front-end, providing a deferred-execution API to Python users.

Tracing with OpenTelemetry

OpenTelemetry tracing is now available to track run times of kernel functions and execution plan nodes. Like the Substrait support, these are some of the first changes within a larger Arrow initiative and will be enhanced in future Arrow versions. Arrow developers are using it to profile and optimize the engine, but it will also be useful to advanced users of Arrow looking to profile systems built using the engine.

Improved Extension Type Support Powering Geospatial Extensions

The R bindings gained better support for extension types: they can now be easily registered from R. Extension types can be used to map special semantics onto Arrow types. For example, you might register a struct array of latitudes and longitudes as an extension type representing an array of geographic coordinates. In addition, the new as_arrow_array S3 method allows you to define conversions from R objects into the appropriate extension array type, which is particularly important if there is metadata that needs to be kept alongside the array data. For example, geographic coordinates need to be associated with their coordinate system. See an in-depth example of registering custom extension types in the R docs.
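
While the R bindings drove this work, the same concept has long been available from Python. A minimal PyArrow sketch of the coordinates example (the type name, metadata, and values are illustrative):

import pyarrow as pa

class CoordinateType(pa.ExtensionType):
    """Extension type wrapping a struct<lat, lon> storage array."""

    def __init__(self):
        storage = pa.struct([("lat", pa.float64()), ("lon", pa.float64())])
        super().__init__(storage, "example.coordinates")

    def __arrow_ext_serialize__(self):
        # Metadata that must travel with the array, e.g. the coordinate system
        return b"crs=EPSG:4326"

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(CoordinateType())

storage = pa.array(
    [{"lat": 45.0, "lon": -64.0}],
    type=pa.struct([("lat", pa.float64()), ("lon", pa.float64())]),
)
coords = pa.ExtensionArray.from_storage(CoordinateType(), storage)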

Extension array improvements are being pushed forward by ongoing work for geospatial support in the Arrow ecosystem. Arrow’s C Data Interface is unlocking zero-copy sharing of geospatial data between programming languages and tools. And Arrow’s Parquet read and write support is providing an efficient way to save large geospatial datasets. To learn more about the motivation for that work and its current state, check out Dewey Dunnington’s blog post Building Bridges: Arrow, Parquet, and Geospatial Computing.

New Content in the Arrow Cookbook

The Arrow Cookbook now has a Java section, providing helpful recipes for common tasks in Arrow. This includes examples of manipulating Arrow objects, reading and writing data, and an example of an Arrow Flight service.

In addition, the C++ and Python cookbooks gained several new examples for Arrow Flight:

  • PyArrow Flight: Authentication with user/password (a client-side sketch follows this list)
  • C++ Flight: Setting gRPC client options, and Flight Service with other gRPC endpoints
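
For a flavor of the first recipe, here is a compact client-side sketch (the server address and credentials are placeholders; the cookbook has the full recipe, including the server side):

import pyarrow.flight as flight

# Connect to a Flight server (placeholder address)
client = flight.FlightClient("grpc://localhost:8815")

# Exchange a username/password for a bearer token...
token_pair = client.authenticate_basic_token(b"user", b"password")

# ...and attach it to subsequent calls
options = flight.FlightCallOptions(headers=[token_pair])
for info in client.list_flights(options=options):
    print(info.descriptor)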

The cookbooks are an essential resource for learning to solve practical problems with Arrow. If you have recipes to contribute or questions to ask, the cookbook repository is open to contributions and questions.

New Contributions from Voltron Data

Voltron Data continues to make major contributions to Arrow and the Arrow community. Here are some statistics on our contributions:

  • 581 commits to apache/arrow
  • 21 commits to apache/arrow-cookbook
  • Voltronauts authored 66% of commits and merged 72% of commits
  • 37 Voltronauts committed
  • 7 first-time Voltronaut contributors

The Data Thread

In a few weeks, we’ll be hosting The Data Thread: a free, virtual conference for the Apache Arrow community. On June 23rd, the event will start with a live keynote by Arrow co-creators Wes McKinney and Jacques Nadeau followed by 25+ live and pre-recorded talks on developments across the Arrow ecosystem. Guest speakers include Paige Bailey (Product Lead at Anyscale), Pedro Pedreira (Software Engineer at Meta), and Randy Zwitch (Head of Developer Relations at Streamlit). See the latest blog post to learn more. To register, visit thedatathread.com.

Getting Started with Arrow

Remember to check out the Apache Arrow Contributor’s Guide for details on how to contribute to the project. The guide takes you through the whole process of setting up a working development environment for Arrow, finding a good first issue you can tackle to get accustomed to the project, and then submitting a pull request (PR) to get your proposal reviewed and merged. The whole process is described in the Step by Step section of the guide. The guide also includes an overview of the communication channels used by Arrow developers, so that if you ever have questions or get stuck working on the code, you can ask for help.

Conclusion

The Apache Arrow 8.0.0 release introduced new capabilities that make it an even more powerful toolkit for data analytics. This release also established foundations for future work that will improve performance and interoperability. The community is pushing new features forward and promptly addressing issues as they arise, making Arrow a reliable framework for projects to adopt.

Additionally, Voltron Data offers services designed to accelerate success with the Apache Arrow ecosystem. Whether you’re experimenting or running production applications with Arrow, learn how we can help you.



Photo credit: https://www.flickr.com/photos/9754872@N08/50551596822