Apache Arrow Version 9.0.0 Released
Alessandro Molina, Ian Cook · Aug 24, 2022
The Apache Arrow 9.0.0 release introduced new capabilities that make Arrow an even more powerful toolkit for data analytics. This release also established foundations for future work that will improve performance and interoperability. The community continues to push forward new features and promptly address issues as they arise, making Arrow a reliable framework for projects to adopt.
Built-in support for Google Cloud Storage
The filesystem interface in the Arrow C++ library now includes built-in support for Google Cloud Storage (GCS). With this addition, plus existing support for Amazon S3, developers can now depend on the Arrow C++ library for reading and writing data files on these two popular cloud storage platforms using the same high-level interfaces that are used for working with local files. Users of Python and R will be excited to hear that this capability is exposed through PyArrow and the arrow R package (which both rely on the Arrow C++ library under the hood). For details, see the documentation pages for the Arrow C++ filesystem interface, PyArrow filesystem interface, and the arrow R package filesystem interface.
Apache Arrow C++ Compute Engine – Now Called Acero
Since early 2021, the Arrow C++ developer community has been hard at work adding streaming query execution capabilities to the Arrow C++ library. These capabilities now enable developers to efficiently perform analytics and data processing tasks directly on Arrow columnar data, without depending on any other systems or tools. Together with the growing set of compute functions and the datasets functionality, this query execution machinery has opened up many new practical applications for the Arrow C++ library.
We are excited to share that this Arrow C++ query execution engine now has a name: Acero. You’ll see us using this new shorter name in future communications about the engine. For an introduction to Acero, you can watch the Acero talk from The Data Thread.
Acero Performance Improvements
The hash-join operator was rewritten to make better use of the CPU cache and to process more of its data in batches rather than row by row, which lets compilers better optimize for vectorized hardware. In addition, when a plan features a chain of hash-join operators, Bloom filters created from downstream build tables may be used to eliminate rows further upstream and reduce the overall CPU cost.
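The Bloom filter idea can be shown with a toy example in plain Python. This is a conceptual sketch only, not Acero's implementation. The `BloomFilter` class, the `hash_join` function, and the filter sizes are all illustrative choices of ours. A filter built from the build-side keys lets the join discard probe rows that cannot possibly match before any further work is done on them.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions per key by salting a single hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def hash_join(probe_rows, build_rows, key):
    """Inner hash join that pre-filters probe rows with a Bloom filter."""
    table = {}
    bloom = BloomFilter()
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
        bloom.add(row[key])

    joined, skipped = [], 0
    for row in probe_rows:
        if not bloom.might_contain(row[key]):
            skipped += 1  # rejected early; cannot match any build row
            continue
        for match in table.get(row[key], []):
            joined.append({**row, **match})
    return joined, skipped


build = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
probe = [{"id": 1, "amount": 10}, {"id": 3, "amount": 5}, {"id": 2, "amount": 7}]
joined, skipped = hash_join(probe, build, "id")
```

In this toy version the filter sits right next to the hash table, so it saves little. The benefit described above comes from pushing the filter further upstream in a chain of joins, so that rows are dropped before earlier operators spend any work on them.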
Acero Temporal Functions
There are now more features for working with temporal data, including as-of join capabilities contributed by Ji Lin at Two Sigma (highlighted in his Data Thread talk) and a number of new temporal functions, including support for rounding timestamps, which is exposed through lubridate functions in the arrow R package.
Calendar-based temporal rounding was added to the existing temporal rounding kernels, enabling lubridate-like rounding behavior. Bindings for lubridate’s floor_date, ceiling_date and round_date were also added in the arrow R package. The same lubridate-like behavior can also be invoked from PyArrow.
The Growing Arrow Developer Community
The Arrow developer community is vibrant and continues to grow. Here are some Arrow 9.0.0 release statistics from Voltron Data and the Arrow community.
- 529 commits from 114 contributors in apache/arrow
- 43 of those contributors were first-time contributors to apache/arrow
- 38 contributors to this release are supported by Voltron Data
The Data Thread
The inaugural edition of The Data Thread provided a forum for 52 speakers from 30 different companies across 13 countries to share their unique perspectives on Apache Arrow, Ibis, and so much more. The Data Thread wouldn’t have been a success without the contributions and participation of the Arrow community, so thank you for your support! More than 12 hours of exclusive content is available now.
To continue the momentum, we’re hosting a new limited-run series called Pulling the Thread that features some of your favorite conference speakers. These live one-hour sessions will take place on select Wednesdays through November and provide an opportunity to ask questions and gain greater insights into Arrow. Subscribe to The Data Thread YouTube channel to receive updates on the upcoming events.
Getting Started with Apache Arrow
Remember to check out the Apache Arrow New Contributor’s Guide for details on how to contribute to the project. The guide takes you through setting up a working development environment for Arrow, finding a good first issue you can tackle to get accustomed to the project, and then submitting a pull request (PR) to get your proposal reviewed and merged. The whole process is described in the Step by Step section of the guide. The guide also includes an overview of the communication channels used by Arrow developers, so that if you ever have questions or get stuck working on the code, you can ask for help.
If you’re working within the Apache Arrow ecosystem, we’re here to support you. Check out Voltron Data Enterprise Support subscription options today.
Photo credit: https://www.flickr.com/photos/n28307/45758713564/