Jan 04, 2022

Apache Arrow 7.0.0 – What to Expect

Alessandro Molina, Ian Cook

Light Painting at Night

First of all, we would like to wish everyone Happy New Year 2022! We hope it will bring joy and happiness to all in the Apache Arrow community and beyond. As many of us were wrapping presents and preparing for holidays in December 2021, the Arrow developer community was also planning the upcoming release of Apache Arrow 7.0.0; the release is now targeted for the 3rd week of January.

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory analytics. Over the last 6 years, the Arrow community has built a large collection of libraries in more than 10 programming languages to handle everything from reading Parquet files to vectorized SQL execution to building high-throughput data services. Built on top of an efficient in-memory columnar data format, Arrow has become the de facto standard for accelerating data interchange and in-memory analytics, bringing together the best-of-breed tools from the data science and database ecosystems.

At Voltron Data we believe in an Arrow-Native future and many of our team members work hand in hand with the community, fixing bugs and working on new Arrow functionality. We would like to share with you the new features coming to Arrow packages that we at Voltron Data are extremely excited about!

Database Connectors Take Flight

Arrow Flight is a framework for building high-performance Arrow-native data services. It is built on top of gRPC and the Arrow IPC format, and makes it simple to build distributed computing services that send and receive Arrow datasets. Flight is optimized to transfer the Arrow data between clients and servers, freeing the application developers from having to design bespoke transport solutions, and enabling collaborative data processing from any programming language that supports it.

Database drivers are an important type of data service that we would like to see accelerated with Arrow as well. Since Flight is agnostic to the details of SQL database drivers—such as those built using the ODBC and JDBC protocols—the community has been working on Flight SQL, a middleware framework for developing Flight-powered database interfaces. By adopting Flight SQL in database systems, we can achieve the same kind of API standardization as with ODBC and JDBC while enjoying the accelerated IO throughput enabled by Arrow and Flight. Starting with Arrow 7.0.0, the Flight SQL capabilities will be available in C++ and Java and the work done can be viewed at ARROW-14421 and ARROW-12922.

Aggregating Data in PyArrow

One of the recent growth areas for the project has been in accelerated execution of SQL or data frame operations natively against streaming Arrow datasets. This provides for high-performance data analytics inside Arrow-based applications on data that does not fit into memory. Community contributors have been developing engines in both Rust [(DataFusion)] (https://arrow.apache.org/datafusion/){:target=”_blank”} and C++ [(first proposed in 2019)] (https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y){:target=”_blank”}. These engines have been growing to provide comprehensive support for data frame and analytical SQL operations. In the [Arrow 6.0.0 release] (https://arrow.apache.org/blog/2021/11/08/r-6.0.0/){:target=”_blank”}, support for grouped aggregations (“group by”) became available to R users (via the C++ query engine) using dplyr. However, the necessary bindings had not yet been created to enable the same functionality in Python. This work is being finalized [(see ARROW-14607)] (https://issues.apache.org/jira/browse/ARROW-14607){:target=”_blank”} and will be available in the 7.0.0 release, making it possible to read and aggregate large datasets efficiently with Arrow.

Resources for New Contributors

The community members have been working consistently to reduce the barrier to entry to the Arrow project both for new users and potential open source contributors. The recently-released Arrow Cookbook is intended to help users get started with the Arrow libraries, while the forthcoming New Contributors Guide aims to guide developers through making their first contributions to the project.

To be published with the Arrow 7.0.0 release, this guide for new contributors explains the project, its architecture, and future growth areas. It contains step-by-step instructions on how to start contributing to Arrow, from reporting issues to submitting a first pull request. This work can be tracked at ARROW-14728.

Arrow-native Future is (Almost) Here

With each passing month, Arrow becomes more widely adopted as a standard for accelerating data access and in-memory computing. As Arrow-based query processing systems mature and become available everywhere, more and more applications can become Arrow-native, avoiding data serialization overhead and improving processing efficiency and memory use. Many established data processing frameworks, like Apache Spark, BigQuery, Dremio, and Snowflake have already adopted Arrow for speeding up data access, thus making it easier for users to adopt Arrow.

Arrow 6.0.0 brought a lot of new functionality. Arrow 7.0.0 is also shaping up to be epic. To stay up-to-date with the latest Apache Arrow news and releases, check out Voltron Data Enterprise Support subscription options.



Source: https://www.pexels.com/photo/light-painting-at-night-327509/