Apache Arrow Version 7.0.0 Released

In case you missed it, the latest set of updates to Apache Arrow was shared earlier this month in the Apache Arrow 7.0.0 release blog post. Today, we are highlighting a few of the more prominent changes you can expect to encounter, along with details on how you can become a contributor.

Flight SQL Has Landed in Apache Arrow

Apache Arrow 7.0.0 includes C++ and Java implementations of Flight SQL, a new API and protocol for interacting with SQL databases. Flight SQL uses Arrow to provide query results in a cross-language columnar memory format, unlike existing APIs which are row-based or use their own formats. This avoids unnecessary serialization/deserialization, which can account for as much as 80% of the runtime of a workload. Furthermore, it builds on Arrow Flight, a cross-language remote procedure call (RPC) framework specialized for Arrow data, as a wire protocol to further speed up data transfer and reduce the amount of work needed for a database to add support.

Compared to APIs like ODBC and JDBC, Flight SQL focuses on bulk queries and does not require managing cursors or ResultSets, reducing API complexity and overhead. Instead of writing a driver for each database, the included client will work with any database that supports the necessary Flight RPC methods. And Arrow Flight supports concurrent requests and fetching results in parallel, making it easier to scale out and avoid bottlenecks.

At a high level, clients use Flight SQL to execute queries and fetch metadata by sending requests over Arrow Flight, then reading the results as streams of Arrow record batches. As a brief example, we can query SQLite using the example Flight SQL server included in the Arrow C++ test suite. First build Arrow with the tests and Flight SQL enabled, then start the flight_sql_test_server before running the example:

text

/arrow/build$ ./debug/flight_sql_test_server &
Server listening on localhost:31337
/arrow/build$ ./debug/flight_sql_example -host localhost
Connecting to grpc+tcp://localhost:31337
Executing query: 'SELECT * FROM intTable WHERE value >= 0'
Read one chunk:
id: int64
keyName: string
value: int64
foreignId: int64
...snip...

Work on this project is still progressing, with the community currently implementing a JDBC driver using Flight SQL. To learn more or get involved, join the mailing list or chime in on GitHub, and read this introduction on the Arrow blog.

Aggregating Data in PyArrow

In version 7.0.0, PyArrow introduced the pyarrow.Table.group_by method, which lets you perform aggregations over groupings that you define. In previous versions, grouped aggregation functions such as min, max, count, and sum existed in the Arrow C++ library, but they were not usable from Python because there was no way to declare the grouping to which those functions should be applied.

With Table.group_by, it is now possible to declare a grouping and then apply aggregation functions to it.

Given a simple Table like:

python

import pyarrow as pa

t = pa.table([
    pa.array(["a", "a", "b", "b", "c"]),
    pa.array([1, 2, 3, 4, 5]),
], names=["keys", "values"])

The grouping can be declared by specifying the column, or list of columns, to group by:

python

t.group_by("keys")

And once you have a grouping, you can apply the aggregations you care about. For example, we can use the sum and count aggregations to total the values in each group and count the non-null keys in it.

python

t.group_by("keys").aggregate([
    ("values", "sum"),
    ("keys", "count")
])

The result is itself a Table, with one column for each aggregation plus the grouping keys.

text

pyarrow.Table
values_sum: int64
keys_count: int64
keys: string
----
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]
keys: [["a","b","c"]]

For additional information regarding how to use grouped aggregations, you can refer to the Grouped Aggregations section of the Compute documentation page.

Contributing to Apache Arrow

When someone decides to contribute to an open source project, a common first question is “Where should I start?” The codebase of a big multi-year project can look complex and scary the first time you approach it. Even with navigable code, it might not be immediately obvious what steps you should take to propose changes or additions to the code, or how to ensure that your contributions align with the direction of the project.

To answer this and many other questions, the Apache Arrow project created the New Contributor’s Guide. The guide takes you through the whole process of setting up a working development environment for Arrow, finding a good first issue you can tackle to get accustomed to the project, and then submitting a pull request (PR) to get your proposal reviewed and merged. The whole process is described in the Step by Step section of the guide.

Currently, tutorials exist for both Python and R to guide you through the process of contributing a new feature to Apache Arrow. The steps explained in the tutorials cover everything from properly describing the proposed new feature in a Jira issue to creating and updating a PR to facilitate efficient review.

The New Contributor’s Guide also includes an overview of the communication channels used by Arrow developers, so that if you ever have questions or get stuck working on the code, you can ask for help.

Arrow 7.0.0 introduced a lot of new functionality that helps to simplify and speed up working with databases or contributing to the project. The community is pushing forward new features and promptly addressing any issues that arise, making Arrow a reliable framework for projects to adopt.

If you’re a developer or company working within the Apache Arrow ecosystem, we’re also here to support you. Check out Voltron Data Enterprise Support subscription options today.

Photo credit: https://www.pexels.com/photo/bottom-view-of-plane-with-contrail-1436697/