From the start, Apache Arrow has aimed to bridge data ecosystems together. With the ADBC and Flight SQL projects under active development, the Arrow community is now working to simplify database connectivity for both clients and vendors. In this post, we’ll explain who each project tries to help and how they fit together.
Why two projects? Because database clients and vendors face overlapping but distinct problems.
Database clients face a choice when deciding how to get Arrow data. They could start from a tried-and-true, generic API like JDBC and ODBC. This choice makes it easy to work with different databases. But, neither API has native support for the Arrow columnar format, so the data has to be converted. Converting row-oriented data to columnar is costly. It adds extra development time and uses expensive hardware resources. Another option is to integrate with database-specific libraries that do offer columnar data. Some examples include clickhouse-cpp or google-cloud-bigquery. This option requires extra development work for each database that clients want to use. The ADBC project provides a simpler way for clients to get Arrow data. ADBC is a standard API for connecting Arrow-native clients with databases, engines, and storage. ADBC is modular, and abstracts over different wire protocols and database-specific libraries.
Database vendors also have to make a difficult choice when deciding how to serve up data to clients. If they implement existing generic APIs and protocols, they have to give up some of the benefits of columnar data. On the other hand, they could implement a custom API. But, they would have to then build out support for every different client, now and into the future. Flight SQL offers an alternative for vendors with a fully integrated, Arrow-native wire protocol. Flight SQL is flexible, and not only integrates with ADBC, but also with JDBC and ODBC.
In short: in the Arrow-native future, clients will use ADBC to connect to databases. ADBC abstracts over the Flight SQL protocol and other options, including JDBC/ODBC and custom clients. Vendors implement the Flight SQL protocol to be compatible with any client using ADBC.
Let’s look at each of these use cases more closely.
Database vendors can use Flight SQL to build on top of the Arrow Flight stack. Flight RPC provides a framework for transporting Arrow data. Flight SQL tells us how to use Flight RPC—what messages to send, what calls to make—for talking to databases that implement the Flight SQL protocol. Just like how projects like Google Cloud Spanner and Amazon Aurora reimplement the Postgres wire protocol to benefit from its ecosystem, you can do the same with Flight SQL and the Arrow ecosystem.
Despite the name, Flight SQL is not a SQL dialect or abstraction layer. SQL queries still need to comply with the database specific SQL syntax to communicate with the database. Flight SQL only acts as an intermediary between the client and the database—all it wants to do is send your query over safely and get your data backBut, on that note, there are discussions ongoing to generalize Flight SQL to Substrait plans, and Substrait does aim to abstract different query languages..
So Flight SQL is the “full stack” for database vendors:
In a nutshell, here is how they work together:
This means that if you are implementing a database, you can add support for Flight SQL to give your users Arrow-native data access.
Currently, database clients who want to access Arrow data have two options. Unfortunately, each option has a downside.
Some systems like Dremio, Google BigQuery, and Snowflake offer Arrow-native data access. But each of them has a different API, and that means extra code for each case. Flight SQL alone cannot help here. It would be like directly using libpq to talk to Postgres: it’s low level, and only helps if the database vendor implements it. And if not, it is hard for a client to retrofit it onto an existing database because it is a “full stack” solution.
Downside: You will benefit from Arrow data, but you will also have to write new code for each database you want to use.
Clients could use JDBC/ODBC. These client APIs abstract over different databases, which means less custom code for each database. However, custom code will need to be written to convert the data to Arrow’s columnar format at the end.
Downside: Converting data from JDBC/ODBC to Arrow slows your application down, and you do not benefit from the fact that many databases are Arrow-native underneath.
With the development of ADBC, there is now a third option for database clients. ADBC is a generic database interaction API that provides clients with a standard API for submitting queries to and fetching Arrow data from databases. Underneath, it abstracts over vendor specific drivers, which can in turn use any protocol, whether Flight SQL or something else. In that respect, ADBC takes inspiration from JDBC and ODBC, but is built for Arrow data and analytics use cases.
Unlike Flight SQL, ADBC is purely the client API that applications code to. This gives database vendors flexibility to bring their own wire protocol: ADBC could delegate to the BigQuery client, or Flight SQL, or even take data from JDBC and transpose it into Arrow data.
ADBC fits together naturally with Flight SQL: databases that implement Flight SQL will get ADBC support “for free”, much like how implementing the Postgres wire protocol gives you JDBC and ODBC drivers “for free”. Any vendor that implements Flight SQL would make their database immediately accessible via ADBC. But, whether Flight SQL is used or not, client applications that use ADBC are none the wiser.
Flight SQL aims to help database vendors that already support columnar data, but would currently have to design their own wire protocols, or bend existing ones to their will. ADBC aims to help database clients that currently have to write specialized code for connecting to each database they want to use, even if all they want is Arrow data in the end. Together, they make getting to a fast, Arrow-native future easy.
Both ADBC and Flight SQL are currently under active development. If you are a database vendor who already has adopted Arrow’s columnar format, we hope you’ll take a look at Flight SQL (available since Arrow 7.0.0) and check out the source repo (https://github.com/apache/arrow). If you are designing a database client and are interested in streamlining how users connect to Arrow data, please have a look at the ADBC source repo (https://github.com/apache/arrow-adbc).
To communicate with Arrow developers about either of these two projects, reach out on the Arrow mailing list: https://arrow.apache.org/community/#mailing-lists.