Jan 19, 2023

Apache Arrow Flight SQL: Arrow for Every Database Developer

David Li

Photo of airplane in sky
note icon
TL;DR: Apache Arrow Flight SQL is a generic protocol for columnar databases. Thanks to the Arrow community, a database that implements this protocol gets drivers for Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), and Arrow-native applications — without having to build any database-specific drivers.

Modern databases have many exciting new features. One example is making queries faster with columnar data — the same core idea behind Apache Arrow. DuckDB, Google’s AlloyDB, and others use columnar data to be faster at the Online Analytical Processing (OLAP) queries and data analysis tasks that people throw at them.

Unfortunately, the protocols used to talk to the databases are often still row-based. That means wasting time to convert the data for transmission. For example, AlloyDB uses the PostgreSQL wire protocol, so it has to convert from columnar data to the protocol’s row based format before sending data to the client. Worse, clients still want columnar data—they want to use pandas, Apache Spark, R, and all their favorite tools. So they have to convert the data back!

Databases are stuck between a rock and a hard place here, though. By using existing protocols, they get compatibility with existing drivers. Designing a new protocol instead would also mean building JDBC and ODBC drivers for people to use, which is a lot of effort that not every developer wants to take on. That may explain why so many systems go down this route.

Having Our Columnar Cake and Eating it Too

Can we get the best of both worlds? For instance, a protocol that:

  1. Avoids unnecessary data conversions
  2. Can easily be implemented by different databases
  3. Supports a variety of clients

Yes, we can with Apache Arrow Flight SQL a columnar protocol for talking to databases. It defines commands for executing queries, getting metadata, and fetching query results as Arrow data. Arrow Flight SQL isn’t specific to a particular database—as long as a database implements these commands, a client can send a query and get a result. (So it’s purely a protocol, and not a SQL dialect—don’t let the name confuse you.)

Designing a new protocol is a lot of work, as mentioned. Thankfully, the Apache Arrow community has already done the hard work. The Arrow project provides C++, Go, and Java libraries for databases to implement the protocol. Then, users can get JDBC and ODBC drivers from the Arrow community to talk to the database. And Arrow-native clients can use the Arrow Database Connectivity (ADBC) driver to get Arrow data directly (read our blog about ADBC here).

Diagram showing how Arrow data is passed through clients, databases, and drivers.

Under the hood, Arrow Flight SQL builds on other Arrow projects. It transfers data in the Arrow columnar format. This helps make things fast, and Arrow-native clients and servers don’t have to convert data on either end. And instead of building the networking layer from scratch, it extends Arrow Flight RPC, a fast, low-level protocol designed for transporting Arrow data, which avoids unnecessary data copies by hooking into frameworks like gRPC.

A Driver for Every Occasion

On top of that, no matter what your client API of choice is, there are drivers that speak Arrow Flight SQL. For example, Arrow 10.0.0 includes a JDBC driver, and Dremio offers an ODBC driver. They can be used with any database that supports Arrow Flight SQL—that could be yours!

Of course, there’s one glaring problem when using JDBC/ODBC here. We wanted to avoid converting between columnar and row-based data, but JDBC/ODBC are generally row-based (though, ODBC can be bent). So, data makes it all the way to the client in columnar format…only to be converted at the last second to fit the client API.

A client could use Arrow Flight SQL directly to avoid that problem, which is nice if the database supports it. But not every database will. Indeed, some databases, like ClickHouse, already have their own Arrow-based protocols. To cover all our bases, we’d have to integrate with Arrow Flight SQL, other Arrow-based protocols, and protocols for row-based databases. That’s a lot of work, and as the old saying goes: the solution is an abstraction. We want something like JDBC and ODBC, except designed for columnar data. On that note, make sure to read our post about Arrow Database Connectivity (ADBC).

Get Started with Apache Arrow Flight SQL

And if that isn’t enough, learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Apache Arrow.

Photo by: Griselda Servin