Today, databases are so ubiquitous it’s hard to do anything without running into one. Whether depositing money into a bank account, digging through sales data to predict future trends, or even just logging in for the day, nearly everything we do touches a database in some way.
Of course, to benefit from a database, we need to be able to put data into it and fetch data back out. We could write code for each database that we want to use in every application we develop, but that has obvious problems. For one, we’d find ourselves with a quadratic explosion of code to write and maintain. And it would be hard to switch between databases, or try a new database in a new project—since even just running a query would need new code.
Instead, developers reach for APIs like JDBC and ODBC. They provide ways to interact with databases generically, for creating connections, running queries, and so on, without having to worry (as much) about the exact database on the other side. Because these APIs are standard, developers can use them instead of a database-specific API and focus on the application itself. Database vendors benefit, too, since they just have to provide drivers to plug their database into these APIs, instead of having to design, develop, and test an entirely new API from scratch.
For a typical web application or business, these mature and battle-tested APIs cover all but the most specialized needs. They did not anticipate every need, however. Data science applications benefit from ingesting as much data as possible, as quickly as possible, and from distributing this work across multiple machines to scale. Data warehouses and data lakes now store terabytes or petabytes, and a query might return a significant fraction of that data. And many of these systems, both databases and their users alike, have adopted columnar data—using projects like Apache Arrow—for the performance and efficiency benefits it brings.
Here, we see these APIs start to break down. JDBC is row-oriented, adding overhead when working with columnar systems. ODBC can offer columnar data access, but it still requires converting data when its format doesn’t match the application’s. And both of these APIs focus on returning results to a single client, but modern systems like Apache Spark, Apache Flink, and others process query results with multiple workers for horizontal scalability.
What does this all mean? Right now, getting data into a data pipeline using an API like JDBC means converting data from row-based APIs into a columnar format, which is purely wasted effort. If the database itself is columnar, then there’s even more waste, as the driver has to convert columns to rows for JDBC, only to convert those rows right back into columns! And to take advantage of multiple machines for scalability, we have to manually partition the query to split up results—there’s no way for a datastore like Dremio or Snowflake to expose this parallelism.
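To make the waste concrete, here is a toy Python sketch of the round trip described above. The names and data are purely illustrative, not any real driver API; the point is that the two transpositions cancel out exactly.

```python
# A columnar store holds each column as a contiguous sequence.
columns = {
    "id": [1, 2, 3],
    "amount": [9.99, 24.50, 3.25],
}

# A row-based API like JDBC hands results to the client row by row,
# so the driver must transpose the columns into rows...
rows = list(zip(*columns.values()))

# ...and a columnar pipeline on the client side immediately transposes
# those rows back into columns, recreating the data it started with.
names = list(columns.keys())
rebuilt = {name: list(values) for name, values in zip(names, zip(*rows))}

# A pure round trip: two transpositions, zero new information.
assert rebuilt == columns
```

At realistic scales, both transpositions burn CPU and memory bandwidth on data that was already in the desired layout.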
We could use ODBC instead, as it offers bulk data access. But there’s still work to convert data from ODBC’s representation into something like Arrow that applications use, work that projects like Turbodbc take on. So for a columnar datastore, implementing an ODBC driver is a large undertaking that still introduces overhead and doesn’t give us a way to tell the user how they can best parallelize their query.
Apache Arrow, as well as the broader data science ecosystem, has a natural, obvious affinity for databases, as both are based on tabular data. Indeed, research into columnar databases both motivated the creation of Arrow and influenced its design. With the introduction of the Arrow Flight SQL protocol, the community has returned to its roots to apply Arrow to database APIs.
Flight SQL provides a standardized API for interacting with databases, much like JDBC and ODBC. Unlike those APIs, Flight SQL was designed for columnar data and distributed execution from the start, by building on Apache Arrow and Arrow Flight. As we’ve discussed before, Arrow Flight lets us reduce or eliminate data serialization overhead, so databases and applications can efficiently send columnar data back and forth. Our deep dive showed how Flight accomplishes this by providing building blocks for developers on top of technologies like gRPC. Flight SQL just takes these components and defines how to use them to talk to databases.
Let’s reconsider our data pipeline from before. If we target Flight SQL and Arrow, we can eliminate pointless data shuffling, meaning more of the cloud bill can go towards productive work. Flight not only accelerates transferring Arrow data across the network for even more speedups, but also builds in the idea of data being split among multiple “endpoints”, so a client can fetch these in parallel—or distribute them across multiple machines. And for our columnar database, implementing Flight SQL is less work because the Arrow libraries handle many of the low-level details.
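The endpoint model can be sketched in plain Python. Here `execute` and `fetch_endpoint` are hypothetical stand-ins, not the real client API: in the actual libraries, executing a query returns a `FlightInfo` listing one ticket per endpoint, and each ticket is redeemed with a `DoGet` call to retrieve that partition of the result.

```python
from concurrent.futures import ThreadPoolExecutor

def execute(query):
    """Hypothetical stand-in: pretend the server split the result
    into three endpoints, each identified by an opaque ticket."""
    return ["ticket-0", "ticket-1", "ticket-2"]

def fetch_endpoint(ticket):
    """Hypothetical stand-in for redeeming one ticket (DoGet),
    which yields that endpoint's share of the columnar result."""
    return {"partition": ticket, "rows": 1000}

tickets = execute("SELECT * FROM sales")

# Because the result is pre-partitioned by the server, the client can
# fetch every endpoint concurrently, or hand each ticket to a
# different worker machine entirely.
with ThreadPoolExecutor(max_workers=len(tickets)) as pool:
    batches = list(pool.map(fetch_endpoint, tickets))
```

The same tickets that let one client parallelize a fetch also let a distributed engine assign each partition to a separate worker, which is exactly the parallelism that JDBC and ODBC have no way to expose.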
Under the hood, Flight SQL is straightforward. Flight itself provides a set of RPC calls and extensible data structures, so Flight SQL “fills in the blanks”, defining a protocol that uses Flight to fetch the results of SQL queries and to get database metadata. The rest is the work of making sure the various use cases and features are covered; by building on Arrow and Flight, we automatically get the benefits of columnar data and fast network transfers. For all the nitty-gritty details, see the protocol definition.
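To sketch the “fills in the blanks” idea: Flight SQL defines command messages, such as `CommandStatementQuery` in the protocol definition, that clients pack into Flight’s existing opaque descriptors, so the generic Flight RPCs themselves need no changes. The real protocol encodes these commands with Protocol Buffers; the minimal analogy below uses JSON purely for illustration.

```python
import json

def make_descriptor(command_type, payload):
    """Pack a Flight SQL-style command into an opaque descriptor,
    the slot Flight already provides for protocol-specific data.
    (Illustrative only: the real protocol uses protobuf, not JSON.)"""
    return json.dumps({"type": command_type, **payload}).encode()

# The analogue of Flight SQL's CommandStatementQuery message.
descriptor = make_descriptor("CommandStatementQuery",
                             {"query": "SELECT * FROM sales"})

# A server dispatches on the command type found inside the descriptor,
# while the surrounding Flight RPCs stay exactly as Flight defined them.
command = json.loads(descriptor)
assert command["type"] == "CommandStatementQuery"
```

This layering is why implementing Flight SQL is comparatively little work for a database: the transport, data format, and RPC machinery all come from Flight and Arrow, and only the command vocabulary is new.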
Existing APIs like JDBC and ODBC have served the community well, but we see an opportunity to improve performance and reduce implementation costs for data science workloads. Flight SQL provides similar functionality to these APIs, but it leverages Apache Arrow as the columnar data representation and Arrow Flight for accelerated data transfers. This eliminates the overhead of converting data between formats and consolidates much of the implementation effort for databases and clients into open-source libraries.
Flight SQL is available in Apache Arrow 7.0.0, but there’s more to come. The community is already working on applications built on top of Flight SQL, as well as extensions to the protocol. If you’re interested, chime in on the mailing list or GitHub.
We’d like to thank our friends at Dremio and Bit Quill Technologies for proposing and contributing to Flight SQL, particularly James Duong, as well as the Arrow community for helping review and improve it.