Arrow Database Connectivity: Apache Arrow for Every Database User
David Li · Jan 19, 2023
TL;DR: Arrow Database Connectivity (ADBC) is a single API for using different databases with Arrow data. Instead of tediously writing code to extract Arrow data out of different database APIs, both Arrow-native and not, just use the ADBC API instead and let the ADBC implementation take care of the work.
Columnar data makes data analysis fast: that’s what we believe as contributors to Apache Arrow. Projects from DuckDB to pandas to Apache Spark all share this same insight, using columnar data internally. Unfortunately, this often breaks down when putting different systems together. Protocols, APIs, and file formats are often row-based, forcing us to waste time converting data to move it between systems.
The Apache Arrow project is trying to solve these problems. For databases, Arrow Flight SQL provides a protocol for efficiently getting Arrow data in and out of databases. But that isn’t quite enough. Not every database is going to support Arrow Flight SQL, so you still have to write code to deal with other protocols and convert to columnar data if needed.
Normally, you’d reach for Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC), which put databases behind a generic API. You code against that API, and behind the scenes, drivers translate it for the actual database. When you want Arrow data, however, they fall short. JDBC is row-based, and while ODBC can handle columnar data, its type system doesn’t quite mesh with Arrow’s. So using these APIs with columnar data leaves you running in circles: you get columnar data from the database, convert it to rows for JDBC/ODBC, then convert it back to columns on the other side!
Starting at the Finish Line
The basic idea behind JDBC and ODBC still makes sense: one API for every database. So the Arrow Database Connectivity (ADBC) project is just that, except with Arrow data. ADBC provides generic, Arrow-based APIs for database access. Then, ADBC drivers translate those APIs for the actual database. If the database speaks Arrow — whether through Arrow Flight SQL or something else — then the data gets passed through without conversion. Otherwise, the driver converts the data to Arrow format for you.
ADBC gives us a single way to get Arrow data in and out of databases, whether the database is columnar or not. And once you have Arrow data, you can easily use it with your favorite tools, whether that’s pandas, Ibis, or something else. Using Arrow and ADBC means you get to start at the finish line. The application doesn’t have to worry about converting data or vendor-specific SDKs, since ADBC handles all that.
Arrow Database Connectivity (ADBC) Basics
ADBC Language Compatibility
ADBC offers APIs for C/C++, Go, Java, Python, and Ruby. R bindings are in development. The long-term goal is for ADBC to be available in your language of choice.
ADBC Database & Driver Availability
Drivers are available for a few databases, with more planned. C/C++, Python, and Ruby users can choose from drivers for Arrow Flight SQL, PostgreSQL, and SQLite. (Experimental DuckDB integration is also available.) Java users have drivers for Arrow Flight SQL and JDBC. Go users can use any C/C++ driver via cgo through the standard database/sql interface.
Here’s a quick example using Python and SQLite. The Python API follows Python’s DB-API 2.0 standard (PEP 249) to make it easy to use:
Get Started with ADBC
- Read the API definitions for all the details.
- Check out the Python Quickstart for a tour of ADBC features.
- Look at the documentation to install and use the drivers.
ADBC is a new project, and we’re excited to see where it goes. If you’re interested in contributing or have feedback, let us know by opening an issue or joining the mailing list. And, if you’re already on board with Arrow, learn how a Voltron Data Enterprise Support subscription can help accelerate your success with Apache Arrow.