Jun 20, 2023

Zero-Copy Sharing using Apache Arrow and Golang

Matthew Topol

Zero-Copy In-Process Data Sharing
Voltron Data is building open source standards that support a multi-language future. Central to this vision is the Apache Arrow project.

To demonstrate Arrow’s cross-language capabilities, Matt Topol, Staff Software Engineer and author of "In-Memory Analytics with Apache Arrow", wrote a series of blogs covering the Apache Arrow Golang (Go) module. This series will help users get started with Arrow and Go and see how you can use both to build effective data science workflows.

This is the final post in our four-part series.

It’s time for the final post in our series to get you started with Apache Arrow and Golang. Our previous post covered how to efficiently send your data across the network using Arrow IPC and Arrow Flight RPC. In this post, we’re covering a different situation: sending data within the same process by sharing the memory directly without copying. Let’s hop down this rabbit hole!

Caring is Sharing… Your Local Memory

Okay, picture this: You have an awesome data utility that can handle a useful task, but it can’t be called directly from your environment of choice (in our case Go, but this could be any language or environment). One workaround is to write your data out to a file and have the utility read that file, but if your data is large enough, that can be very costly in disk space, memory, and CPU time. Alternatively, the utility could be a service you call, but then you have to pay the cost of sending the data across the network. What if we could hand the utility a pointer to the data and let it use the data “as-is” without any copying? With the Arrow C data interface, you can!
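To make the difference concrete, here is a minimal pure-Go sketch of the two hand-off styles. It does not use Arrow at all: the helper names (sumCopy, sumShared, sharesMemory) are made up for this illustration, and a plain slice stands in for the data. The point is only that the shared version hands over a pointer to the same memory, while the copying version duplicates every byte first.

```go
package main

import "fmt"

// sum adds up the values it is given.
func sum(vals []int64) int64 {
	var total int64
	for _, v := range vals {
		total += v
	}
	return total
}

// sharesMemory reports whether two non-empty slices start at the same address.
func sharesMemory(a, b []int64) bool {
	return len(a) > 0 && len(b) > 0 && &a[0] == &b[0]
}

// sumCopy duplicates the data before using it, much like a round trip
// through a file or a network connection would.
func sumCopy(data []int64) (int64, bool) {
	dup := make([]int64, len(data))
	copy(dup, data)
	return sum(dup), sharesMemory(dup, data)
}

// sumShared uses the data in place: assigning a slice copies only the
// small slice header (pointer, length, capacity), not the elements.
func sumShared(data []int64) (int64, bool) {
	view := data
	return sum(view), sharesMemory(view, data)
}

func main() {
	data := []int64{1, 2, 3, 4}

	total, shared := sumCopy(data)
	fmt.Println(total, shared) // 10 false

	total, shared = sumShared(data)
	fmt.Println(total, shared) // 10 true
}
```

Both calls compute the same result, but only the second one reads the caller's memory directly. The C data interface gives you that second behavior across language boundaries within one process.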

You can read more about the rationale and goals behind the C data interface in the Arrow docs, but the point I’m getting at is that the Go package provides utilities to both import and export data via this interface. The drawback is that it requires cgo, which has a few caveats that I won’t get into here.

Let’s walk through a simple example: suppose you want to utilize DuckDB. DuckDB exposes an Apache Arrow compatible interface built on the C data interface, so you can avoid extra copies of the results. One way you could utilize this is as follows:

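First, we set up the necessary C flags to link against libduckdb.so and include the header. We also include stdlib.h so that C.free is available for releasing the C strings we allocate later:

// #cgo LDFLAGS: -lduckdb
// #include <stdlib.h>
// #include <duckdb.h>
import "C"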

Then, we’ll have a function that accepts a query string and returns the results or an error. In a real situation, we’d want to store the pointers to the DuckDB connection and database, but for our purposes here we’ll just close them at the end of the function with defer.

import (
    "errors"
    "unsafe"
    ...
    "github.com/apache/arrow/go/v12/arrow"
    "github.com/apache/arrow/go/v12/arrow/cdata"
    ...
)

func queryDuckDB(query string) (arrow.Array, error) {
    var (
        db C.duckdb_database
        cnxn C.duckdb_connection
        result C.duckdb_arrow
        dbpath string = ...
    )
    cpath := C.CString(dbpath)
    defer C.free(unsafe.Pointer(cpath))

    // in a real scenario, you'd keep the db and connection open longer
    // than just for the length of this function call, but this serves the
    // example fine
    if state := C.duckdb_open(cpath, &db); state == C.DuckDBError {
        return nil, errors.New("open error")
    }
    defer C.duckdb_close(&db)

    if state := C.duckdb_connect(db, &cnxn); state == C.DuckDBError {
        return nil, errors.New("connect error")
    }
    defer C.duckdb_disconnect(&cnxn)

    // now we can query the database!
    ...
}

Finally, we can send the query and import the result data without copying it, just by passing pointers.

cquery := C.CString(query)
defer C.free(unsafe.Pointer(cquery))

state := C.duckdb_query_arrow(cnxn, cquery, &result)
if state == C.DuckDBError {
    return nil, errors.New("query error")
}
defer C.duckdb_destroy_arrow(&result)

// okay, now we can actually fetch the data!
var schema cdata.CArrowSchema
var carr cdata.CArrowArray

// populate the C schema struct for the result
state = C.duckdb_query_arrow_schema(result,
    (*C.duckdb_arrow_schema)(unsafe.Pointer(&schema)))
if state == C.DuckDBError {
    return nil, errors.New("schema error")
}

// populate the C array struct with the result data
state = C.duckdb_query_arrow_array(result,
    (*C.duckdb_arrow_array)(unsafe.Pointer(&carr)))
if state == C.DuckDBError {
    cdata.ReleaseCArrowSchema(&schema)
    return nil, errors.New("array error")
}

// ImportCArray returns the field, the imported array, and an error;
// we only need the array here
_, arr, err := cdata.ImportCArray(&carr, &schema)
if err != nil {
    return nil, err
}

// the arrow.Array now owns the C allocated memory and ArrowArray's
// release callback will be called when the internal ArrayData's
// refcount goes to 0 and it is cleaned up
return arr, nil
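
The ownership rule in that final comment (a release callback that fires once the reference count reaches zero) can be sketched in pure Go. This is only an illustration of the pattern, not the Arrow module's actual implementation: sharedBuffer, newSharedBuffer, and the callback are hypothetical names invented for this sketch, and the callback stands in for the producer freeing its own memory.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sharedBuffer is a hypothetical stand-in for memory owned by a producer.
// The consumer never frees the bytes itself; it only signals, through the
// release callback, that it is done with them.
type sharedBuffer struct {
	data     []byte
	refCount int64
	release  func() // provided by the producer; called exactly once
}

// newSharedBuffer wraps data with an initial reference count of 1.
func newSharedBuffer(data []byte, release func()) *sharedBuffer {
	return &sharedBuffer{data: data, refCount: 1, release: release}
}

// Retain registers one more user of the buffer.
func (b *sharedBuffer) Retain() {
	atomic.AddInt64(&b.refCount, 1)
}

// Release drops one reference. When the count reaches zero, the producer's
// callback runs and the buffer must no longer be used.
func (b *sharedBuffer) Release() {
	if atomic.AddInt64(&b.refCount, -1) == 0 {
		if b.release != nil {
			b.release()
		}
		b.data = nil
	}
}

func main() {
	released := false
	buf := newSharedBuffer([]byte("arrow"), func() { released = true })

	buf.Retain()          // a second consumer starts using the buffer
	buf.Release()         // the first consumer is done
	fmt.Println(released) // false: one reference is still outstanding

	buf.Release()         // the last consumer is done
	fmt.Println(released) // true: the producer was told to free the memory
}
```

In practice, this is why callers of a function like queryDuckDB should call Release on the returned array when they are finished with it, so that the memory allocated on the C side can be freed.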

Next Steps…

Hopefully at this point, I’ve presented a compelling case for utilizing Apache Arrow and Go to write useful utilities and services for manipulating data and/or building workflows! If you want to learn more, you’ve got a few options:

  • As mentioned, you can read the documentation on pkg.go.dev
  • Check out my book, “In-Memory Analytics with Apache Arrow”, for many more examples and in-depth descriptions of the Arrow format and use cases. (Note: It also has a corresponding GitHub repository with all the code samples from the book, released under the MIT license.)
  • Check out ADBC if you want to be able to query various databases (like DuckDB) easily, with all the low-level work done for you already!

It has been a pleasure to share these Arrow and Golang examples with you. If you’re interested in learning more about how Voltron Data helps enterprises design and build data systems using projects like Arrow, you can learn about our approach here.