Nov 01, 2022
The Takeaway: Go and Apache Arrow at ApacheCon ’22
Matt Topol
In early October, open source engineers and developers gathered in New Orleans, LA, for ApacheCon North America. Over the course of four days, 160+ sessions took place covering topics like big data, IoT, geospatial, fintech, Apache Cassandra, and more.
It was a big event – with a far reach. People from China, France, Canada, Bulgaria, the United States, and beyond came to learn about Apache open source projects and hear from the companies using them. Matt Topol, Software Engineer at Voltron Data, also attended. “It was a fascinating pulse check on the people developing and using Apache projects,” Matt said of his experience. “It was one of those moments that made me stop and realize: this is where I belong.”
Matt was there to present his work with the programming language Go in the Apache Arrow ecosystem. When we sat down to talk with him, our intention was to focus on just that. But, as the discussion went on, Matt lit up when talking about the off-the-cuff conversations he had with people – in hallways, at lunch, in the speakers’ room, and at the bar during happy hour. So, in this post, you will learn about Go, and also gain a view into where Apache Arrow – the six-year-old open source project gaining significant traction – stands in the open source community today.
Tell us about your experience at ApacheCon. How did you get involved as a speaker?
ApacheCon was awesome! I’ve presented at conferences before, but this was my first time presenting at a conference in person, and I realized I definitely missed in-person conferences. I submitted a talk titled “Apache Arrow and Go: A Match Made in Data”. Lots of people talk about the Python module (pyarrow), the C++ library, and the Java library, but the Go module is lesser known. I wanted to bring attention to the ease and benefits of using Apache Arrow with Go.
Tell us about what’s happening with Go and Apache Arrow.
The initial implementation of the Go Arrow module was donated by InfluxDB several years ago. At my previous company, I found significant performance benefits to leveraging the Go Arrow module in the work I was doing, which led me to start contributing fixes and new features to it.
With the release of Arrow v10.0.0, the Go module now supports all of the Arrow format’s data types, the Apache Arrow Flight and FlightSQL protocols, and even has an extremely efficient Parquet implementation. My current goal is to successfully build an active Go data science community around the Arrow module.
What was the community’s reaction to your presentation?
Going into the conference, I realized that my talk was the only one dealing with Apache Arrow and also the only one that dealt with Go. So I figured I’d get one of two situations: either no one would attend or I’d have a full room, haha. People definitely were interested in the talk, but by the end, more people asked questions about Arrow itself than specifically about the Go implementation.
Interesting. Tell us more…
During the Q&A at the end of the talk, I got a few questions about using specific features of the C++ or Java Arrow libraries, or how some specific functionality is implemented in the libraries. Honestly, I was surprised by how few people were familiar with Apache Arrow. When you’re at a professional conference, you’re frequently introducing yourself to others with the usual, “where do you work?”, “what do you work on?”. As a result of all those in-between conversations, I got a strong sense of what people did or didn’t know about Arrow.
What didn’t people know about Arrow?
For starters, many people had heard of Arrow but didn’t know what it was. In general, Apache Arrow is two things: a columnar in-memory format and a collection of implementations of that format as libraries for a variety of languages. It’s particularly helpful for most use cases involving tabular data, ranging from analytics and ML workflows to distributed data processing systems.
If nothing else, I hope that during my conversations with people I was able to get Arrow’s versatility and use cases across. For instance, Arrow’s libraries have enabled many larger-than-memory workflows to be performantly run on local workstations rather than having to pay for cloud big data tools.
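To make the “columnar in-memory format” idea above a little more concrete, here is a minimal Go sketch of a single column: one contiguous buffer of values plus a validity bitmap in which a set bit marks a non-null slot, which is the general approach the Arrow format takes for nulls. This is an illustration only and uses just the standard library. The Float64Column type and its methods are hypothetical names invented for this example; they are not the Arrow Go module’s API, and the real format covers much more (offsets, nested types, alignment and padding).

```go
package main

import "fmt"

// Float64Column is a hypothetical, simplified column (NOT the Arrow Go API):
// all values sit in one contiguous buffer, and a separate validity bitmap
// records which slots hold real values (bit i set means values[i] is not null).
type Float64Column struct {
	values   []float64
	validity []byte
}

// NewFloat64Column builds a column from its input; a nil pointer marks a null.
func NewFloat64Column(in []*float64) Float64Column {
	col := Float64Column{
		values:   make([]float64, len(in)),
		validity: make([]byte, (len(in)+7)/8), // one bit per slot, rounded up
	}
	for i, v := range in {
		if v == nil {
			continue // value slot stays zero and its validity bit stays unset
		}
		col.values[i] = *v
		col.validity[i/8] |= byte(1) << uint(i%8) // least-significant bit first
	}
	return col
}

// IsValid reports whether slot i holds a non-null value.
func (c Float64Column) IsValid(i int) bool {
	return c.validity[i/8]&(byte(1)<<uint(i%8)) != 0
}

// NullCount returns the number of null slots in the column.
func (c Float64Column) NullCount() int {
	n := 0
	for i := range c.values {
		if !c.IsValid(i) {
			n++
		}
	}
	return n
}

// Sum adds up the non-null values by scanning the contiguous value buffer.
func (c Float64Column) Sum() float64 {
	total := 0.0
	for i, v := range c.values {
		if c.IsValid(i) {
			total += v
		}
	}
	return total
}

func main() {
	f := func(v float64) *float64 { return &v }
	col := NewFloat64Column([]*float64{f(1.5), nil, f(2.5), f(4.0)})
	fmt.Println("sum:", col.Sum(), "nulls:", col.NullCount())
}
```

Because an aggregate like Sum only touches one tightly packed buffer instead of hopping between whole records, this kind of layout tends to suit analytical scans well, and it is the kind of access pattern that vectorized hardware can exploit.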
How could the Open Source community benefit from using Arrow?
Most Apache projects discussed at ApacheCon are significantly older than Arrow. Apache Spark moved to the Apache Software Foundation in 2013, Cassandra was initially released in 2008, and Solr was released in 2004. Given that Arrow is comparatively recent, it’s easy to spot lots of opportunities for Arrow in the community. Even though Arrow’s primary use case is super efficient analytical operations that take advantage of modern hardware (using SIMD and optimizing for GPUs), its multitude of zero-copy interfaces provides fast data access and transport both within and between components of data systems.
The more projects that support Arrow as a standard, and potentially gain performance benefits by adopting it internally, the easier it will be for developers trying to integrate disparate systems together. From the open source perspective, having an efficient in-memory and transport format like Arrow reduces lock-in to technologies. It facilitates being able to mix and match components to utilize whatever makes the most sense for the use case without having to sacrifice performance to serialization and deserialization at every component boundary.
Reflecting back, what did you enjoy the most?
Before I left, my publisher sent me five copies of my book, “In-Memory Analytics with Apache Arrow”. So every morning, I threw one in my backpack with the intention to hand it out. This was my way of spreading the Arrow love.
Overall, I went there to educate people and make the Go Arrow library better known, and it turned out that it wasn’t just the Go library that needed to be known. It’s the Arrow format and libraries in general – and the work Voltron Data is doing to push standards forward with Arrow.
Matt at ApacheCon with a giveaway copy of his book.
Please tell us you signed the copies.
Of course!! I carried a pen with me every day so I could sign on the fly.
To learn more about Matt, you can follow him on Twitter or read his book “In-Memory Analytics with Apache Arrow”. We also recommend reading “Apache Arrow: driving columnar analytics performance and connectivity” by Wes McKinney, co-creator of Apache Arrow and Voltron Data’s CTO.