Velox: Enabling Modular and Composable Query Processing for the Apache Arrow Ecosystem

Wes McKinney Aug 31, 2022
Velox and Voltron Data

Voltron Data recently announced its collaboration with the new open source Velox project (you can also watch the Velox session from the Data Thread Conference). We aim to work together to enable modular, composable accelerated query processing that aligns well with the rest of the open source Apache Arrow ecosystem. Today, the Velox team announced its support for these efforts on the Engineering at Meta Blog.

Velox is a C++ vectorized database acceleration library that provides optimized columnar processing decoupled from any particular SQL or data frame front end, query optimizer, or storage backend. Velox has been designed to integrate with Arrow-based systems. Through our collaboration, we intend to improve interoperability while refining the overall developer experience and usability, particularly its support for Python development.

Since the mid-2000s, column-oriented data processing has become one of the most widespread and successful approaches to building scalable, cost-effective data processing systems in open source and commercial databases. Popular modern analytic SQL systems, such as BigQuery, ClickHouse, DuckDB, Redshift, Snowflake, and Vertica, were designed around column-oriented vectorized processing to achieve fast and resource-efficient query performance. Other systems have evolved to incorporate columnar processing as an accelerator, such as the Photon project for Apache Spark SQL.

While many systems have shown the performance benefits of columnar data processing on modern computing hardware, data interoperability issues and limited composability of data engines continued to plague the big data ecosystem. In response, a large group of open source developers created Apache Arrow in 2015 to develop technology standards that unify columnar data processing. Arrow provides a standardized, language-independent columnar data representation together with a collection of open source building blocks for creating high-performance columnar data processing systems. Arrow has helped simplify data interchange across programming languages and processing frameworks while enabling data processing engines to be more modular and reusable. Without a standardized columnar format, data must undergo costly serialization steps, and algorithms must often be rewritten for each engine.

We founded Voltron Data in 2021 to bring together developers working on accelerated computing technologies for the Apache Arrow ecosystem. In addition to making major investments in the Arrow ecosystem, we are developing enterprise solutions at the intersection of programming languages, computing hardware, and developer experience. To this end, we have also aligned our work with the ongoing innovation in columnar database systems. Voltron Data recently joined the DuckDB Foundation to fund integration work between DuckDB and Apache Arrow. Thanks to this work, the community can use DuckDB as a complementary, modular query engine with zero-copy data interchange with other Arrow-enabled libraries.

Velox is led by developers from Meta and collaborators including Ahana, Intel, and Voltron Data. It is a modular C++ framework designed as a set of high-performance, reusable and extensible data processing components. These components are used in engines such as PrestoDB and Spark for more efficient SQL processing. Velox’s routines are used for faster feature engineering and data pre-processing in PyTorch workloads. Velox’s goals focus on portability and reuse across different types of systems. These goals align well with the Arrow community’s ideas of modularity and composability. 

We strongly believe that the Apache Arrow and Velox developers will deliver modular and composable accelerated data querying solutions that help make processing columnar data more efficient. At Voltron Data, we look forward to seeing what the two communities achieve.