12 Open Source Projects to Watch in 2023 · Voltron Data
Significant investment continues to pour into the open source ecosystem, from individual contributors worldwide and from companies like Meta, Anaconda, NVIDIA, Google, and more. The pursuit of innovation across the data community is remarkable – and we are grateful for the spirit and energy toward constant improvement and innovation.
Voltron Data contributes to numerous open source projects including Apache Arrow, Substrait, Ibis, RAPIDS, Velox, and more. We deeply believe that supporting open source standards will ensure freedom and flexibility for the future of data systems.
As we look ahead to 2023, we want to spotlight 12 open source projects that are paving the way for innovation in the data analytics ecosystem. Some of these projects are well-established and others are emerging. All are on a mission to tackle pain points developers and engineers face in data science, machine learning, and other corners of today’s complex data analytics landscape.
To generate this list, we pooled insights and perspectives from Voltron Data’s leadership, product, engineering, and developer relations teams. We dug deep into project pages, community forums, and GitHub to identify the projects we think will fundamentally change the way we work with data. This list is not intended to be exhaustive. We hope it inspires you to discuss, honor, and uncover other exceptional work happening in the open source data community.
In 2022, PyTorch cemented its position as one of the most important and successful machine learning software projects in the open source landscape.
Keep an eye on TorchArrow
· · · · · ·
The TorchArrow library (currently in beta) focuses on data preprocessing and is optimized for hardware accelerators like GPUs. Importantly, it was built using Apache Arrow’s columnar memory layout with nested data support (such as string, list, map) and Arrow ecosystem integration.
Learn more:TorchArrow on Github
TorchArrow in 10 Minutes tutorial
PyTorch 2.0 was released in December 2022 at the inaugural PyTorch Conference, with massive performance improvements and new integrations. A key highlight is `torch.compile` which moves C++ operators back into Python for more performance and flexibility. Serious gains were made with PrimTorch which reduces ~2,000+ PyTorch operators down to ~250 primitives that developers can use to build a complete PyTorch backend.
We’re also watching what’s happening with shared memory pools. With the pluggable NVIDIA CUDA allocator backend users can change allocators dynamically during code execution. This is poised to solve a longstanding problem with memory allocation and efficiency; it makes RAPIDS Memory Manager (RMM) able to share memory pools across multiple libraries, specifically, RAPIDS, XGBoost, and PyTorch at this point.
Something important to celebrate is that PyTorch joined the Linux Foundation, ensuring neutrality and true community ownership. We expect this to support the project's growth and create more transparency and open governance for the framework.
This is a move we stand behind and look forward to watching the community continue to innovate in this space.
Ibis is a modular framework that helps developers connect to, transform, and pull data that can be represented as tabular data. Downloads for the project have skyrocketed over the last two years - and it may be on its way to becoming the standard tool for users who work with tabular data.
Ibis connects to
10+ QUERY ENGINES
This positions it as the data scientist API of choice for communicating across engines. The backends translate effectively across each other.
Reading and transforming files using DuckDB locally translates really well to the same operations and concepts you would use to read tables in Google’s BigQuery - and the way you write Ibis code across backends are portable.
In 2022, the community saw three Ibis releases. Some standout features include:
- Introduction of the DuckDB backend which makes it easier to work with local files and data
- Underscore API which allows users to easily chain Ibis expressions
- Memtables which increases the performance of aggregations on pandas DataFrames and can be joined to regular Ibis TableExpressions
Ibis 4.0 is slated for release in January 2023. With it will come features like the Read API, allowing users to quickly read tabular data files (CSVs, Parquet, Text) for fast exploratory analysis using DuckDB, and to_pyarrow methods, which allows users to output result sets to pyarrow types (such as Tables and RecordBatches).
Learn more about Ibis
- Ibis Project Page
- Ibis on GitHub
- Blog Series: “Ibis Explained” Part 1 and Part 2
- Use case video from The Data Thread by contributor, Marlene Mhangami: "Navigating the San Francisco Art Scene with Ibis" (15-minute watch)
- Use case video from The Data Thread by Ibis Product Manager, Patrick Clarke: "Querying Sales Data with Ibis" (15-minute watch)
Velox is a low-level technology with high-impact potential.
· · · · · ·
A rip and replace of Java-based engines with Velox yields a 3X performance improvement.
It was released by Meta in August 2022 with contributions by Ahana, Intel, and our team at Voltron Data. Velox uses an Apache Arrow-compatible memory layout and is designed to speed up data management systems and streamline development. It addresses feature engineering, data preprocessing, and other rapidly growing artificial intelligence (AI) and machine learning (ML) use cases.
What’s important to highlight is that Velox is built from the ground up to work on modern CPUs. This is crucial for the advancement of AI and ML. To date, you’ll see work done around the latest Intel CPUs with Advanced Vector Extensions (AVX).
There are a number of things that we appreciate about Velox. For one, the spirit of the project is to help prevent development churn. Velox centralizes compute, helping engineers connect with engines faster so they can focus their efforts on outcomes (for their apps, use cases, and customers) instead of reinventing the wheel on bespoke efforts.
Looking ahead, we’re excited to see that Velox is adding Substrait support which will enable a more standardized interface to tap into accelerated compute.
Learn more about Velox
- Velox on GitHub
- A introductory talk by Pedro Pedreira, Software Engineer at Meta, gave at The Data Thread: "Velox: An Open Source Unified Execution Engine" (22-minute watch)
- Blog: "Introducing Velox: An open source unified execution engine"
- Research Paper from VLDB: "Velox: Meta’s Unified Execution Engine"
DuckDB is an innovator in the data management space — if you haven’t heard of it by now, you certainly will soon. It is often described as a SQLite for analytics workflows, but offers embeddable, column-oriented data storage using vectorized processing to optimize OLAP workloads. That’s a mouthful. We like that the project offers zero-copy data integration with Apache Arrow, enabling rapid analysis for larger-than-memory datasets.
Developers are embracing DuckDB because it is easy to install, lightweight, and provides fast analytics. The stack is CPU-based and designed for single node. Because it also makes language library integrations simple, we’re seeing it pull in users from other traditional relational database systems like MySQL and Postgres.
· · · · · ·
MotherDuck is a new startup built entirely on DuckDB which recently raised $47.5M in Series A funding.
Some developments over this past year that excite us include:
- Data compression improvements result in reduced data set size by 75-95% with higher performance
- Integration with Postgres provides the ability to query Postgres tables directly
- New features like better date and time support for temporal analytic applications
Learn more about DuckDB
- DuckDB Project Page
- DuckDB on GitHub
- A talk Thomas Mock, Posit PBC, gave at The Data Thread: "Efficient Data Analysis on Larger-than-Memory Data with DuckDB and Arrow" (10-minute watch)
RAPIDS is a collection of open source analytics, machine learning, graph analytics, and visualization libraries that use Pythonic interfaces to expose C++ primitives for low-level compute optimizations on NVIDIA GPUs. At its core, the RAPIDS C++ dataframe library, libcuDF, is Arrow-native and designed from the ground up to leverage the Arrow columnar, in-memory data format to deliver efficient and fast data interchange. As explained in the Relentlessly Improving Performance blog post, RAPIDS is laser-focused on driving data preprocessing performance and has been doing so for a few years.
To extract valuable information from string data, RAPIDS libcuDF stores string data in device memory using Arrow format to accelerate string data transformations by more than 10X pandas. Some notable improvements that we’d like to acknowledge include the addition of user defined functions (UDFs) with strings.
10X faster performance
from efficient string UDFs in RAPIDS libcuDF
As mentioned in the PyTorch section, there have also been enhancements to the RAPIDS Memory Manager (RMM) library, which adds memory interoperability support with PyTorch. For those unfamiliar, RMM creates and shares a single memory pool on the GPU to optimize for computation latency issues. This new integration eases the handoff between libraries and allows for memory reuse between systems like RAPIDS, PyTorch, and XGBoost.
For data engineers and data architects, RAPIDS enabled zstd compression in ORC and Parquet readers and writers. More often than not, companies are adopting Zstandard as the way to compress their stored data.
RAPIDS Radar: Accelerating Geospatial Data Analytics
· · · ·
The RAPIDS library, cuSpatial, has made significant enhancements to the GeoSeries class. The base code was refactored to leverage Apache Arrow’s union array format, providing support for structure of array data storage for mixed geometry types and efficient access for GPU algorithms. For each of the sub-geometry types in the union array, cuSpatial now stores them in GeoArrow format. This is an extension data type based on Apache Arrow’s variable-size list layout. cuSpatial is the first implementation of GeoArrow on GPU.
Data analysts and data engineers feel the pain when it comes to the lack of standardization across their data stacks. Too often, data manipulation APIs and DSLs do not interoperate across compute engines, and compute engines do not interoperate across lower-level components such as compute primitives, query optimizers, hardware acceleration libraries, and computational storage systems.
Substrait is a cross-language, interoperable specification for data compute operations. It connects analysis tools with compute engines, and compute engines with underlying compute components, providing a standard, flexible way for all these layers of the stack to speak a common language. When paired with Apache Arrow, a universal standard for representing tabular data, users can achieve better performance in a way that is modular and composable.
The Substrait project has grown tremendously in 2022. The core specification — based on Protocol Buffers — is well established and can be represented as binary or JSON. Bindings and other integrations have been created in eight languages: C++, C#, Go, Java, Python, R, Ruby, and Rust. A formal governance structure has been created to ensure that the project can meet the needs of a diverse group of stakeholders.
In 2023, we expect to see more work done to make Substrait a mature, stable standard.
Three areas of the Substrait project we're watching:
· · · · · ·
We hope to see more functions defined in the core Substrait specification — arithmetic functions, string manipulation functions, aggregates, conditionals, and more — building on the substantial set of functions that is already defined there. Having a rich set of functions defined in the core spec will help applications minimize the use of custom extensions, enabling Substrait plans to be truly interoperable.
Query Engine Advancements
· · · · · ·
One big area of focus for 2023 is giving Apache Arrow-native query engines like Acero, DataFusion, DuckDB, Velox, and more the ability to speak Substrait, receive plans, execute them, and return Arrow-formatted results.
Python and R Compilers
· · · · · ·
We’re also tracking Substrait compilers. There is an Ibis Substrait compiler in the works which will allow Python users to write code in an analytics DSL, as well as a dplyr Substrait compiler for R users — both designed to execute compute plans on Substrait-compatible engines.
Sundeck, one of the companies contributing to Substrait, is also working on a Java library called Isthmus that bridges Substrait and Apache Calcite to enable users to translate SQL queries to Substrait plans. And the DuckDB project has followed suit, creating an extension that makes it possible to generate Substrait plans from DuckDB SQL queries, no JVM required.
Substrait: Rethinking DBMS Composability Presentation at VLDB 2022
Learn more about Substrait
- Substrait Project Page
- Substrait on GitHub
- Blog: "Introducing Substrait: An Interoperable Data to Engine Connector”
- A talk that Jacques Nadeau (Substrait originator and Sundeck CEO) gave at a workshop at the VLDB conference: “Substrait: Rethinking DBMS Composability” (30-minute watch)
- A talk that Ian Cook, Product Manager at Voltron Data, gave at The Data Thread: “Arrow and Substrait: Better Together” (15-minute watch)
Let us be clear about Apache Arrow: it is so much more than a data format. It is a toolbox (dare we say, treasure chest) for solving data problems.
Last year, Apache Arrow was downloaded millions of times - hitting 70 million downloads in one month - which is a testament to its robust capabilities and use.
Monthly PyArrow Downloads in Million
Three Apache Arrow projects that stand out:
Arrow Database Connectivity (ADBC)
This year, the Apache Arrow community started work on ADBC, a vendor-agnostic API for using databases with Arrow data, to fill in a gap in the ecosystem that JDBC/ODBC doesn’t quite cover. ADBC provides a generic API for executing queries, fetching metadata, and more that uses Arrow data. Database drivers then translate these calls for the underlying database. Arrow-native databases can pass through data without conversion, while drivers for other databases can convert the data. In either case, the application just gets Arrow data.
ADBC is new, and work is still ongoing. In 2023, we expect to see drivers for more databases, and better optimization of existing drivers.
ADBC - the API to connect clients with databases,
engines, and storage
Source: Dr. Alison Hill, Director of Knowledge, Voltron Data
Apache Arrow Flight SQL
Apache Arrow Flight SQL was released this year as a new client-server protocol for interacting with SQL databases. It makes use of the Arrow in-memory columnar format and the Flight RPC framework. By using Arrow Flight SQL, a database can serve ADBC, JDBC, and ODBC users all from a single endpoint. Thanks to drivers developed by the Arrow community, there’s no need to write your own drivers, either.
In 2023, we expect to see more feature support in the JDBC/ODBC drivers, and better support for catalogs in the protocol.
Arrow Flight SQL - 20x faster compared to PyODBC
Source: Dr. Alison Hill, Director of Knowledge, Voltron Data
Apache Arrow DataFusion
Arrow-native data engines like DataFusion showcase Arrow’s ability to help developers stand up and build integrated data engines that serve specific needs using common tools and standards. This ensures interoperability up and down the stack. We’re seeing Datafusion adoption pick up. Projects like ROAPI are working on query frontends to translate SQL, GraphQL, and REST API queries into DataFusion plans. Also, Influxdata recently announced their new InfluxDB IOx storage engine which is built on top of Arrow and DataFusion.
Learn more about Apache Arrow
- Apache Arrow Project Page
- Apache Arrow on GitHub
- Blog: "Simplifying Database Connectivity with Arrow Fight SQL and ADBC”
- A talk Randy Zwitch gave at The Data Thread: “All in on Apache Arrow” (7-minute watch)
- A talk Henry Ehrenberg, Co-Founder of Snorkel AI, gave at The Data Thread: “Powering Data-Centric AI with Arrow” (22-minute watch)
- A talk David Li, Software Engineer at Voltron Data, and James Duong, Senior Staff Software Engineer at Improving, gave at The Data Thread: “Arrow Flight SQL: Accelerating Database Access” (19-minute watch)
- A talk a Andrew Lamb, Staff Engineer at InfluxData, gave at The Data Thread: “Apache Arrow and Data Fusion: Changing the Game for Implementing Database Systems” (15-minute watch)
Polars is pushing the bar for how much you can get done and how much data you can use on your own computer.
Ritchie Vink launched Polars as a fast dataframe library built on Apache Arrow that utilizes all available cores on developers’ machines. Written in Rust, Polars is designed for the parallelization of queries on dataframes. The community is embracing it as an alternative to pandas because it solves many Python developers' problems when it comes to memory limitations. It is a performant, single-node engine that helps developers scale in the face of massive datasets.
Polars TPC-H Benchmark Results Including Reading
Learn more about Polars
- Polars Project Page
- Polars on GitHub
- Blog: "The Expressions API in Polars is Amazing”
- A presentation Ritchie Vink gave at Databricks’ Data + AI Summit 2022: "Polars: Blazingly Fast DataFrames in Rust and Python” (37-minute watch)
Redpanda is an open source project that truly puts performance and standards at its core. They’re on a mission to bring simplicity to the complexity of high-performing streaming systems – and are succeeding. Redpanda is eclipsing Kafka as a simple, fast alternative.
lower tail latencies on existing hardware
fewer resources used
Redpanda is written in C++ using a thread-per-core model that maximizes utilization of modern hardware. They are rebuilding Kafka using vectorized C++ primitives to drive performance and plumb back up the stack in useful ways. Redpanda is the under the hood streaming engine for many intelligent edge applications and tools. We see them as being aligned with the work Apache Arrow and Velox are doing: it’s low-level, high-impact work that puts performance, standards, and interoperability at the core.
Learn more about Redpanda
While not a new player on the scene, Hugging Face continues to be a serious innovator in the machine learning (ML) space. Particularly, the platform enables users to work with Arrow datasets and tools like NumPy, pandas, PyTorch, and TensorFlow.
Hugging Face as a number of robust libraries and these three are ones we’re particularly excited about:
· · · · · ·
Hugging Face datasets use Apache Arrow for memory mapping, allowing them to be backed by an on-disk cache for faster lookup.
· · · · · ·
Accelerate enables the same PyTorch code to run across any distributed configuration with just four lines of code. It allows developers to train and use PyTorch models with multi-GPU, TPU, mixed precision.
· · · · · ·
Transformers provides APIs and tools that train state-of-the-art pre-trained ML models. This has many benefits for users, including the ability to reduce compute costs, carbon footprint, and the time/resources required to train models from scratch.
Hugging Face is more than modern tools, though. It is a vibrant community of data scientists, researchers, and ML engineers dedicated to advancing AI. We applaud their knowledge sharing; their documentation is Grade-A and they have a well-oiled YouTube Channel full of demos and tutorials.
Today, organizations like Meta, Intel, Google, Microsoft, and more use Hugging Face in their workflows, and we suspect adoption across the enterprise landscape will only increase in the future.
Just landing on the scene from Google, Malloy is a new query language for relational algebra and is gaining interest in the data analytics space outside of mainstream programming languages.
Malloy is experimental, giving data practitioners new(ish) ways to compose nested data. Each query a developer writes is a building block to unlock the next level of understanding — so the more you build, the more you understand.
What we like about Malloy is that it is built for composability. It does this through:
Learn more about Malloy
A Project Ahead of its Time
· · · · · ·
Numba started about 10 years ago with a focus on performance – before performance was cool.
Compared to some of the other open source projects on the list, Numba is the “elder” of the bunch. But it rightfully belongs here. The interoperability that Numba brings to the ecosystem supports a very broad set of use cases that are still relevant and useful.
Numba takes Python code and compiles it to give you the speed advantage of writing in Python while giving you the execution speed and parallelism of writing in C. It’s compatible with NumPy and CuPy and offers memory management that you can use for managing GPU and CPU arrays.
Numba continues to innovate. Recently, Numba added on disk caching support for NVIDIA CUDA kernels. Just-in-time (JIT) compilation can be computationally expensive, but caching the result can significantly reduce overhead and lead to much better performance overall. Numba stands the test of time, it’s reliable and continues to get the job done.
Learn more about Numba
- Numba Project Page
- Numba on GitHub
- A talk John Murray, Director of Fusion Data Science, gave at The Data Thread "Using Arrow with Numba Kernels to Generate AI Workflows" (25-minute watch)
- Blog "Extending Numba’s CUDA target with the high-level API"
Voltron Data Founders Look Ahead
In the coming year, we believe enterprises should start abandoning bespoke development efforts from scratch and turn to open source projects that fundamentally change the way data analytics are done. Let’s face it, companies using distributed machine learning (ML) systems are in a tough position. With the adoption of hardware accelerators, system performance for training has skyrocketed, while a majority of data preprocessing is hamstrung by legacy software systems that can’t leverage accelerators. The disparity is undeniable. As the divergence widens, ML training capacity will be limited by the amount of data companies can preprocess.
This challenge isn’t unique. Companies like Meta, Netflix, GE Healthcare, and more are investing millions to develop purpose built systems to solve this gap. These bespoke solutions are very difficult to build and expensive to maintain. And get this: many of these companies are running into many of the same problems, building similar systems to tackle similar problems. Wouldn’t it make sense to all work from a common set of standards and tools? This could save time, money, and valuable resources.
We believe that open source standards drive system interoperability, modularity, and freedom. And we aren’t alone. Open source projects like Velox, Substrait, Apache Arrow, and RAPIDS, to name a few, are fundamentally trying to move the industry in the same direction – language agnostic, high-performance software that can leverage modern hardware more effectively. Adopting these standards lessens the burden of building preprocessing systems from scratch and focuses on the core business. To learn more about this new wave of standards and the projects we support, check out our Resources or watch talks from The Data Thread.
We understand that working with open source tools can sometimes be difficult and challenging. From answering technical issues on GitHub to partnering on developing core system technology, we are here to help as little and as much as you require. Please reach out to firstname.lastname@example.org with any questions or needs you may have.