Relentlessly Improving the Performance of our GPU Query Engine, Theseus

Voltron Data

September 10, 2024

Theseus is Built for Accelerated Systems

NVIDIA does not sell GPUs; it sells accelerated systems, combining processors, networking, storage, and software to win every benchmark.

It’s for this reason that Theseus, our distributed query execution engine, is accelerator-native. To take full advantage of NVIDIA hardware, we also integrate a lot of NVIDIA software (e.g., RAPIDS libcudf, NVComp, GPUDirect Storage, and more), gaining outsized performance and a long list of opportunities to get better.
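
As a rough illustration of the kind of GPU primitives this stack exposes (this is not Theseus code; Theseus uses the C++ libcudf library directly, and the file and column names below are made up for the example), here is a minimal RAPIDS cuDF snippet that runs a columnar filter and hash aggregation entirely on the GPU:

    # Illustrative only: the Python cuDF layer over the same libcudf kernels.
    # File and column names are placeholders borrowed from TPC-H lineitem.
    import cudf

    # Read a Parquet file straight into GPU memory as a columnar table.
    lineitem = cudf.read_parquet("lineitem.parquet")

    # Filter and aggregate on the GPU -- the same class of columnar operations
    # (scan, filter, hash aggregate) that a distributed engine composes.
    result = (
        lineitem[lineitem["l_discount"] > 0.05]
        .groupby(["l_returnflag", "l_linestatus"])
        .agg({"l_quantity": "sum", "l_extendedprice": "sum"})
        .reset_index()
    )

    print(result.head())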

Performance Over Time

We announced Theseus to the world almost a year ago at HPE Discover in Barcelona, and at NVIDIA GTC 2024 in March this year we showed our first public benchmarks and our views on honest benchmarking. Leading up to GTC, and continuing ever since, Theseus’ performance has been rapidly improving.

At Voltron Data we take benchmarking seriously. Our pre-merge benchmarks include a full TPC-H SF10K (10TB) suite, and nightly we benchmark TPC-H SF100K (100TB) along with many other workloads to ensure the engine only gets faster and cheaper.
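
A nightly run like this boils down to a simple loop: execute each TPC-H query against the engine, record wall-clock time, and fail the run if any query regresses against a stored baseline. The sketch below shows the shape of such a loop; it is not our actual harness, and the connection helper, query directory, and baseline file are placeholders:

    # Minimal sketch of a nightly TPC-H benchmark loop (not Theseus' harness).
    import json
    import time
    from pathlib import Path

    QUERY_DIR = Path("tpch_queries")        # q1.sql ... q22.sql (placeholder)
    BASELINE = json.loads(Path("baseline.json").read_text())  # {"q1": 41.2, ...}
    REGRESSION_TOLERANCE = 1.05             # fail if >5% slower than baseline

    def run_suite(conn):
        """Run every query file once and record its wall-clock runtime."""
        results = {}
        for sql_file in sorted(QUERY_DIR.glob("q*.sql")):
            start = time.perf_counter()
            conn.execute(sql_file.read_text())   # placeholder engine call
            results[sql_file.stem] = time.perf_counter() - start
        return results

    def check_regressions(results):
        """Fail the nightly job if any query slowed down past the tolerance."""
        slower = {q: t for q, t in results.items()
                  if t > BASELINE.get(q, float("inf")) * REGRESSION_TOLERANCE}
        if slower:
            raise SystemExit(f"Regressions detected: {slower}")
        print(f"Total runtime: {sum(results.values()):.1f}s, no regressions")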

Ten Terabytes (SF10K)

If we look at the total benchmark runtimes for TPC-H SF10K (10TB) on 2 servers (DGX A100 640GB), you can see that since GTC 2024 performance has almost doubled, going from 9.5 minutes to under 5.5 minutes and bringing the cost of a benchmark run down from $3.54 to $2.19.

[Figure: “10 TB Performance Over Time – 2 DGX A100 80GB.” Scatter plot of TPC-H SF10K runtimes (seconds) from January to July 2024. Runtimes start around 1,500 seconds and drop steadily, stabilizing below 500 seconds by April 2024; NVIDIA GTC 2024 is marked on the timeline.]
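
For readers who want to check the cost figures, a run’s cost is simply runtime × number of servers × hourly price per server. Back-solving from the numbers above gives an implied price of roughly $11 per DGX node-hour; this rate is inferred here for illustration, not a published price:

    # Sanity-check the reported 10TB (SF10K) cost figures.
    # cost = (runtime_minutes / 60) * num_servers * price_per_server_hour
    num_servers = 2  # DGX A100 640GB

    # Implied per-server hourly price, back-solved from the pre-GTC numbers.
    implied_rate = 3.54 / (9.5 / 60 * num_servers)   # ~$11.2 per node-hour

    # Apply the same rate to the improved runtime.
    new_cost = (5.5 / 60) * num_servers * implied_rate
    print(f"implied rate ~${implied_rate:.2f}/node-hr, new run ~${new_cost:.2f}")
    # ~$2.05, close to the reported $2.19; the gap comes from rounded runtimes.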

One Hundred Terabytes (SF100K)

We’re even more excited about the changes we’ve seen in TPC-H SF100K (100TB) on 10 servers (DGX A100 640GB). Not only did we almost double our performance, we also improved memory usage enough to scale down the number of servers needed for a full 100TB run.

[Figure: “100 TB Performance Over Time – 10 DGX A100 80GB.” Scatter plot of TPC-H SF100K runtimes (seconds) on 10 nodes (80 GPUs) from March to September 2024. Early runs sit above 4,000 seconds; around NVIDIA GTC 2024 runtimes drop and stabilize between 1,000 and 2,000 seconds.]

These performance gains brought the cost of a 100TB run down from $86 to $43.44 at 10 nodes.
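
The same formula works in reverse: given the cost and the node count, you can estimate the per-run runtime and check it against the chart above, again using the roughly $11 per node-hour rate inferred from the 10TB figures:

    # Estimate the 100TB (SF100K) runtime implied by the reported cost.
    num_servers = 10
    implied_rate = 11.18  # $/node-hour, inferred above (an assumption)

    runtime_hours = 43.44 / (num_servers * implied_rate)
    print(f"~{runtime_hours * 3600:.0f} seconds per run")
    # ~1,400 seconds, consistent with the 1,000-2,000 second band in the chart.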

What’s Next

We still have a long way to go with accelerator-native performance, and you can expect to see more posts like this in the coming months where we continue to push the state-of-the-art.

We’ve been doing this for a long time, and our CEO said it perfectly in the original Relentlessly Improving Performance post 4 years ago.

“By considering both software and hardware holistically, the RAPIDS team is making the vast potential of accelerated computing accessible to data practitioners across industries and institutions.”
-Josh Patterson, Co-Founder and CEO of Voltron Data

We couldn’t agree more. To see the latest Theseus benchmarks, please visit: voltrondata.com/benchmarks/theseus
