Benchmarking Data Report
This report presents the performance of our GPU query engine, Theseus. Keep reading to dig into our SPACE (scale, performance, and cost efficiency) chart and see what the future of petabyte-scale data processing looks like.
Methodology highlights
Here is what you should know about how we benchmark our Theseus engine:
- Cold start: Benchmarks are presented with no warm-up runs.
- From disk: All data derived from the TPC-H starts on the filesystem, in the open-source Parquet format.
- Unsorted: We do not sort or preprocess the data beyond loading it into Parquet files.
- No database-level caching: We do not use database-level caching for any of our Theseus runs.
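To make the setup concrete, here is a minimal sketch of this cold, from-disk measurement pattern in Python, using DuckDB as a stand-in engine. Theseus itself is not driven this way; the file path and query are hypothetical.

```python
import time

import duckdb

def run_cold(query_sql: str) -> float:
    """Time one query on a fresh connection: no warm-up runs, no reuse."""
    con = duckdb.connect()  # fresh in-memory catalog; nothing pre-cached
    start = time.perf_counter()
    con.execute(query_sql).fetchall()  # force full result materialization
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed

# The data stays as unsorted Parquet on the filesystem; the engine scans
# it in place rather than ingesting it into a proprietary store first.
q1 = """
    SELECT l_returnflag, SUM(l_extendedprice) AS revenue
    FROM 'lineitem/*.parquet'
    GROUP BY l_returnflag
"""
print(f"cold run: {run_cold(q1):.2f}s")
```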
Scale breakdown: CPU vs GPU across data sizes
We first introduced this SPACE (scale, performance, and cost efficiency) chart on our methodology page.
But tl;dr: the x-axis is cluster cost per hour, so from left to right we move from cheaper to more expensive. Total runtime is on the y-axis, so from top to bottom we move from slower to faster. Bottom line: down and to the left means your dollar goes further – those configurations are faster AND cheaper.
Let’s break this one down, because it packs a lot of information in a single plot. You will notice that we have four curves plotted here: one for every system and data size combination.
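If you want to assemble a SPACE-style chart for your own systems, the recipe is simple. The sketch below uses matplotlib with placeholder points, not the measured values behind our chart.

```python
import matplotlib.pyplot as plt

# Illustrative placeholders: each curve is a series of
# (cluster cost per hour, total runtime) points for one
# system and data size combination.
curves = {
    "System A, 10TB": [(10, 60), (40, 30), (80, 25)],
    "System B, 10TB": [(30, 12), (60, 6), (120, 5)],
}
for label, points in curves.items():
    xs, ys = zip(*points)
    plt.plot(xs, ys, marker="o", label=label)
plt.xlabel("Cluster cost per hour (USD)")
plt.ylabel("Total runtime (minutes)")
plt.title("SPACE chart: down and to the left is faster AND cheaper")
plt.legend()
plt.show()
```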
100TB: Your dollar always goes further
There are two points we are excited to call out in this plot: Theseus at 8 nodes handling 100TB of data, and Theseus at 10 nodes handling it even better.
First of all, the obvious: Theseus at 10 nodes processing 100TB sits below both Spark curves. While it is close to the 10TB Spark line, that only means that for roughly the cost and runtime of processing 10TB with Spark, you can process 100TB with Theseus. Furthermore, it is always cheaper than processing 30TB of data with Spark. Your dollar is simply worth more with Theseus.
This is what an accelerator-native system can do — completely change the data scale so 100TB workloads look more like 10TB, leveraging GPUs to their fullest with surrounding infrastructure that makes them shine.
Per query breakdown
To learn more about the performance of Theseus at 100TB, explore these plots to see how different queries perform differently and how performance is affected by workload.
10TB: GPUs can do more with less
To put these costs in context: in 2024, a Big Mac in the United States costs $5.69, and a half gallon of organic milk costs $4.88. Running the full 10TB suite on Theseus costs $2.19 (on 2 nodes), while the cheapest Spark run costs $77.99 (on 20 nodes).
- Cost wall: With this plot, it is clear that Theseus always costs less than Spark. It is also easy to see Spark’s diminishing returns: from 10 to 100 to 200 nodes, each increment of spend buys a smaller performance improvement.
- Spark vs Spark: At the high end, extra speed quickly stops being worth the cost. For Spark at 10TB, execution time drops from 32 minutes at 100 nodes to 27 minutes at 200 nodes, while the cost leaps from $128.90 to $213.70. Ultimately, that is paying 66% more for under 20% more speed (the arithmetic is worked through after this list).
- Theseus vs Theseus: Scaling quickly reaches a limit, with barely noticeable cost increases for additional nodes: $2.19 to run all of the 10TB queries derived from the TPC-H on 2 nodes, $4.63 on 10 nodes. At this problem size, 2 nodes is adequate.
- Technical wall: Even with infinite money, Spark cannot scale its way over this wall. CPU performance is capped, and no amount of spend will lift it.
- Theseus vs Spark: In this test, Theseus beats Spark at every point. In terms of cost, 2 Theseus nodes complete the 10TB suite for $2.19, as opposed to 200 Spark nodes at $213.70. And there is no time tradeoff for the reduced cost: 2 Theseus nodes finish the 10TB queries derived from the TPC-H in 5 minutes, versus 27 minutes for 200 Spark nodes. That is a cost savings of 99% while still going 80% faster.
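The arithmetic behind those comparisons is simple enough to check by hand; this sketch recomputes it from the figures quoted above.

```python
# Figures quoted in the text above (10TB suite derived from the TPC-H).
spark_100 = {"minutes": 32, "cost": 128.90}   # Spark, 100 nodes
spark_200 = {"minutes": 27, "cost": 213.70}   # Spark, 200 nodes
theseus_2 = {"minutes": 5,  "cost": 2.19}     # Theseus, 2 nodes

# Spark vs Spark: diminishing returns at the high end.
extra_cost = spark_200["cost"] / spark_100["cost"] - 1        # ~0.66
more_speed = spark_100["minutes"] / spark_200["minutes"] - 1  # ~0.19
print(f"100 -> 200 Spark nodes: {extra_cost:.0%} more cost for {more_speed:.0%} more speed")

# Theseus vs Spark: cheaper AND faster.
savings = 1 - theseus_2["cost"] / spark_200["cost"]           # ~0.99
faster  = 1 - theseus_2["minutes"] / spark_200["minutes"]     # ~0.81
print(f"2 Theseus nodes vs 200 Spark nodes: {savings:.0%} cheaper, {faster:.0%} less runtime")
```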
30TB: It can pay off to scale up
When do you buy more hardware to accommodate larger data sizes? At 10TB, 2 Theseus nodes might be enough. At 30TB, matters become more interesting. Scale is needed for either system to quickly process this data.
- Minimum node counts: Theseus at 2 nodes takes 23.5 minutes to complete this workload, while Spark at 10 nodes takes roughly 12 hours.
- Maximum node counts: If we push the systems to their maximums, we land on a more reasonable time scale: 5 minutes for 10 nodes of Theseus versus 58 minutes for 200 nodes of Spark. The decision becomes even clearer when you compare costs: the Theseus run costs $10.10, while the Spark run costs $458.49.
30TB | System (node count) | Runtime | Total cost |
---|---|---|---|
Minimum node counts | Theseus (2 nodes) | 23.5 minutes | not reported |
Minimum node counts | Spark (10 nodes) | ~12 hours | not reported |
Maximum node counts | Theseus (10 nodes) | 5 minutes | $10.10 |
Maximum node counts | Spark (200 nodes) | 58 minutes | $458.49 |
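In ratio terms, a quick check from the maximum-node-count rows of the table above:

```python
# Maximum-node-count figures from the 30TB table above.
theseus = {"minutes": 5,  "cost": 10.10}   # Theseus, 10 nodes
spark   = {"minutes": 58, "cost": 458.49}  # Spark, 200 nodes

print(f"runtime: {spark['minutes'] / theseus['minutes']:.1f}x faster on Theseus")  # ~11.6x
print(f"cost:    {spark['cost'] / theseus['cost']:.1f}x cheaper on Theseus")       # ~45.4x
```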
Query breakdown: If you are doing large joins, GPUs can help
One of the most common questions we hear is “When do GPUs make a difference for data processing?” The answer: across basically every query we tested, though the size of the difference varies by query.
In the heatmap, you can focus on different types of queries derived from the TPC-H based on their primary operations. There are plenty of ways to categorize queries; we are sharing one that we find useful when comparing CPU and GPU systems. Simplified examples of each category follow the list.
- Heavy Joins: Tests the engine’s ability to pull and analyze data from multiple tables to create a final dataset, which strains parallel processing and a system’s ability to hold and move large volumes of data.
- Heavy Sorts: Like joins, sorts perform large data movement and distributed analysis; the main difference is that the algorithms perform more computation over a single table, as opposed to compiling data across multiple tables.
- Heavy Aggregations: Whereas sorts and joins emphasize data movement, stressing network and I/O, aggregation is intensely compute-heavy and relies almost exclusively on a system’s ability to process large volumes of data down to singular outputs.
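To make the categories concrete, here are simplified, TPC-H-flavored query shapes for each. These are schematic illustrations, not the actual benchmark queries.

```python
# Heavy join: multi-table data compilation; shuffle- and memory-heavy.
heavy_join = """
    SELECT c.c_name, SUM(l.l_extendedprice) AS revenue
    FROM customer c
    JOIN orders o   ON o.o_custkey  = c.c_custkey
    JOIN lineitem l ON l.l_orderkey = o.o_orderkey
    GROUP BY c.c_name
"""

# Heavy sort: global ordering over a single large table.
heavy_sort = """
    SELECT l_orderkey, l_extendedprice
    FROM lineitem
    ORDER BY l_extendedprice DESC
"""

# Heavy aggregation: compute-heavy reduction to a handful of output rows.
heavy_agg = """
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity), AVG(l_extendedprice), COUNT(*)
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus
"""
```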
Table: Query descriptions
The following table summarizes the 22 queries derived from the TPC-H. We include our query categories to help you interpret the heatmap. Note that where a join is followed by a number in parentheses, that number is how many joins are in the query.
Query Number | Short Title | Query Description | Key Operations | Category |
---|---|---|---|---|
1 | Pricing Summary Report Query | Reports the amount of business that was billed, shipped, and returned. | SUM, COUNT, AVG, WHERE, GROUP BY, ORDER BY | Heavy Aggregation |
2 | Minimum Cost Supplier Query | Finds the cheapest suppliers of a part in a specific region, considering discounts. | Subquery, INNER JOIN (7), WHERE, MIN, ORDER BY | Heavy Sort |
3 | Shipping Priority Query | Lists the most profitable customers in a given time period, focusing on customers who have placed a significant number of orders. | SUM, INNER JOIN (2), WHERE, GROUP BY, ORDER BY | Heavy Join |
4 | Order Priority Checking Query | Identifies orders that were placed but not yet shipped within a specific time frame, highlighting potential delays. | COUNT, WHERE, GROUP BY, ORDER BY | Heavy Join |
5 | Local Supplier Volume Query | Calculates revenue for each nation within a region, generated from sales, emphasizing the effectiveness of regional promotions. | SUM, INNER JOIN (5), GROUP BY, ORDER BY, INTERVAL | Heavy Join |
6 | Forecasting Revenue Change Query | Forecasts revenue for a given year based on discounts applied to sales of a particular item. | SUM, WHERE, INTERVAL | Heavy Aggregation |
7 | Volume Shipping Query | Determines the volume of goods shipped between nations within a certain time period, highlighting international trade patterns. | SUM, INNER JOIN (5), GROUP BY, ORDER BY | Heavy Join |
8 | National Market Share Query | Assesses the impact of national market share on a specific part type, focusing on how changes in one nation affect sales in others. | SUM, INNER JOIN (7), WHERE, GROUP BY, ORDER BY | Heavy Join |
9 | Product Type Profit Measure Query | Evaluates the profit generated from a particular part, considering the entire supply chain from raw materials to final sale. | SUM, INNER JOIN (5), GROUP BY, ORDER BY | Heavy Aggregation |
10 | Returned Item Reporting Query | Identifies the most profitable customers for a specific time period, based on the value of their orders. | SUM, INNER JOIN (3), GROUP BY, ORDER BY | Heavy Sort |
11 | Important Stock Identification Query | Ranks suppliers by the volume of parts they have provided, relative to the total volume available from that nation. | SUM, Subquery, GROUP BY, HAVING, INNER JOIN (2), WHERE, ORDER BY | Heavy Aggregation |
12 | Shipping Modes and Order Priority Query | Examines shipping modes and order statuses to determine the efficiency of the order-processing system. | SUM, WHERE, GROUP BY, INNER JOIN (1), ORDER BY, INTERVAL | Heavy Join |
13 | Customer Distribution Query | Counts the number of customers and their orders, focusing on those with specific characteristics that might indicate customer loyalty or issues. | LEFT OUTER JOIN (1), COUNT, GROUP BY, ORDER BY | Heavy Aggregation |
14 | Promotion Effect Query | Calculates the revenue impact of promotional discounts on sales within a certain period. | SUM, WHERE, JOIN (1), CASE WHEN, INTERVAL | Heavy Aggregation |
15 | Top Supplier Query | Identifies the top supplier in terms of revenue contribution over a specific time frame. | Subquery, WHERE, INTERVAL, GROUP BY, MAX, SUM, INNER JOIN (1), ORDER BY | Heavy Join |
16 | Parts/Supplier Relationship Query | Analyzes the demand for parts based on characteristics like brand, type, and size, identifying potential inventory adjustments. | COUNT, INNER JOIN (1), WHERE, GROUP BY, ORDER BY | Heavy Sort |
17 | Small-Quantity-Order Revenue Query | Examines the average quantity ordered for parts, highlighting potential issues with inventory levels or pricing strategies. | SUM, INNER JOIN (1), WHERE | Heavy Aggregation |
18 | Large Volume Customer Query | Lists large volume customers and their orders, focusing on the relationship between order size and customer importance. | SUM, INNER JOIN (2), GROUP BY, ORDER BY | Heavy Join |
19 | Discounted Revenue Query | Assesses the profitability of sales for three different types of products, considering various discounts and shipping modes. | SUM, WHERE (MULTIPLE AND’s), JOIN (1) | Heavy Join |
20 | Potential Part Promotion Query | Identifies suppliers with inventory to meet a specific large order, considering the lead time required for parts not in stock. | INNER JOIN (1), WHERE, ORDER BY, INTERVAL | Heavy Aggregation |
21 | Suppliers Who Kept Orders Waiting Query | Examines suppliers at risk of losing business due to delayed orders, focusing on supplier reliability. | COUNT, LEFT OUTER JOIN (3), WHERE, GROUP BY, ORDER BY | Heavy Join |
22 | Global Sales Opportunity Query | Identifies potential sales opportunities by matching customer lists against order histories, highlighting areas for marketing focus. | Aggregate functions (SUM, COUNT, AVG), WHERE, GROUP BY, ORDER BY | Heavy Join |
Bottom line: Cost savings
If you want to skip straight to the bottom line, this plot highlights the cost savings of Theseus over Spark. We invite you to explore the differences between the systems across queries at different node counts.
Conclusion
Performance is of course important, but it is not enough. This is one of the reasons we advocate for and build open, standards-based, composable systems that give you performance plus a system that is modular, interoperable, customizable, and extensible.
- No coupled interfaces: Projects like Ibis, Substrait, and SQLGlot provide interface abstraction, so code you write for Theseus also supports, and seamlessly ports to, many other query engines, databases, and data processing frameworks (a portability sketch follows this list).
- No proprietary storage formats: Being built on top of Arrow means that Theseus reads data where it lies, whether that is in file formats, databases, data warehouses, or anything else that supports Arrow and ADBC (Arrow Database Connectivity).
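As a rough illustration of that portability, here is a sketch using Ibis with its DuckDB backend plus SQLGlot. DuckDB stands in for Theseus here, and the Parquet path is hypothetical.

```python
import ibis
import sqlglot

# Write the expression once; Ibis compiles it for whichever backend
# you connect to. DuckDB is used here as a stand-in engine.
con = ibis.duckdb.connect()
t = con.read_parquet("lineitem.parquet")  # hypothetical local file
expr = (
    t.group_by("l_returnflag")
     .aggregate(revenue=t.l_extendedprice.sum())
)
print(expr.execute())

# SQLGlot transpiles dialect-specific SQL between engines.
print(sqlglot.transpile(
    "SELECT l_returnflag, SUM(l_extendedprice) FROM lineitem GROUP BY 1",
    read="duckdb",
    write="spark",
)[0])
```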
Leaders in the data analytics industry (Andy Pavlo, Bucky Moore, Jordan Tigani) are all saying the same thing: due to open standards, OLAP engines are being commoditized, and barring architectural differences, performance across engines will converge over time.
This is why we propose a fundamentally novel distribution strategy built for an accelerator-native future. For the biggest data challenges, building a composable data system means you can choose the right tool for the job.
When it comes to scale:
- For less than 2TB: We believe DuckDB and Arrow-backed projects like DataFusion and Polars make a lot of sense. This tier probably covers the majority of datasets in the world, and they run most efficiently on these state-of-the-art query systems (a minimal single-node sketch follows this list).
- For up to 30TB: Well-known data warehouses like Snowflake, Google BigQuery, and Databricks, and distributed processing frameworks like Spark and Trino, work wonders at this scale.
- For anything over 30TB: This is where Theseus makes sense. Our minimum threshold to move forward is 10TB queries (not datasets), but we prefer to operate where queries exceed 100TB. This is an incredibly rare class of problem, but if you are facing it, you know how quickly costs balloon, how often SLAs are missed, and how tenuously the data pipeline holds together.
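For the sub-2TB tier, for example, a single-node engine can query Parquet in place with no cluster at all. A minimal DuckDB sketch, with a hypothetical path:

```python
import duckdb

# Query unsorted Parquet directly on a single node; no ingestion step.
result = duckdb.sql("""
    SELECT l_returnflag, COUNT(*) AS rows, SUM(l_extendedprice) AS revenue
    FROM 'data/lineitem/*.parquet'
    GROUP BY l_returnflag
""").df()
print(result)
```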
If you are feeling data processing pain at petabyte scale, we are here to help.
For a deeper dive on how we collected the data presented in this report, check our About page to learn more or read about our methodology.