FAQ & System Specs
FAQ
- How did you benchmark both systems?
- I want more details! How exactly did you benchmark both systems?
- Why did you choose benchmarks derived from TPC-H?
- Why did you only benchmark these two systems: Theseus and Spark?
- Are your benchmarks official?
- Why did you use a third party for your Spark benchmarks?
System Specs
In this report, we made a number of important decisions to create useful benchmarks to evaluate the performance of our proprietary engine software, Theseus.
Selected data systems
We benchmarked two data systems: Theseus and Apache Spark.
Theseus is a distributed, accelerated engine developed by Voltron Data (us). The founders of Voltron Data are also the founders of open source accelerated systems including BlazingSQL and RAPIDS. Theseus is designed to be an accelerator-native engine, meaning it was developed specifically to take advantage of GPUs. We ran the Theseus benchmarks internally, conducted by the Performance Engineering team at Voltron Data.
Apache Spark is an open source, distributed computing system that processes large amounts of data across many CPUs. It speeds up data processing by performing computations in memory and in parallel. Spark supports a variety of data analysis tools for SQL, machine learning, and streaming data. We contracted a third party, Concurrency Labs, to run our Apache Spark benchmarks. We did this to ensure that Spark was evaluated under the best possible conditions. For more details on how Concurrency Labs benchmarked Apache Spark, we refer you to their blog post.
These two systems were selected in order to directly compare our own distributed, accelerated engine (Theseus) to a distributed engine built for data processing on CPUs (Apache Spark).
We considered benchmarking other data systems but ultimately decided to exclude them from this write-up. Specifically, we considered Spark with Gluten, an open source plugin advertised as doubling Spark’s performance. We wanted to use Amazon EMR, as we did for our vanilla Spark benchmarks, but the managed service did not natively support Gluten. Alternative approaches presented insurmountable build issues, and we had to abandon our effort to benchmark Spark with Gluten.
If there is another system you want to see benchmarked, we invite you to run it yourself. Our technical appendix includes all the information on the data and queries we used, so you can set it up with your preferred system and see how it stacks up. If you do, please let us know: https://voltrondata.com/contact
Hardware
Theseus benchmarks were conducted on a 10-node Lambda Labs GPU cluster, with the following configuration per node:
- Motherboard: SuperMicro X12DGO-6
- Compute: 8 x Nvidia A100 SXM4 GPUs with 80 GB VRAM each
- Memory: 4 TB (32x Samsung M393AAG40M32-CAE 128 GB 3200 MHz ECC RAM)
The full Theseus system included the following networking and storage hardware:
- InfiniBand: 8 x Nvidia ConnectX-6 VPI Adapter Card HDR IB and 200GbE Single-Port QSFP56 PCIe4.0 x16
- InfiniBand: 2 x ConnectX-6 VPI Adapter Card HDR IB and 200GbE Dual-Port QSFP56 PCIe4.0 x16
- 2 x 3.84 TB | M.2
- 8 x 15.36 TB Samsung PM9A3 | U.2 | Gen4 | NVMe
- 8 x 15.36 TB Samsung MZQL215THBLA
This is the same system we use for our nightly benchmarking.
Spark benchmarks were conducted by Concurrency Labs. We selected Concurrency Labs because they specialize in cloud performance optimization. They opted to configure and optimize Spark on Amazon EMR, which has been finely tuned to maximize Spark’s performance by leveraging Amazon’s deep integration and optimization capabilities. Concurrency Labs ensured this was the best environment they could create for Spark execution in the cloud. For a detailed account of their methodology and the specific decisions they made, we refer readers to Concurrency Labs’ own documentation and analysis, available through their blog.
The Spark system was run on AWS, using r5.8xlarge instances, with:
- Compute: 32 cores per node
- Memory: 256 GB per node
Data and queries derived from TPC-H
The benchmarking report is derived from the TPC-H and as such is not comparable to published TPC-H results, as the benchmarking results reported here do not comply with the TPC-H Specification.
TPC-H is a decision support benchmark developed by the Transaction Processing Performance Council (TPC), aimed at evaluating the performance of database management systems (DBMS) in handling complex queries. TPC-H assesses the ability of a DBMS to process queries that are representative of real-world business applications. At a baseline, the benchmark measures the performance of a system through query response time, which we refer to as runtime. The queries cover a wide range of operations, including projecting, joining, sorting, and aggregating large volumes of data, reflecting real-world decision support systems and data warehouses.
TPC-H defines a set of 22 queries. These queries simulate the activities found in complex application environments. Database sizes for the benchmark tests range from a few gigabytes to several terabytes, providing a way to measure and compare the performance of systems as they scale up.
Measurements
We report three main variables:
- Runtime: The primary measure for the systems was runtime (in milliseconds). For Spark, we took the minimum runtime across three execution runs for each query. For Theseus, we took the runtime of a single execution of each query.
- Cluster cost per hour: We also used cluster cost per hour as a variable in our analyses and plots (see the technical appendix for costs).
- Cost per query (CPQ): We calculated the cost per query by multiplying each query’s runtime (converted from seconds to hours) by the cluster cost per hour; see the sketch below. For some plots, we summed these costs per query to generate cost per benchmark (across the set of 22 queries derived from the TPC-H).
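A minimal sketch of this calculation, assuming query runtimes in seconds are converted to hours before multiplying by the hourly cluster cost; the runtimes below are placeholders, not measured results:

```python
# Sketch of the cost-per-query (CPQ) and cost-per-benchmark calculation.
# Runtimes are in seconds, cluster cost is in USD per hour.

def cost_per_query(runtime_seconds: float, cluster_cost_per_hour: float) -> float:
    """Convert runtime to hours, then multiply by the hourly cluster cost."""
    return (runtime_seconds / 3600.0) * cluster_cost_per_hour

# Placeholder runtimes for a hypothetical benchmark run (one entry per query).
runtimes_s = [12.4, 8.1, 30.7]
cluster_cost_per_hour = 26.45  # e.g. 10 r5.8xlarge nodes (see the technical appendix)

cost_per_benchmark = sum(cost_per_query(r, cluster_cost_per_hour) for r in runtimes_s)
print(f"Cost per benchmark: ${cost_per_benchmark:.4f}")
```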
Why base benchmarks on TPC-H?
Background: TPC-H is a business-focused benchmark provided by TPC
The Transaction Processing Performance Council (TPC) is a group dedicated to ensuring honesty, fairness, and quality in benchmarks. TPC-H is only one of its benchmark suites, and it emphasizes business use cases. In their words:
“The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.”
Standardized benchmarks increase legibility and trust
There is still debate surrounding standardized benchmarks. They are often criticized for being too synthetic and not generalizing well. However, while real-world use cases offer direct insights into system performance, they are inherently specialized. Outside the specific use case, bespoke benchmarks complicate communication and impose unnecessary burdens on both readers and writers. Standardized benchmarks like TPC-H streamline communication, providing a common language that facilitates understanding and comparison.
TPC-H’s popularity and relevance enhance usability
TPC-H stands out among benchmarking suites for its broad recognition and business-centric approach. Its popularity promotes familiarity, making it easier for readers of this report to compare results. Additionally, its business focus aligns with the design of our Theseus engine as a product, which is optimized for petabyte-scale ETL, SQL, and dataframe workloads. By using TPC-H as the baseline for our derived benchmark, we capitalize on its familiarity and relevance, clearly demonstrating our system’s effectiveness and impact on real-world business issues.
Why did you use a third party for your Spark benchmarks?
Creating useful system comparisons requires configuring each system optimally, which often demands deep technical expertise in that specific framework. Recognizing our expertise with Theseus and our limited familiarity with Spark, we contracted with Concurrency Labs, who have deep experience optimizing Spark. This gave us confidence that the Spark results were measured under the best possible conditions and that our findings properly reflect each system’s capabilities.
Are your benchmarks official?
The benchmarking report is derived from the TPC-H and as such is not comparable to published TPC-H results, as the benchmarking results reported here do not comply with the TPC-H Specification.
Technical appendix
This appendix contains detailed methods for how we collected our benchmarking data presented in this report.
What is included
We include the following details for benchmarking both Theseus and Apache Spark:
- Hardware configurations
- Cluster costs per hour
- Engine parameters and version
- Data, schema, and queries
- Runtime commands (for Spark only)
Additionally, for Theseus, we detail the runtime conditions like:
- Cold start: Benchmarks are presented with no warm-up runs.
- From disk: All data derived from the TPC-H starts on the filesystem, in the open-source Parquet format.
- Unsorted: We do not sort or preprocess the data beyond loading it into Parquet files.
- No database-level caching: We do not use database-level caching for any of our Theseus runs.
The benchmarking report is derived from the TPC-H and as such is not comparable to published TPC-H results, as the benchmarking results reported here do not comply with the TPC-H Specification.
What is excluded
We exclude the following details:
- Source code and binaries for Theseus
- EMR binaries for Spark
However, we include information about the data, compilation, systems, and runtime commands wherever possible.
Spark
All Spark benchmarks were executed by Concurrency Labs, on the managed AWS EMR service.
Hardware configuration
All systems used AWS EMR with r5.8xlarge instances, arranged as 1 coordinator node plus a variable number of worker nodes. Concurrency Labs executed runs with 10, 20, 40, 60, 80, 100, and 200 worker nodes.
- For the 10TB tests, the default EBS configuration was applied: four 128 GB gp2 volumes per node.
- For the 30TB tests, there was 1 EBS volume attached to each node, with the following settings: VolumeType=gp3, SizeInGB=1000, Iops=3000, Throughput=512 MiB/s
The Hive metadata store, required for Spark execution on EMR, used one m4.large EC2 instance and was backed by a db.t3.medium AWS RDS MySQL 5.7.38 DB instance.
Cluster costs per hour
Spark cluster costs per hour were calculated with the default EMR configuration of 512 GB of gp2 EBS storage per node, in the N. Virginia region.
Hardware | Nodes | Cost per hour (USD $) |
---|---|---|
r5.8xlarge | 10 | 26.45 |
r5.8xlarge | 20 | 50.02 |
r5.8xlarge | 40 | 97.16 |
r5.8xlarge | 60 | 144.31 |
r5.8xlarge | 80 | 191.45 |
r5.8xlarge | 100 | 238.59 |
r5.8xlarge | 200 | 474.30 |
Software parameters and version
Spark was used with EMR version 6.11.0, Spark version 3.3.2, and Hive version 3.1.3. The Spark configuration provided to EMR was as follows:
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.uris": "thrift://<hive-metastore-url>:9083"
    },
    "Configurations": []
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
The Hive metadata store configuration was as follows:
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionUserName": "hivemetastore",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionPassword": "<db-password>",
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://<db-host>:3306/hive?createDatabaseIfNotExist=true"
    },
    "Configurations": []
  }
]
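For illustration only, a configuration like the Spark one above could be supplied when provisioning the cluster programmatically. The sketch below uses boto3; it is not Concurrency Labs’ actual provisioning code, and the cluster name, roles, and instance count are placeholders.

```python
# Illustrative only: pass an EMR configuration like the one above via boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-tpch-derived-benchmark",  # hypothetical cluster name
    ReleaseLabel="emr-6.11.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=[
        {
            "Classification": "spark-hive-site",
            "Properties": {"hive.metastore.uris": "thrift://<hive-metastore-url>:9083"},
        },
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        },
    ],
    Instances={
        "MasterInstanceType": "r5.8xlarge",
        "SlaveInstanceType": "r5.8xlarge",
        "InstanceCount": 11,  # 1 coordinator + 10 workers (placeholder)
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print(response["JobFlowId"])
```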
Source code/Binary files
We cannot provide the binaries as deployed on EMR because Amazon does not distribute them. However, for Spark 3.3.2 binaries and source code, please refer to the Apache archive: https://archive.apache.org/dist/spark/spark-3.3.2/
Query execution and measurement
Queries are executed using one spark-sql CLI command per query, such as the following:
spark-sql -S -f <location-of-query-file> --name <query-name>
Results are gathered from the YARN Timeline Server by subtracting, for each query, the timestamp recorded in the StartTime column from the one in the FinishTime column. These values match the execution timestamps recorded in the Spark History Server as well.
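As an illustration of that subtraction, the sketch below derives per-query runtimes from StartTime and FinishTime values, assuming the timestamps have been exported from the YARN Timeline Server as epoch milliseconds into a CSV file; the file name and column names here are hypothetical.

```python
# Hypothetical sketch: derive per-query runtimes (ms) from StartTime/FinishTime
# timestamps exported from the YARN Timeline Server as epoch milliseconds.
import csv

runtimes_ms = {}
with open("yarn_applications.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):
        name = row["Name"]  # e.g. "q01", "q02", ...
        runtimes_ms[name] = int(row["FinishTime"]) - int(row["StartTime"])

for query, runtime in sorted(runtimes_ms.items()):
    print(f"{query}: {runtime} ms")
```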
Optimization/System parameters
Excluded here; these parameters come prepackaged with EMR. You can find information on Amazon’s configurations here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html
Theseus
Runtime conditions
We would like to highlight the following conditions at runtime:
- Cold start: Benchmarks are presented with no warm-up runs. Execution begins with the first query in a set, and ends with query 22.
- Data does not start in memory: All data is read from disk, in the open-source Parquet format.
- Data is unsorted: No sorting is done before query execution.
- No caching: No database-level caching methods are used in execution.
Hardware configuration
Theseus benchmarks were conducted on a 10-node Lambda Labs GPU cluster, with the following configuration per node:
- Motherboard: SuperMicro X12DGO-6
- Compute: 8 x Nvidia A100 SXM4 GPUs with 80 GB VRAM each
- Memory: 4 TB (32x Samsung M393AAG40M32-CAE 128 GB 3200 MHz ECC RAM)
The distributed system at large has the following underlying hardware for networking and storage:
- InfiniBand: 8 x Nvidia ConnectX-6 VPI Adapter Card HDR IB and 200GbE Single-Port QSFP56 PCIe4.0 x16
- InfiniBand: 2 x ConnectX-6 VPI Adapter Card HDR IB and 200GbE Dual-Port QSFP56 PCIe4.0 x16
- 2 x 3.84 TB | M.2
- 8 x 15.36 TB Samsung PM9A3 | U.2 | Gen4 | NVMe
- 8 x 15.36 TB Samsung MZQL215THBLA
Cluster costs per hour
Theseus cluster costs per hour were estimated using cloud pricing from Lambda Labs:
Hardware | Nodes | Cost per hour (USD $) |
---|---|---|
A100 | 2 | 24 |
A100 | 4 | 48 |
A100 | 6 | 72 |
A100 | 8 | 96 |
A100 | 10 | 120 |
Software parameters and version
Theseus is under active development and is versioned daily based on the last commit. This report was compiled with data collected on September 2, 2024.
Source code/Binary files
Theseus is a proprietary data processing engine developed by Voltron Data, and the software for Theseus is closed source. We cannot provide source code or binaries.
Query execution and measurement
Queries are executed using an internal interface.
Runtime is measured by calling time.monotonic() before and after each query execution and taking the difference, roughly as in the sketch below.
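The timing pattern looks roughly like this; run_query stands in for the internal interface and is hypothetical.

```python
import time

def run_query(query_id: str) -> None:
    """Placeholder for the internal Theseus query interface (hypothetical)."""
    ...

def timed_run(query_id: str) -> float:
    """Return the wall-clock runtime of a single query execution in milliseconds."""
    start = time.monotonic()
    run_query(query_id)
    return (time.monotonic() - start) * 1000.0

# One cold-start execution per query, q01 through q22.
runtimes_ms = {f"q{n:02d}": timed_run(f"q{n:02d}") for n in range(1, 23)}
```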
Optimization/System parameters
The system uses UCX for networking and GDS (GPUDirect Storage) for file I/O. At the time of publication, we use CUDA version 12.3.
Data and schema
Data used in this set of tests corresponds to the 10TB, 30TB, and 100TB datasets derived from the TPC-H benchmark, stored in Parquet format. All tests for Theseus and Spark used the same datasets. The data comprises the 8 tables that are part of the benchmark derived from TPC-H.
Data was originally created and converted to Parquet format by Voltron Data’s engineering team, using zstd compression. Because the files are stored as zstd-compressed Parquet, the on-disk sizes are smaller than the scale factor suggests; the datasets were generated with scale factors 10000 (10TB) and 30000 (30TB) and reach these sizes in memory. Our Theseus tests further include a 100TB dataset, generated with scale factor 100000.
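As a rough illustration of the storage format only (the actual generation and conversion pipeline is internal), a table can be written to zstd-compressed Parquet with pyarrow; the input and output paths below are placeholders.

```python
# Illustrative only: write a table to zstd-compressed Parquet with pyarrow.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Placeholder input; the real tables were generated at scale factors 10000-100000.
table = pv.read_csv("lineitem.csv")
pq.write_table(table, "lineitem.parquet", compression="zstd")
```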
10TB Data
Table | Records | Approximate Table Size |
---|---|---|
customer | 1,500,000,000 | 260GB |
lineitem | 59,999,994,267 | 6.41TB |
nation | 25 | 3.1KB |
orders | 15,000,000,000 | 1.49TB |
part | 2,000,000,000 | 300GB |
partsupp | 8,000,000,000 | 1.1TB |
region | 5 | 1.9KB |
supplier | 100,000,000 | 20GB |
Total | 86,599,994,297 rows | ~10TB |
30TB Data
Table | Records | Approximate Table Size |
---|---|---|
customer | 4,500,000,000 | 780GB |
lineitem | 179,999,978,268 | 19.23TB |
nation | 25 | 3.1KB |
orders | 45,000,000,000 | 4.47TB |
part | 6,000,000,000 | 900GB |
partsupp | 24,000,000,000 | 3.3TB |
region | 5 | 1.9KB |
supplier | 300,000,000 | 60GB |
Total | 259,799,978,298 rows | ~30TB |
100TB Data
Table | Records | Approximate Table Size |
---|---|---|
customer | 15,000,000,000 | 2.6TB |
lineitem | ~600,000,000,000 | 64.1TB |
nation | 25 | 3.1KB |
orders | 150,000,000,000 | 14.9TB |
part | 20,000,000,000 | 3TB |
partsupp | 80,000,000,000 | 11TB |
region | 5 | 1.9KB |
supplier | 1,000,000,000 | 200GB |
Total | ~866,000,000,030 rows | ~100TB |
Queries derived from TPC-H
Queries were derived from the TPC-H v3.0.0 specification. The original query text made available by TPC had to be slightly modified in order to specify the schema for each table (for example, FROM <schema>.lineitem instead of FROM lineitem).
A LIMIT statement, as per the TPC-H specification, was added to the following derived queries: q02, q03, q10, q18, and q21. A SQL condition in q11 was updated based on the Scale Factor of the relevant dataset.
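The sketch below shows one way such a schema-qualification rewrite could be applied to the query files; the schema name, directory layout, and pattern are illustrative rather than the exact tooling used.

```python
# Illustrative sketch: prefix bare TPC-H table references with a target schema.
import re
from pathlib import Path

SCHEMA = "tpch_sf10000"  # hypothetical schema name
TABLES = "customer|lineitem|nation|orders|partsupp|part|region|supplier"
# Match a table name right after FROM, JOIN, or a comma in a FROM list.
PATTERN = re.compile(rf"(\bfrom\s+|\bjoin\s+|,\s*)({TABLES})\b", re.IGNORECASE)

for path in Path("queries").glob("q*.sql"):  # hypothetical layout
    sql = path.read_text()
    sql = PATTERN.sub(lambda m: f"{m.group(1)}{SCHEMA}.{m.group(2)}", sql)
    path.write_text(sql)
```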
To learn more about our report, check out the data analysis or read about our methodology.