Methodology
A Better Way to Benchmark
What is benchmarking?
Benchmarking data systems is part art, part science. The science lies in the actual metrics we measure. These metrics are straightforward, standard, and generally unsurprising. The art is…everything else.
Unfortunately, all of this usually leads to a plot like this:
FIGURE 01: Adapted from Fair Benchmarking Considered Difficult
If you have been burned by this kind of benchmark in the past, you are not alone. Benchmarking bake-offs like this do not help people think about the right things when it comes to evaluating data system performance. This makes benchmarking feel less like an art and a science, and more like a sport. But it does not have to be this way.
Architecture matters: Benchmarking apples and oranges
The biggest problem we see with benchmarks is that they often pit fundamentally different systems against each other, like comparing apples to oranges.
Benchmarking single-node systems is straightforward: you can install two engines, quickly compare runtimes, and eyeball the winner of the shortest-column contest. But data size will quickly push you past what a single node can handle, toward systems that are distributed across multiple nodes, accelerated by GPUs, or both.
Once you move on to accelerated systems that rely on heterogeneous hardware for performance gains, comparisons get messy. Different hardware executes different workloads at different speeds, and at different costs. When “shared nothing” systems run on different hardware, no direct comparison is entirely fair. GPU-based data processing systems will never share the traits we have come to expect CPU-only systems to share.
Yet there is an art to benchmarks that can bridge different hardware architectures. We are not claiming to have solved every problem with benchmarking, but we have put a lot of work into finding a better way.
ROI matters: All benchmarks are wrong but some are useful
People (who are not vendors) need benchmarks to estimate the return on investment (ROI) for big data system decisions.
To get a handle on a data system’s ROI, customers need to evaluate efficiency, or how well a data system can make use of resources already in place. Efficiency has two primary pieces:
- Hardware efficiency - typically quantified as total cost of ownership (TCO).
- People efficiency - difficult to quantify with a single number, but can be estimated with proxy measures like productivity, time to insights, time spent waiting, etc.
The path to optimizing for efficiency is unique to each business and difficult for any vendor to pin down. Vendors do have one lever that affects both: system performance. Without good performance, existing resources, both tech and people, are not used to their full potential. A performant system foundation is critical for efficiency and ultimately ROI. And in contrast to the subjective measures of ROI and people efficiency, performance is straightforward to measure.
Context matters: Performance is more than a number
Given how crucial performance is for efficiency, it should be treated as an important feature in its own right when evaluating any data system. But for performance numbers to be useful, you need to know how performance was measured. Here are the facts you need to set the context for understanding performance:
- Hardware: At a minimum, this is a list of the compute hardware in a given node, the memory size and type, and the disk size and type. In a distributed system, the number of nodes and the type of network switch (e.g., Ethernet, InfiniBand, Slingshot) are also important. Model numbers can help, but are often not shared.
- System configurations: Every hardware setup has configurations that impact performance. For accelerator-native systems, the most impactful ones include the use of GPUDirect Storage for file I/O to the GPU and advanced transport protocols like UCX over the network switches.
- Benchmark configuration:
  - Task: In a database benchmark like TPC-H, TPC-DS, or YCSB, the task is a set of queries built around specific requirements or a theme. For example, TPC-H focuses on business queries for decision support.
  - Data: For a given database benchmark, the data size is a scalable parameter. In the TPC-H benchmarks, this parameter is called the scale factor. Popular points for evaluation include 1 TB, 10 TB, 30 TB, and 100 TB.
- Measurements: Once a benchmark runs, it yields the time required to complete each individual task, which can be summed to get a total runtime. Some suites include further metrics by default, such as TPC-H with its Composite Query-per-Hour Performance Metric. Further refinements include computing measures of center for the runtimes, such as the average time across all tasks in the suite, or combining runtimes with secondary values like cost per hour to create derived metrics.
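To make the measurement definitions above concrete, here is a minimal sketch, in Python, of how per-task runtimes roll up into a total runtime and a simple derived cost metric. The per-query timings and the hourly price are made-up placeholders, not measurements from this report.

```python
# Illustrative only: fabricated per-query runtimes (seconds) for a 22-query suite.
per_query_seconds = [12.4, 8.1, 30.7, 5.2] + [10.0] * 18  # 22 entries total

total_runtime_s = sum(per_query_seconds)                    # total runtime for the suite
mean_runtime_s = total_runtime_s / len(per_query_seconds)   # one measure of center

# A simple derived metric: combine runtime with an assumed cluster cost per hour.
cluster_cost_per_hour_usd = 42.0                            # placeholder price, not a real quote
total_cost_usd = cluster_cost_per_hour_usd * (total_runtime_s / 3600)

print(f"total runtime: {total_runtime_s:.1f} s")
print(f"mean per-query runtime: {mean_runtime_s:.1f} s")
print(f"cost to complete the suite: ${total_cost_usd:.2f}")
```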
Making really useful benchmarks
Performance is relatively easy to measure, but that is just the science part. The harder part is getting the art right: framing performance data within its appropriate context so comparisons and conclusions are accurate. If a classic two-bar “perf plot” bake-off is all you get, it will be hard to make good design decisions about your data system.
To actually be useful, vendors (like us) need to do more work behind the scenes to figure out:
- What other systems can we validly measure?
- How can we make useful comparisons between systems?
- How can we give readers enough context to make better decisions based on the benchmarks?
We have selected three metrics that we find most useful to measure and compare:
- Runtime
- Cost per hour: Paying for your infrastructure
- Scale up costs: Paying for your infrastructure to cover your runtime
Runtime
We have all seen benchmarking plots like this:
FIGURE 02: Two-bar charts convey incredible results.
This type of plot is really easy to game, and we are going to show you how. We actually collected a lot more runtime data than we are showing in the column plot above:
- We collected three runs per query for Spark, summed the query times across all 22 queries derived from the TPC-H query set, and then took the minimum across runs to come up with this single number.
- Next, we cherry-picked the number of nodes for each system to maximize the difference between system runtimes.
In fact, the only node count the two systems share is 10 nodes. But all nodes are not created equal. One GPU node is not comparable to one CPU node, in either performance or cost. With “shared nothing” data systems, comparing variables like node count does not make sense.
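And on the runtime side, the summary choice alone can move the headline number quite a bit. Here is a small sketch with fabricated run totals (not data from this report) showing the min-versus-median arithmetic.

```python
import statistics

# Illustrative only: fabricated total suite runtimes (seconds) for three repeated
# runs of the same 22-query workload on the same cluster.
run_totals_s = [5210.0, 4875.0, 5090.0]

best_run_s = min(run_totals_s)                  # the "best run" number a two-bar plot might headline
median_run_s = statistics.median(run_totals_s)  # a more representative summary of the same data

print(f"minimum total: {best_run_s:.0f} s")
print(f"median total:  {median_run_s:.0f} s")
```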
Cost per hour: Paying for your infrastructure
Everyone cares about cost. Most people care more about cost than about node counts, because different nodes can incur wildly different costs. Comparing costs lets you cut across these kinds of architectural details, because all dollars are created equal.
FIGURE 03: More bars for more data.
This is more meaningful than the first plot. But, for big data system design decisions, this plot only gets you so far: it looks like you can “just” pay more for speedups by scaling up with either system (hopping column groups from left to right), but this is not the full story. Looking ahead, there are two possibilities for scaling up:
- Pay for more nodes per hour, and because of the total runtime gains, the total cost is lower.
- Pay for more nodes per hour, but the system simply cannot deliver runtime gains that make the extra investment worth it.
This plot will not help you either way. What you need is to plot runtime against costs, so you can see how investing in scaling up impacts your bottom line.
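The arithmetic behind that bottom line is simple: total job cost is the hourly cluster price multiplied by the time the cluster runs, so a pricier cluster can still be cheaper overall if it finishes fast enough. Here is a minimal sketch with assumed prices and runtimes, not measurements from this report.

```python
def total_job_cost(cost_per_hour_usd: float, runtime_hours: float) -> float:
    """Total cost to complete a workload: hourly cluster price times elapsed time."""
    return cost_per_hour_usd * runtime_hours

# Assumed, illustrative configurations (not real prices or measured runtimes):
smaller_cluster = total_job_cost(cost_per_hour_usd=50.0, runtime_hours=4.0)   # $200 total
larger_cluster = total_job_cost(cost_per_hour_usd=120.0, runtime_hours=1.0)   # $120 total

# Scaling up only pays off when the runtime drops enough to offset the higher hourly rate.
print(f"smaller cluster: ${smaller_cluster:.0f}, larger cluster: ${larger_cluster:.0f}")
```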
Scale up costs: Paying for your infrastructure to cover your runtime
We need to plot total runtime versus a normalized cost per hour for a particular cluster configuration. We call this the SPACE (scale performance and cost efficiency) chart.
FIGURE 04: The SPACE chart.
To make this chart, we plot cluster cost per hour (in USD) on the horizontal axis and total runtime across all queries in our test (the 22 queries derived from TPC-H) on the vertical axis. We organize and label our data points by node count. Here is how to read this plot:
- Left is cheaper
- To the left and top means cheaper but slower (high runtime)
- To the left and bottom means cheaper and faster (short runtime)
- Right is pricier
- To the right and top means more expensive and slower (high runtime)
- To the right and bottom means more expensive but faster (short runtime)
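For readers who want to build this kind of chart from their own measurements, here is a minimal plotting sketch. The system names, node counts, prices, and runtimes below are placeholders, not results from this report; only the axis choices follow the description above.

```python
import matplotlib.pyplot as plt

# Placeholder data points: (node count, cluster cost per hour in USD, total runtime in seconds).
# Replace with your own measurements; none of these values are results from this report.
system_a = [(2, 60.0, 900.0), (4, 120.0, 520.0), (8, 240.0, 310.0)]
system_b = [(10, 45.0, 7200.0), (50, 225.0, 2600.0), (200, 900.0, 1900.0)]

fig, ax = plt.subplots()
for label, points in [("System A", system_a), ("System B", system_b)]:
    costs = [cost for _, cost, _ in points]
    runtimes = [runtime for _, _, runtime in points]
    ax.plot(costs, runtimes, marker="o", label=label)
    for nodes, cost, runtime in points:
        ax.annotate(f"{nodes} nodes", (cost, runtime))  # label each point by node count

ax.set_xlabel("Cluster cost per hour (USD)")  # left is cheaper, right is pricier
ax.set_ylabel("Total runtime (s)")            # bottom is faster, top is slower
ax.legend()
plt.show()
```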
This type of plot normalizes a key variable that cannot be cleanly compared across architectures: node count. With this plot, we can see two new patterns:
- Cost wall: If cost savings are your focus, you can clearly see that while you can always pay for more scale in either system, there is a law of diminishing returns. This is most salient for Spark, where you can see runaway costs as the top curve stretches out past $450 per hour at the maximum node count (200). At 10 TB, Theseus at even the minimum GPU node count of 2 eclipses the best runtime we can squeeze out of Spark if we go all in on cost.
- Technical wall: This plot also highlights that, even with infinite money, scale cannot solve Spark's problem. Spark at 200 nodes clearly outperforms Spark at 100 nodes in raw runtime, but performance then hits a technical wall: CPU performance is capped, and no amount of money can jump over it. A GPU-based system like Theseus, starting with just 2 nodes, already delivers runtime gains that are simply not possible with Spark at any price.
This is how benchmarks can still be useful, even when comparing apples to oranges.
Using our own useful benchmarks
Over the years of building our data processing engine, Theseus, we found this way of calculating and presenting benchmarks useful for answering our toughest questions. How would we know if we were actually making a useful engine? How would we know if we were actually improving it? How would we track whether our valuable engineering resources were actually making a difference for future customers and end users?
We built a continuous benchmarking infrastructure that includes:
- Nightly benchmarking: Our “nightlies” are a report that runs every night and is published internally company-wide. Nightlies benchmark Theseus overnight across a range of workload sizes, types, and configurations.
- Continuous benchmarking: This is at the commit level, including automated tracking of improvements and regressions.
- Telemetry reporting: We continuously collect telemetry data to address and understand regressions at the query level.
- Feature benchmarking: For developers to test and evaluate possible new features, we have a system to easily configure experimental benchmarks.
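As one sketch of what commit-level regression tracking can look like, the example below flags any query whose runtime exceeds a stored baseline by more than a tolerance. The query names, baseline numbers, and threshold are illustrative assumptions; this is not our actual pipeline.

```python
# Illustrative regression check; the baselines, queries, and threshold are assumptions.
BASELINE_S = {"q01": 14.2, "q06": 3.8, "q21": 41.5}  # baseline runtimes in seconds
TOLERANCE = 1.10                                     # flag anything more than 10% slower

def find_regressions(current_s: dict) -> list:
    """Return the queries whose current runtime exceeds baseline * TOLERANCE."""
    return [
        query
        for query, runtime in current_s.items()
        if query in BASELINE_S and runtime > BASELINE_S[query] * TOLERANCE
    ]

# Example nightly result (fabricated): q21 regressed, the others held steady.
print(find_regressions({"q01": 14.0, "q06": 3.9, "q21": 47.0}))  # -> ['q21']
```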
Our benchmarking system has evolved alongside our product. What we started with looks very different from what we have today. As our product has gotten better, we have also come up with better ways of presenting and explaining our performance improvements. The metrics and visualizations we have refined, like our SPACE chart, have strengthened our confidence that we are continually improving Theseus. We also used our internal benchmarking methods to build this report.
To learn more about our report, check out the data analysis or read about our methodology.