When people think about query engines and databases, they typically expect some support for concurrent queries, since large pipelines and multiple requests often hit the engine simultaneously.
In the past we adopted a policy of one session to one cluster (a group of GPU workers) per query, since the GPU workers were so fast. But spinning up new GPU clusters has setup costs, and customers sometimes want to submit large batches of queries to the engine runtime all at once, whether for one large workload or on behalf of multiple users. Coupled with our improved performance at smaller scale, this means concurrency has become increasingly consequential for our users.
To solve this, we created an abstraction inside Theseus that lets us define multiple concurrency policies (a rough sketch of the idea follows the list below). The first three are:
- FIFO - the tasks of a particular query are prioritized by when that query arrived.
- Round Robin - tasks are prioritized in a round-robin rotation across the active queries.
- Small Queries First - the tasks of smaller queries are prioritized so they get processed and out of the way.
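To make the policy abstraction concrete, here is a minimal sketch in Python. It is an illustration only, assuming a priority-queue scheduler in which each task carries its query's arrival order and an estimated size; the names (`Scheduler`, `size_hint`, the policy strings) are hypothetical, not Theseus internals.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: tuple                        # computed by the active policy
    query_id: int = field(compare=False)
    arrival: int = field(compare=False)    # arrival order of the owning query
    size_hint: int = field(compare=False)  # estimated query size (hypothetical)

class Scheduler:
    """Priority-queue scheduler with pluggable concurrency policies."""

    def __init__(self, policy: str = "fifo"):
        self.policy = policy
        self._heap: list[Task] = []
        self._turns: dict[int, int] = {}   # per-query turn counters for round robin

    def _priority(self, query_id: int, arrival: int, size_hint: int) -> tuple:
        if self.policy == "fifo":
            # FIFO: every task of a query inherits the query's arrival order.
            return (arrival,)
        if self.policy == "round_robin":
            # Round Robin: a query's nth task sorts by n, so tasks from
            # different queries interleave instead of one query draining first.
            turn = self._turns.get(query_id, 0)
            self._turns[query_id] = turn + 1
            return (turn, arrival)
        if self.policy == "small_first":
            # Small Queries First: smaller queries sort ahead; arrival breaks ties.
            return (size_hint, arrival)
        raise ValueError(f"unknown policy: {self.policy}")

    def submit(self, query_id: int, arrival: int, size_hint: int) -> None:
        task = Task(self._priority(query_id, arrival, size_hint),
                    query_id, arrival, size_hint)
        heapq.heappush(self._heap, task)

    def next_task(self) -> Task | None:
        # Workers pull the lowest-sorting (highest-priority) task next.
        return heapq.heappop(self._heap) if self._heap else None
```

The useful property of this shape is that a new policy is just another way of computing a priority tuple, which keeps the scheduler itself unchanged as we experiment with strategies.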
Concurrency testing and benchmarking results are typically presented in a results report (e.g., page 3 of the HPE SF10K TPC-H results report), so we adopted a similar report structure for our concurrency results.
Concurrency Results
Specs
- Compute - 37 x AWS g6.4xlarge instances (NVIDIA L4 GPUs)
- Data/Storage - AWS S3 using unsorted Parquet files with zstd compression
TPC-H SF10,000 (10TB) Experiment
While Theseus is not limited to AWS, this experiment was run on AWS. A power run of TPC-H SF10K completed in 403s, so running it 9 times sequentially would take 3,627s at best.
| Power Run | Runtime (s) | Runtime x 9 (s) |
|---|---|---|
| 1 | 403 | 3,627 |
FIFO Concurrency
As you can see, running 9 streams simultaneously provides a 10.96% improvement on average, and some streams run almost 30% faster.
| | Runtime (s) | % Better than Power Run x 9 |
|---|---|---|
| Average Concurrent Run | 3,229.64 | 10.96% |
What this means is that there is leftover GPU, networking, and IO capacity in Theseus on AWS, and that we can process large workloads with hundreds or thousands of queries even more efficiently.
| Concurrency Stream | Runtime (s) | % Better than Power Run x 9 |
|---|---|---|
| 1 | 3,379.99 | 6.81% |
| 2 | 3,077.37 | 15.15% |
| 3 | 3,608.19 | 0.52% |
| 4 | 3,365.73 | 7.20% |
| 5 | 3,598.78 | 0.78% |
| 6 | 3,370.95 | 7.06% |
| 7 | 2,757.30 | 23.98% |
| 8 | 2,572.95 | 29.06% |
| 9 | 3,335.45 | 8.04% |
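As a quick sanity check, the summary numbers follow directly from this table. The short Python snippet below, using only figures from this post, recomputes the average concurrent runtime and its improvement over the sequential baseline.

```python
# Per-stream runtimes (s) from the table above.
streams = [3379.99, 3077.37, 3608.19, 3365.73, 3598.78,
           3370.95, 2757.30, 2572.95, 3335.45]

baseline = 9 * 403  # 9 sequential power runs at 403 s each = 3,627 s

avg = sum(streams) / len(streams)
improvement = (baseline - avg) / baseline * 100

print(f"average concurrent run: {avg:.2f} s")            # ~3,229.6 s
print(f"better than power run x 9: {improvement:.2f}%")  # ~10.96%
```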
Conclusion
This demonstrates that concurrency is not only supported in Theseus, but can dramatically improve performance. On average, Theseus delivered a nearly 11% improvement across all streams, with a peak benefit of almost 30% for the top-performing streams. Support for concurrent users and jobs removes idle GPU cycles, improves efficiency, and reduces costs. We’ll continue to develop new optimizations, run more experiments, and add concurrency tests to our nightly benchmarking.