Alleviating S3 Bottlenecks and Boosting I/O Performance

Like many enterprises, Voltron Data customers rely on AWS S3 for data storage. Accessing S3 efficiently and cost-effectively is challenging as data pipelines grow in size and complexity. In this blog post, we discuss our recent work and results in alleviating S3 bottlenecks to boost I/O performance.

The Numbers

Before the cloud performance work, we ran baseline Theseus TPC-H benchmarks on g5.4xlarge AWS instances, which resulted in a total runtime of 4518 seconds for all 22 queries. Over two months, adding metadata caching, prefetching, and the new RESTful data source improved our TPC-H performance on the same infrastructure from 4518 seconds to 2249 seconds. That’s a 2X improvement, or roughly half the total runtime, in the same cloud environment. More importantly, as we become more familiar with cloud architectures, we are finding quick wins more often.

Theseus TPC-H SF10K benchmarks comparing before and after cloud performance improvements

Continue reading to learn more about how we achieved these results.

Metadata Filtering and Caching

When running queries, Theseus leverages file metadata by pushing down predicates to the metadata layer, which enables it to read only the files it needs. Unfortunately, metadata collection in S3 is expensive. So now, instead of naively listing or reading full file metadata at runtime, we capture metadata snapshots and cache them during table creation. This approach is particularly beneficial in production workloads with frequent queries over large partitions or evolving datasets. The cached snapshots reside in a lightweight layer per table, accelerating file discovery and pruning without adding metadata collection overhead at query time.
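To make the pruning idea concrete, here is a minimal sketch of predicate pushdown against a cached metadata snapshot. It assumes each snapshot entry stores per-file minimum and maximum values for one column; the `FileStats` type, the `prune_files` function, and the file names are hypothetical and are not part of Theseus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical snapshot entry: min/max of one column for one file."""
    path: str
    min_value: int
    max_value: int


def prune_files(snapshot, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in snapshot
        if f.max_value >= lower and f.min_value <= upper
    ]


# Snapshot captured once (for example at table creation), then reused per query.
snapshot = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Predicate: column BETWEEN 150 AND 220. Only overlapping files are scanned.
selected = prune_files(snapshot, 150, 220)
print(selected)
```

Because the snapshot is held in memory, this check needs no S3 request; only the files that survive pruning are ever opened.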

To avoid repeated downloads across queries, we’re designing a disk-backed cache for prefetched bytes. When a query touches the same table or file segments, Theseus can read from the local cache rather than S3. This caching layer mirrors patterns in other high-performance systems but is tailored to our byte-range prefetcher and RESTful data source (described below). By invalidating entries only when underlying data changes, we ensure correctness and reduce I/O costs.
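One way such a cache could work is sketched below. This is an illustration under stated assumptions, not the design described above: the `DiskRangeCache` class is hypothetical, and it assumes a version tag (such as an object ETag) is available so that changed data maps to a new cache entry instead of a stale hit.

```python
import hashlib
import tempfile
from pathlib import Path


class DiskRangeCache:
    """Hypothetical disk-backed cache for byte ranges of remote objects."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key, version, start, end):
        # The version tag is part of the cache key, so a changed object
        # never matches an entry written for the old contents.
        name = f"{key}|{version}|{start}-{end}".encode()
        return self.root / hashlib.sha256(name).hexdigest()

    def get(self, key, version, start, end, fetch):
        """Return cached bytes, calling fetch(start, end) only on a miss."""
        path = self._path(key, version, start, end)
        if path.exists():
            return path.read_bytes()
        data = fetch(start, end)
        path.write_bytes(data)
        return data


# Stand-in for a remote read; records each call so misses can be counted.
blob = bytes(range(256))
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    return blob[start:end + 1]  # inclusive end, like an HTTP byte range


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskRangeCache(tmp)
    first = cache.get("table/part-0.parquet", "v1", 0, 15, fake_fetch)   # miss
    second = cache.get("table/part-0.parquet", "v1", 0, 15, fake_fetch)  # hit
    third = cache.get("table/part-0.parquet", "v2", 0, 15, fake_fetch)   # new version: miss

print(len(calls))
```

Keying on the version rather than deleting entries eagerly keeps invalidation simple; a real cache would also need an eviction policy to bound disk usage.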

New RESTful S3 Data Source

S3 read performance often depends on network latency and client implementation. To address this, we developed a proprietary “RESTful S3” data source that uses Boost libraries to communicate directly with S3 through its REST APIs, achieving higher throughput. Although initial authentication fixes to RAPIDS cuDF (e.g., a patch to KvikIO) produced modest gains, the underlying Curl implementation still underperformed. The RESTful S3 source delivers significantly faster data transfer into host or device memory. Future iterations will extend authentication compatibility and add optimizations for GCP and Azure object stores.
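For readers unfamiliar with what talking to S3 over REST looks like, the sketch below builds a ranged GET request for part of an object. It is only an illustration in Python: the actual data source is proprietary and built on Boost, the `build_range_get` helper and the bucket and key names are hypothetical, and request signing (AWS Signature Version 4) is omitted, so this request would only succeed as-is against a publicly readable object.

```python
from urllib.parse import quote
from urllib.request import Request


def build_range_get(bucket, region, key, start, end):
    """Build (but do not send) a GET for bytes [start, end] of an S3 object."""
    # Virtual-hosted-style S3 URL; quote() keeps "/" in the key intact.
    url = f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
    # The HTTP Range header uses inclusive byte offsets.
    headers = {"Range": f"bytes={start}-{end}"}
    return Request(url, headers=headers, method="GET")


req = build_range_get(
    "example-bucket", "us-east-1", "tpch/lineitem/part-0.parquet", 0, 1023
)
print(req.get_method(), req.full_url, req.get_header("Range"))
```

Issuing many such range requests concurrently, instead of reading whole objects sequentially, is a common way to raise object-store throughput.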

Prefetching Framework

Synchronous data downloads during scans can slow processing. To overlap I/O and compute, we built a flexible prefetching framework that launches asynchronous tasks to fetch data ahead of processing, so the GPU stays busy instead of waiting on synchronous I/O. Decompression and decoding occur on the GPU, while bytes are prefetched and cached in host memory before they’re needed. Leveraging new Arrow Parquet reader APIs, we added a byte-range prefetcher. It determines which byte ranges of a Parquet file are necessary for upcoming scans and downloads them in advance into pinned memory buffers, so that they are ready to be loaded efficiently onto the GPU. Even without the RESTful data source, this reduces wall-clock time by overlapping network fetches with CPU/GPU parsing. Combined with a faster data source, byte-range prefetching ensures the engine rarely stalls on I/O.
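The overlap of fetching and processing can be sketched with a small lookahead window. This is a simplified illustration, not the Theseus framework: `prefetching_scan`, its `depth` parameter, and the in-memory `blob` standing in for a remote Parquet file are all hypothetical, and the real system places bytes in pinned host memory for the GPU, which plain Python cannot show.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetching_scan(ranges, fetch, process, depth=2):
    """Process byte ranges in order while up to `depth` fetches run ahead."""
    results = []
    remaining = iter(ranges)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        # Prime the window: start the first `depth` downloads immediately.
        for byte_range in remaining:
            pending.append(pool.submit(fetch, byte_range))
            if len(pending) >= depth:
                break
        while pending:
            data = pending.popleft().result()  # blocks only if not yet fetched
            upcoming = next(remaining, None)
            if upcoming is not None:
                # Refill the window before processing, so the download
                # runs while the current chunk is being processed.
                pending.append(pool.submit(fetch, upcoming))
            results.append(process(data))
    return results


# Stand-in for a remote file and the byte ranges a scan would need.
blob = bytes(range(64))
ranges = [(0, 15), (16, 31), (32, 47), (48, 63)]


def fetch(byte_range):
    start, end = byte_range
    return blob[start:end + 1]  # inclusive end offset


chunks = prefetching_scan(ranges, fetch, process=lambda data: data, depth=2)
print([len(c) for c in chunks])
```

The `depth` value bounds how much memory is held by in-flight downloads; a larger window hides more latency at the cost of more buffered bytes.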

Wrap Up

By combining metadata filtering, a high-performance RESTful data source, and an extensible prefetching framework, Theseus continues to advance cloud-based analytics performance, resulting in a 2X performance improvement.

These enhancements empower data engineers to run GPU-accelerated queries at scale with minimal I/O bottlenecks across any major cloud provider. We look forward to sharing detailed benchmarks as more PRs are pushed and features land.

Acknowledgements

We would like to acknowledge and thank the following engineers for their contributions: Amin Aramoon, Supun Kamburugamuve, Joost Hoozemans, Pradeep Garigipati, Matt Topol, and Ahmet Uyar.