Compare data engines running 17GB of Parquet data on GPUs and CPUs
Blog | Single GPU Outperforms 88 CPU Cores by 2.5X
GPUs are essential for machine learning and AI at enterprise scale, and they are becoming increasingly critical for data preprocessing workloads. Recent benchmarking showed a single NVIDIA V100 GPU outperforming a total of 88 x86 CPU cores by ~2.5X. The same benchmark showed a single GPU running RAPIDS outperforming Spark on an 88-core cluster by 20X. Get the full results in this featured resource.
↳ Benchmarking Data Engines
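For a sense of how this kind of GPU-vs-CPU comparison is set up, here is a minimal sketch, not the benchmark itself: time the same Parquet read and groupby-aggregate with pandas on the CPU and RAPIDS cuDF on the GPU. The file path and column names are hypothetical stand-ins for the benchmark data.
import time
import pandas as pd
import cudf  # RAPIDS GPU DataFrame library

PATH = "trips_sample.parquet"  # hypothetical local sample of the trip data

# CPU pass with pandas
t0 = time.perf_counter()
cpu_df = pd.read_parquet(PATH)
cpu_result = cpu_df.groupby("vendor_id")["fare_amount"].mean()
cpu_s = time.perf_counter() - t0

# GPU pass with cuDF: same operations, same API shape
t0 = time.perf_counter()
gpu_df = cudf.read_parquet(PATH)
gpu_result = gpu_df.groupby("vendor_id")["fare_amount"].mean()
gpu_s = time.perf_counter() - t0

print(f"CPU (pandas): {cpu_s:.1f}s  GPU (cuDF): {gpu_s:.1f}s  speedup: {cpu_s / gpu_s:.1f}x")
Because cuDF mirrors the pandas API, the workload code stays identical across both passes; only the import changes.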
Replace Java-based engines with Velox for a 3X performance improvement
Report | Velox: 2023 Project to Watch
Velox is an embeddable columnar database engine designed for Arrow-based systems. As a modular, unified execution layer, it benefits any industry that uses or develops data management systems. Learn how Velox breaks down data silos and accelerates data processing.
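To make the "common execution engine" idea concrete, here is a toy Python sketch (not Velox code or its API): two different frontends lower their queries to the same plan representation, and one shared engine executes both.
from dataclasses import dataclass

@dataclass
class Plan:
    """A tiny stand-in for a logical query plan."""
    table: list
    predicate: str   # column to filter on
    value: object    # value the column must equal

class Engine:
    """Single shared executor, analogous in spirit to Velox's role."""
    def execute(self, plan: Plan) -> list:
        return [row for row in plan.table if row[plan.predicate] == plan.value]

def sql_frontend(table, where_col, where_val) -> Plan:
    """Stand-in for a SQL system (e.g. Presto) lowering a query to a plan."""
    return Plan(table, where_col, where_val)

def dataframe_frontend(table, col, val) -> Plan:
    """Stand-in for a dataframe system (e.g. Spark) lowering to the same plan."""
    return Plan(table, col, val)

rows = [{"id": 1, "city": "NYC"}, {"id": 2, "city": "SF"}]
engine = Engine()
print(engine.execute(sql_frontend(rows, "city", "NYC")))
print(engine.execute(dataframe_frontend(rows, "city", "NYC")))
The point of the pattern: each system keeps its own dialect and parser, but execution logic is written once and shared.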
Velox Data Thread Talk
Velox Blog
Velox Research Paper from VLDB
↳ Velox Data Infrastructure
“Data management systems like Presto and Spark typically have their own execution engines and other components. Velox can function as a common execution engine across different data management systems.”
Source: Facebook Engineering Blog. Diagram by Phillip Bell.
Use what you have locally, spin up cloud test environments, and deploy to production seamlessly
Blog | Scale From Local to Distributed
Ibis is a Python analytics library that integrates with over 18 data execution backends, enabling developers to write, execute, and test data workflows locally, then seamlessly scale up to production workloads on distributed backends. In this featured resource, see how Ibis makes this possible using DuckDB, Spark, and Amazon EMR.
import ibis
from ibis import _
import pandas as pd
from pyspark.sql import SparkSession
# Setup pandas
pd.set_option("display.width", 0)
pd.set_option("display.max_columns", 99)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", '{:,.2f}'.format)
# A. Get a Spark Session
spark = SparkSession \
    .builder \
    .appName(name="Ibis-Rocks!") \
    .getOrCreate()
# B. Connect the Ibis PySpark back-end to the Spark Session
con = ibis.pyspark.connect(spark)
# C. Read the parquet data into an ibis table
trip_data = con.read_parquet("s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet")
trip_data = trip_data.mutate(
    total_amount=_.base_passenger_fare + _.tolls + _.sales_tax + _.congestion_surcharge + _.tips
)
trip_summary = (trip_data.group_by([_.hvfhs_license_num])
    .aggregate(
        trip_count=_.count(),
        trip_miles_total=_.trip_miles.sum(),
        trip_miles_avg=_.trip_miles.mean(),
        cost_total=_.total_amount.sum(),
        cost_avg=_.total_amount.mean()
    )
)
print(trip_summary.execute())
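The "local first" half of that workflow looks nearly identical. Here is a minimal sketch, assuming a local sample of the same data (the file name is hypothetical), where only the connection line changes from PySpark to DuckDB:
import ibis
from ibis import _

# Swap the PySpark connection for an in-memory DuckDB one; the query is unchanged.
con = ibis.duckdb.connect()
trip_data = con.read_parquet("fhvhv_tripdata_sample.parquet")  # hypothetical local sample
trip_data = trip_data.mutate(
    total_amount=_.base_passenger_fare + _.tolls + _.sales_tax + _.congestion_surcharge + _.tips
)
trip_summary = trip_data.group_by([_.hvfhs_license_num]).aggregate(
    trip_count=_.count(),
    trip_miles_total=_.trip_miles.sum(),
)
print(trip_summary.execute())
Because Ibis expressions are backend-agnostic, promoting this script to the Spark version above is a matter of replacing the `connect` call.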
↳ Running Ibis on PySpark on EMR
Dig into additional technical resources on how to maximize flexibility and productivity in data system design.
Discover more on our blog and follow us on social.
Resources