Spotlight

Accelerated Computing is the Next Big Bet

First mover advantage will go to those who synchronize open standards with accelerated hardware.

Our team has extensive hardware knowledge to help businesses maximize the use of CPUs, GPUs, FPGAs, networking, and storage solutions with projects like RAPIDS, Velox, UCX, and more.

Compare data engines running 17GB of Parquet data on GPUs and CPUs


Blog | Single GPU Outperforms 88 CPUs by 2.5X


GPUs are essential for machine learning and AI at enterprise scale, and they are becoming increasingly critical for data preprocessing workloads. Recent benchmarking showed a single NVIDIA V100 GPU outperforming a total of 88 x86 CPU cores by ~2.5X. The same benchmark showed a single GPU running RAPIDS outperforming Spark on an 88-core cluster by 20X. Get the full results in this featured resource.


↳ Benchmarking Data Engines
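For a sense of what GPU-accelerated preprocessing looks like in practice, here is a minimal sketch using RAPIDS cuDF (not the benchmark code itself). It assumes a CUDA-capable machine with RAPIDS installed; the file path and column names are hypothetical.

import cudf

# Read Parquet directly into GPU memory (hypothetical file)
df = cudf.read_parquet("trips.parquet")

# Typical preprocessing, all on the GPU: filter, derive a column, aggregate
df = df[df["trip_miles"] > 0]
df["cost_per_mile"] = df["total_amount"] / df["trip_miles"]
summary = df.groupby("vendor_id").agg(
    {"trip_miles": "count", "cost_per_mile": "mean"}
)

print(summary)

Because cuDF mirrors the pandas API, existing preprocessing logic can often be ported to the GPU with few changes.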

Swap Java-based engines for Velox for a 3X performance improvement


Report | Velox: 2023 Project to Watch


Velox is an embeddable columnar database engine designed for Arrow-based systems. As a modular, unifying standard, it benefits industries that use and develop data management systems. Learn how Velox breaks down data silos and accelerates data processing.


↳ Velox Data Infrastructure


“Data management systems like Presto and Spark typically have their own execution engines and other components. Velox can function as a common execution engine across different data management systems.”

Source: Facebook Engineering Blog. Diagram by Phillip Bell.
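To make the quote concrete, here is a toy sketch of the shared-engine pattern it describes. The classes below are illustrative Python stand-ins, not Velox APIs; in practice, host systems hand Velox a C++ query plan rather than Python objects.

class ScanNode:
    """A leaf plan node that produces rows."""
    def __init__(self, rows):
        self.rows = rows

class FilterNode:
    """A plan node that keeps rows matching a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

def execute(plan):
    """One engine evaluates plans lowered from any frontend."""
    if isinstance(plan, ScanNode):
        return list(plan.rows)
    if isinstance(plan, FilterNode):
        return [r for r in execute(plan.child) if plan.predicate(r)]
    raise TypeError(f"unknown plan node: {plan!r}")

# Two "frontends" (think Presto SQL and Spark DataFrames) lower their
# queries to the same plan nodes, so one engine executes both.
presto_style_plan = FilterNode(ScanNode([1, 5, 12]), lambda r: r > 4)
spark_style_plan = FilterNode(ScanNode([3, 8]), lambda r: r % 2 == 0)

print(execute(presto_style_plan))  # [5, 12]
print(execute(spark_style_plan))   # [8]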

Use what you have locally, spin up cloud test environments, and deploy to production seamlessly


Blog | Scale From Local to Distributed


Ibis is a Python analytics library that integrates with over 18 data execution backends, enabling developers to write, execute, and test data workflows locally, then seamlessly scale up to production workloads on distributed backends. In this featured resource, see how Ibis makes this possible using DuckDB, Spark, and Amazon EMR.


import ibis
from ibis import _
import pandas as pd
from pyspark.sql import SparkSession

# Set up pandas display options
pd.set_option("display.width", 0)
pd.set_option("display.max_columns", 99)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", '{:,.2f}'.format)

# A. Get a Spark session
spark = (
    SparkSession.builder
    .appName(name="Ibis-Rocks!")
    .getOrCreate()
)

# B. Connect the Ibis PySpark backend to the Spark session
con = ibis.pyspark.connect(spark)

# C. Read the Parquet data into an Ibis table
trip_data = con.read_parquet("s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet")

# D. Derive the total trip cost from its component fees
trip_data = trip_data.mutate(
    total_amount=_.base_passenger_fare + _.tolls + _.sales_tax
    + _.congestion_surcharge + _.tips
)

# E. Summarize trip counts, miles, and costs per license number
trip_summary = (
    trip_data.group_by([_.hvfhs_license_num])
    .aggregate(
        trip_count=_.count(),
        trip_miles_total=_.trip_miles.sum(),
        trip_miles_avg=_.trip_miles.mean(),
        cost_total=_.total_amount.sum(),
        cost_avg=_.total_amount.mean(),
    )
)

print(trip_summary.execute())

↳ Running Ibis on PySpark on EMR
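To see the local-to-distributed story end to end, here is a minimal sketch of the same query pointed at the DuckDB backend instead of PySpark. Only the connection and data source change; the sample file path below is hypothetical.

import ibis
from ibis import _

# In-process DuckDB connection: no cluster required for local development
con = ibis.duckdb.connect()
trip_data = con.read_parquet("fhvhv_tripdata_sample.parquet")

# The Ibis expressions are unchanged from the PySpark version above
trip_data = trip_data.mutate(
    total_amount=_.base_passenger_fare + _.tolls + _.sales_tax
    + _.congestion_surcharge + _.tips
)

trip_summary = trip_data.group_by(_.hvfhs_license_num).aggregate(
    trip_count=_.count(),
    trip_miles_avg=_.trip_miles.mean(),
)

print(trip_summary.execute())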

Dig into additional technical resources on redefining flexibility and maximizing productivity in data system design.

Discover more on our blog and follow us on social.

Resources