Avoid data serialization and deserialization penalties
Blog | Arrow Flight: A Primer
Serializing and deserializing data can be responsible for as much as 90% of the total time that it takes to move data, slowing things down and driving up costs. Arrow Flight combines gRPC, Protocol Buffers, and the Arrow libraries to provide a performant, easy-to-use RPC framework specialized for transferring Arrow data. Read the Flight primer for more details.
Arrow Flight RPC Documentation
Arrow Cookbooks
↳ Arrow Flight Software Stack
↳ Under the Hood: How Flight Sends Arrow Data
Get the flexibility of pandas, scale and performance of Spark
Tutorial | Scale Out to Spark with Ibis
Rewriting code to scale from local to distributed adds a non-trivial amount of complexity and risk. With Ibis, users can define data operations in Python, then execute on one of 18+ engines, such as BigQuery, Snowflake, or in this case Spark, without changing any code. Learn how Ibis provides the portability of Python analytics with the scale and performance of modern SQL.
Ibis Project Page
Ibis Getting Started Guide
Ibis Example Notebooks
import ibis
from ibis import _
from pyspark.sql import SparkSession

def build_features(table):
    # Parse the YYYYMMDD date column into a proper date type.
    user_logs = table.mutate(
        log_date=_.date.cast("string").to_timestamp("%Y%m%d").date()
    ).drop("date")
    total_songs = _.num_25 + _.num_50 + _.num_75 + _.num_985 + _.num_100
    user_logs_agg = user_logs.aggregate(
        by=["msno"],
        sum_total_secs=_.total_secs.sum(),
        avg_total_secs=_.total_secs.mean(),
        max_total_secs=_.total_secs.max(),
        min_total_secs=_.total_secs.min(),
        total_days_active=_.count(),
        first_session=_.log_date.min(),
        most_recent_session=_.log_date.max(),
        total_songs_listened=total_songs.sum(),
        avg_num_songs_listened=total_songs.mean(),
        percent_unique=(_.num_unq / total_songs).mean(),
    )
    return user_logs_agg

session = SparkSession.builder.appName("kkbox-customer-churn").getOrCreate()
# Without legacy mode, the "%Y%m%d" timestamp parsing above needs changing.
session.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = session.read.parquet("gs://voltrondata-demo-data/kkbox-churn/user_logs/*.parquet")
df.createOrReplaceTempView("user_logs")
ibis_con = ibis.pyspark.connect(session)
table = ibis_con.table("user_logs")
user_logs_agg = build_features(table)
ibis_con.create_table("user_logs_agg", user_logs_agg)
↳ Connecting Ibis to distributed systems
Standardize data movement for flexible systems
Tutorial | Scale Out to Spark with Ibis
Data communication standards like Arrow are the backbone of a modular, open stack. This proof-of-concept pairs a SQLAlchemy driver built on Arrow Flight SQL with Superset as the interface. Experience the flexibility of Arrow Flight SQL and ADBC for scale-out workflows.
↳ A local, dockerized Superset client connected to a Flight SQL-based DuckDB database server
Dig into additional technical resources discussing how to optimize costs and maximize productivity for data system design.
Discover more on our blog and follow us on social.
Resources