Spotlight

Don’t Rip and Replace Your Platform. Augment with Standards.

Let's evaluate what you have, determine your goals, and get to work. We design and build composable, modular data systems using open standards to maximize flexibility. Get to know the technologies we use and how they shape the future of data systems.

Avoid data serialization and deserialization penalties

Arrow Flight Logo

Blog | Arrow Flight: A Primer

url icon

Serializing and deserializing data can be responsible for as much as 90% of the total time that it takes to move data, slowing things down and driving up costs. Arrow Flight combines gRPC, Protocol Buffers, and the Arrow libraries to provide a performant, easy-to-use RPC framework specialized for transferring Arrow data. Read the Flight primer for more details.

Arrow Flight Software Stack

↳ Arrow Flight Software Stack



Under the Hood: How Flight Sends Arrow Data

↳ Under the Hood: How Flight Sends Arrow Data

Get the flexibility of pandas, scale and performance of Spark

Ibis Logo

Tutorial | Scale Out to Spark with Ibis

url icon

Rewriting code to scale from local to distributed adds a non-trivial amount of complexity and risk. With Ibis, users can define data operations in Python, then execute on one of 18+ engines, such as BigQuery, Snowflake, or in this case Spark, without changing any code. Learn how Ibis provides the portability of Python analytics with the scale and performance of modern SQL.


  import ibis
  from ibis import _
  from pyspark.sql import SparkSession

  def build_features(table):
    user_logs = table.mutate(log_date =
  _.date.cast("string").to_timestamp("%Y%m%d").date()).drop("date")
    user_logs_agg = user_logs.aggregate(
      by=["msno"],
      sum_total_secs = _.total_secs.sum(),
      avg_total_secs = _.total_secs.mean(),
      max_total_secs = _.total_secs.max(),
      min_total_secs = _.total_secs.min(),
      total_days_active = _.count(),
      first_session = _.log_date.max(),
      most_recent_session = _.log_date.min(),
      total_songs_listened = (_.num_25 + _.num_50 + _.num_75 + _.num_985 + _.num_100).sum(),
      avg_num_songs_listened = (_.num_25 + _.num_50 + _.num_75 + _.num_985 + _.num_100).mean(),
      percent_unique = (_.num_unq/(_.num_25 + _.num_50 + _.num_75 + _.num_985 + _.num_100)).mean(),)

  return user_logs_agg

  session =
  SparkSession.builder.appName("kkbox-customer-churn").getOrCreate
  #if you don’t use legacy mode here, you’ll need to change the timestamp formatting code
  session.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
  df =
  session.read.parquet("gs://voltrondata-demo-data/kkbox-churn/user_logs/*.parquet")
  df.createOrReplaceTempView("user_logs")

  ()ibis_con = ibis.pyspark.connect(session)
  table = ibis_con.table("user_logs")

  user_logs_agg = build_features(table)
  ibis_con.create_table("user_logs_agg", user_logs_agg)

↳ Connecting Ibis to distributed systems

Standardize data movement for flexible systems

Arrow Flight Logo

Tutorial | Scale Out to Spark with Ibis

url icon

Data communication standards like Arrow are the backbone of a modular, open stack. The SQLAlchemy driver leverages the power of Arrow via FlightSQL and deploys Superset (still in POC) as the interface. Experience the flexibility of Arrow FlightSQL and ADBC for scale out workflows.

A local, dockerized Superset client connected to a Flight SQL-based DuckDB database server

↳ A local, dockerized Superset client connected to a Flight SQL-based DuckDB database server



Engineering
Open Source Standards
Technical Level 3/5

Dig into additional technical resources discussing how to optimize costs and maximize productivity for data system design.

Discover more on our blog and follow us on social.

Resources