Nov 07, 2023
Google Chooses Ibis to Enable Dataframes in BigQuery
Kae Suarez
When it comes to storing and accessing large datasets from anywhere, Google BigQuery is a household name, and it even supports public use through the BigQuery Public Datasets Platform. BigQuery enables high-speed access to large volumes of data, which makes it hard to ignore. In the past, however, SQL was the main point of entry. For many users this was adequate, but for those accustomed to Python-native workflows, it was a hurdle to working with such a beneficial platform.
Recently, Google has remedied this with the release of BigQuery Dataframes, or bigframes, as the package is called. This open-source solution enables pandas and scikit-learn style APIs for BigQuery. If this addition to the open-source ecosystem isn’t exciting enough, we are happy to note that it uses the Python interface for data we love to develop and talk about: Ibis.
In this post, we will dive into what makes BigQuery Dataframes so exciting for integrating a powerful data source into any of your workflows, and how Ibis enabled this new tool.
BigQuery Dataframes
As noted before, BigQuery’s main interface is SQL. SQL is useful for accessing and analyzing data, but some tools are not available in pure SQL. Machine learning (ML) workflows, for example, push users toward tools like PyTorch and XGBoost. As a result, users can easily end up downloading data from BigQuery and then analyzing it in their tool of choice. This export-and-convert step adds processing overhead to a platform intended for speed, and that overhead grows with the size of the dataset.
BigQuery Dataframes addresses this problem by adding a dataframe API that is used from Python. Because calls are made through this API, data can be queried efficiently for the use case at hand without the download-and-convert step, which should enable more ML workflows to integrate BigQuery in the future.
This solution is convenient for users, but how was it done? How did Google implement a dataframe API on top of SQL?
Fortunately, the open-source ecosystem has Ibis. In case you are unfamiliar, Ibis is a Python interface for data analytics, and you can learn more about it here:
- Breaking Down the First Principles of Ibis
- 383 Ibis Expressions and the Only Language You Need is One
- Getting Started with Ibis
By leveraging Ibis’s BigQuery backend (one of the 18+ backends that Ibis supports), Google was able to leave the work of querying data and bringing it into Python to Ibis and focus on building the bigframes API itself. This is an exciting new use case for Ibis: a glue component that cuts down the time needed to produce versatile workflows that integrate data sources that are traditionally SQL-exclusive. For a look at how they did it, check out the BigQuery Dataframes repository on GitHub, because the entire product is open source, just like Ibis.
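To make the general idea of a dataframe API on top of SQL more concrete, here is a minimal toy sketch. It is not how Ibis or bigframes is implemented. It uses Python's built-in sqlite3 module as a stand-in for BigQuery, and the LazyTable class with its filter and group_mean methods are hypothetical names invented for this illustration. The point it shows is that method calls only record what the user wants, the recorded steps are compiled into a single SQL statement, and the filtering and aggregation run inside the SQL engine, so only the small result is brought into Python.

```python
import sqlite3


class LazyTable:
    """Toy dataframe-style wrapper (hypothetical, for illustration only).

    Each method returns a new object that records the requested operation.
    Nothing is sent to the database until execute() is called.
    """

    def __init__(self, conn, name, filters=(), group_key=None, agg=None):
        self.conn = conn
        self.name = name
        self.filters = tuple(filters)
        self.group_key = group_key
        self.agg = agg  # (sql_function, column) pair, e.g. ("AVG", "body_mass_g")

    def filter(self, condition):
        # Record a WHERE condition; no query runs here.
        return LazyTable(self.conn, self.name,
                         self.filters + (condition,), self.group_key, self.agg)

    def group_mean(self, key, column):
        # Record a grouped average; no query runs here.
        return LazyTable(self.conn, self.name, self.filters, key, ("AVG", column))

    def to_sql(self):
        # Compile the recorded operations into one SQL string.
        # Plain string formatting is used for brevity and is NOT safe
        # for untrusted input.
        if self.agg:
            func, col = self.agg
            select = f"{self.group_key}, {func}({col}) AS {col}_mean"
        else:
            select = "*"
        sql = f"SELECT {select} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        if self.agg:
            sql += f" GROUP BY {self.group_key} ORDER BY {self.group_key}"
        return sql

    def execute(self):
        # Only now does the engine do any work; Python receives the result rows.
        return self.conn.execute(self.to_sql()).fetchall()


# Stand-in for a remote table: a small in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT, body_mass_g REAL)")
conn.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", 3700.0), ("Adelie", 3800.0), ("Gentoo", 5000.0), ("Gentoo", 200.0)],
)

expr = (
    LazyTable(conn, "penguins")
    .filter("body_mass_g > 1000")
    .group_mean("species", "body_mass_g")
)
sql = expr.to_sql()
result = expr.execute()
print(sql)
print(result)
```

A real expression library does far more than this, including typed expressions, many more operations, and support for multiple SQL dialects, but the deferred build-then-compile-then-execute shape is the part worth noticing.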
Innovating in the Open Source Ecosystem
Ibis is being adopted by leading technology organizations like Microsoft, Heavy.AI, DuckDB, b.telligent, and now Google Cloud. Open source is at its best when the ecosystem can integrate to enable modular and composable stacks that can suit any use case.
Voltron Data actively maintains and contributes to the Ibis project, and we encourage developers to get involved in building it. If you’re interested in bringing composability to your data system, check out our Product page or see how open-source software can augment your system to increase flexibility and reduce your total cost of ownership.