Aug 24, 2023
The Standard Dataframe Language for Data Analysis and Data Engineering
Marlene Mhangami and Fernanda Foertter
A Growing Landscape of Tools
As the field of data engineering and data analysis continues to evolve, so does the range of tools available for these tasks. A few years ago, pandas was the go-to framework for all things data in the Python ecosystem. Though it has maintained its popularity over time, it's become clear that pandas isn't always the best tool for the job. Numerous alternatives have been created to provide better-suited solutions for specific tasks. For example, the workflows associated with data engineering and data analysis are not identical; they require different approaches and sometimes completely different tools.
Data analysis focuses on extracting insights, generating charts, and understanding data, often in a notebook environment. For smaller datasets, using pandas with visualization packages like Matplotlib or Plotly works well. For datasets larger than memory, a library like Dask provides a more performant alternative and, paired with its data visualization tool, Datashader, lets you plot millions or billions of data points in a single image. On the other hand, as datasets have grown even larger, Dask itself is slower than newer tools built to handle big data.
Additionally, data engineering tasks emphasize writing production code that runs regularly, with a focus on speed, reliability, and maintaining data pipelines. SQL-based systems tend to be a good choice for such workflows, and tools like Polars, DuckDB, Snowflake, and BigQuery are excellent options for working with data in production while also offering fast analysis on big datasets.
The Future: A More Modular Approach
There is a growing consensus that the future of data is modular. In a talk last year, Marc Garcia, a core maintainer for pandas, shared that instead of relying on a single tool like pandas to solve all data analysis problems, the focus is shifting towards integrating different backends and data representations to suit specific use cases. This approach has been advocated by many others and allows for greater flexibility, better customization, and the ability to choose the right tool for each aspect of a data pipeline.
Composable data systems are more efficient, but building them requires shared standards. Working with tabular data in PySpark is very different from doing the same in DataFusion, Druid, or MySQL. As a user recently pointed out in this thread on social media, "It is time for a standard dataframe 'language'." We need a dataframe language that not only provides a standard interface for the various tools available but also makes the process of building modular data systems easy and efficient.
Ibis as a Standard Dataframe Language
Ibis is an engine-agnostic framework for querying data. This means that no matter which engine you're using to wrangle your data, Ibis's syntax and concepts stay the same, whether your data is in a local pandas DataFrame or in a Parquet file stored on the Hadoop Distributed File System (HDFS) and queried through PySpark. Ibis is a unifying dataframe language that lets you easily switch between tools (backends) whenever you need to.
Currently, Ibis supports 18+ tools across the Python data ecosystem and the Ibis team is growing this list at a rapid rate. If you’d like to give it a spin, the Ibis team created a handy ‘Getting Started Guide’ that walks you through your first steps.
Voltron Data designs and builds composable data systems for organizations using open source projects like Ibis, Arrow, Substrait, and more. Learn about our approach by visiting our Product page.
Photo by Lucas Gallone