Jul 27, 2023

Use LLMs with Python UDFs to Query & Generate Tabular Data in Natural Language

Cody Peterson, Marlene Mhangami

body of water with reflection of bridge and building

As Generative AI and Large Language Models (LLMs) continue to improve, they have the potential to provide a more accessible, context-aware way to interact with data. With modern LLMs like ChatGPT, users can simply provide natural language instructions, which the model can then interpret and act on. For tabular data, modern LLMs can be used to query, summarize, or generate content based on instructions given to them in English. One of the challenges data engineers in this space face is how to connect structured data stored in various systems to the LLM of their choice. Ibis is a Python framework that solves this problem by providing an efficient, standard way to connect to data of any size stored almost anywhere.

Why Use LLMs with Tabular Data?

LLMs can be particularly helpful when extracting insights from data since they can understand natural language in a way traditional programming languages do not. Their training data gives them access to a wide range of general knowledge that they can draw from to complete various tasks. For example, say I had a table that contained the names of a number of different songs. Depending on when it was trained, I could use an LLM to create a new column that contained the release dates of the songs or even the song lyrics. This is something I could not do easily using just a programming language. Using LLMs in this way can, for example, make machine learning practitioners more productive as they can generate content for model training on the fly.

The latest LLMs can interact with tabular data and know how to query data frames fairly accurately when given instructions in English. The ability to use natural language to ask questions about data lowers the barrier to entry for individuals who may not have expertise in data analysis or programming. This means smaller teams can accomplish more than ever. For users who might feel nervous about the accuracy of results produced by an LLM, setting the temperature to 0 makes the model's output deterministic and more consistent, though it does not guarantee factual accuracy. For jobs where accuracy is critical, it helps to be able to switch to a programming language like Python to double-check that the LLM has carried out a specific operation correctly. With a Python library like Ibis, this is easy to do!

How Ibis Lets You Apply LLMs to Tabular Data

Using Ibis allows you to easily augment your code to add LLMs to your system where they are needed. The Ibis 6.0.0 release added Python UDFs, which provide a seamless way to connect tabular data to LLMs. Using a lambda function in Ibis allows you to pass a column of data through a UDF to the LLM. Let's take a look at some practical examples of what this looks like in code.

Example: Generating Movie Descriptions From Movie Titles

The first thing we need to do is install the necessary dependencies. In this example, we will use a model from OpenAI, but you are welcome to swap in an LLM provider of your choice. (GPT4All is a great open source model provider.) We will run all the code in a Jupyter notebook, so results persist between cells and the OpenAI API is not called again on every run.

# remove the exclamation points to run these in the terminal
!pip install ibis-framework
!pip install openai

Next, let’s import the libraries we will need and grab some example data. We’ll be using IMDB data that Ibis provides as part of the built-in examples module. We’ll also use openai.api_key to store the key that’s needed to access their models. You can find your API Key in the user settings of your OpenAI account.

import ibis
import openai

from ibis import _
from ibis.expr.operations import udf
ibis.options.interactive = True

# replace the value below with your personal OpenAI API key
openai.api_key = "your_api_key"

#getting IMDB data
imdb_titles = ibis.examples.imdb_title_basics.fetch()
imdb_ratings = ibis.examples.imdb_title_ratings.fetch()

Once the data is available, we'll join the imdb_titles and imdb_ratings tables and do some pre-processing to make sure we're working with only the data that we want. We'll also order the movies by rating in descending order so the most highly rated ones show up first. Since this is just an example, I'll use the limit function to grab only the first 20 rows of the table.

joined = imdb_titles.join(imdb_ratings, "tconst")
t = joined.relabel("snake_case")
t = t.filter((_.title_type == "movie") & (_.num_votes >= 50_000))
t = t.select("tconst", "primary_title", "average_rating", "num_votes")
t = t.order_by(_.average_rating.desc())
t = t.limit(20)

Let’s take a look at what the resulting table looks like.

tconst      primary_title                                   average_rating  num_votes
string      string                                          float64         int64
──────────  ──────────────────────────────────────────────  ──────────────  ─────────
tt0111161   The Shawshank Redemption                        9.3             2710200
tt0068646   The Godfather                                   9.2             1882138
tt0050083   12 Angry Men                                    9.0             800656
tt0071562   The Godfather Part II                           9.0             1284950
tt0108052   Schindler's List                                9.0             1369259
tt0468569   The Dark Knight                                 9.0             2683164
tt0167260   The Lord of the Rings: The Return of the King   9.0             1865229
tt0110912   Pulp Fiction                                    8.9             2080768
tt0060196   The Good, the Bad and the Ugly                  8.8             769197
tt0109830   Forrest Gump                                    8.8             2105835
…           …                                               …               …

Building a User Defined Function with Ibis

As you can see, we have columns that contain the movie titles, the average rating, and the number of votes each film received. To clearly demonstrate the power of LLMs, we will now write a UDF that takes in the name of a movie and outputs a short description of it. For this example, we'll use the Chat Completion API from OpenAI, a dedicated API that is the preferred way to interact with the GPT-3.5-Turbo and GPT-4 models. There are two main parameters: the model parameter, which at the time of writing supports the GPT-3.5-Turbo and GPT-4 models, and the messages parameter, which lets you specify a role and content for each message. We'll use the user role, which allows us to give the model direct instructions; in this case, to describe a movie. We'll also set the temperature to 0 so the model's output is deterministic. Here's what our UDF, named llm_describe, looks like.

## Ibis/DuckDB UDF
@udf.scalar.python
def llm_describe(s: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": f"Describe the movie {s} in a few sentences."}],
        temperature=0,
    )
    return response.choices[0].message.content

We now need to pass the column of our data to the LLM so we’ll go ahead and use the mutate method with a Lambda function. This will pass each member of a column to the UDF and create a new column with the results. We’ll name the new column movie_descriptions.

t_new = t.mutate(movie_descriptions=lambda t: llm_describe(t.primary_title))

This doesn't take much time, and looking at the new column, the descriptions look accurate!

tconst      primary_title                                   average_rating  num_votes  movie_descriptions
string      string                                          float64         int64      string
──────────  ──────────────────────────────────────────────  ──────────────  ─────────  ─────────────────────────────────────────────────────────────────────────────────
tt0111161   The Shawshank Redemption                        9.3             2710200    "The Shawshank Redemption" is a powerful drama that follows the story of Andy D…
tt0068646   The Godfather                                   9.2             1882138    The Godfather is a classic crime drama film directed by Francis Ford Coppola. S…
tt0050083   12 Angry Men                                    9.0             800656     "12 Angry Men" is a gripping courtroom drama that takes place almost entirely i…
tt0071562   The Godfather Part II                           9.0             1284950    The Godfather Part II is a 1974 crime drama film directed by Francis Ford Coppo…
tt0108052   Schindler's List                                9.0             1369259    Schindler's List is a powerful and emotionally gripping film directed by Steven…
tt0167260   The Lord of the Rings: The Return of the King   9.0             1865229    "The Lord of the Rings: The Return of the King" is the epic conclusion to the f…
tt0468569   The Dark Knight                                 9.0             2683164    "The Dark Knight" is a gripping superhero film directed by Christopher Nolan. I…
tt0110912   Pulp Fiction                                    8.9             2080768    Pulp Fiction is a nonlinear crime film directed by Quentin Tarantino. The movie…
tt15097216  Jai Bhim                                        8.8             204590     Jai Bhim is a thought-provoking Indian drama film that revolves around the life…
tt0060196   The Good, the Bad and the Ugly                  8.8             769197     "The Good, the Bad and the Ugly" is a classic spaghetti western film directed b…
…           …                                               …               …          …

Example: Query a Table in Natural Language

Generating information based on your existing columnar data is a great way to use LLMs. As I mentioned earlier, a useful function of LLMs is to query tables using a natural language like English. Newer models such as GPT-3.5-Turbo are able to understand tabular data and answer questions about a table given in English. To demonstrate how this works we will create a UDF that takes in a question about the data and the table itself. The result will be an answer to the question.

Ibis uses lazy execution, an evaluation strategy that delays the evaluation of an expression until its value is needed. In other words, the resulting table from a query is only computed when explicitly requested. This accelerates workflows, improves memory management, and makes code more readable. To get a table we can pass to our LLM, we can call either to_pandas or execute. I'll use the former to store the table we just made above as a pandas dataframe.

table = t_new.to_pandas()

Next, we’ll create a new UDF that can accept a question and give an answer.

## plain Python function (called directly, not registered as an Ibis UDF)
def ask_the_llm(s: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": s}],
        temperature=0,
    )
    return response.choices[0].message.content

We can test out the function by passing in an f-string that contains the question and the table.

answer = ask_the_llm(f'Which title has the highest number of votes? {table}')
**'The title with the highest number of votes is "The Shawshank Redemption" with 2,710,200 votes.'**

The answer is correct and consistent with what we were expecting. Let’s test it out again by asking for the median average rating of the movies in the table.

answer = ask_the_llm(f'What is the median average rating of all the movie titles? {table}')
**'The median average rating of all the movie titles is 8.8.'**

To confirm that these answers are correct we can use the equivalent methods in Ibis and compare the results.

# double-check the highest number of votes
t.order_by(_.num_votes.desc()).select("primary_title", "num_votes").limit(1)

# double-check the median average rating
t.average_rating.median()

As you can see the answers are the same!


Ibis lets you leverage the power of LLMs to do things that were previously not possible with tabular data. Ibis connects to a range of backends including MySQL, PostgreSQL, DuckDB, Snowflake, and Google BigQuery, making it possible to augment your current systems to integrate LLMs as quickly and painlessly as possible. As shown above, you can use LLMs to generate helpful data and query tables using English. Teams can utilize these tools for faster training data generation, data labeling, and language-agnostic tabular data querying.

Voltron Data helps enterprises design and build composable data systems using open source tools like Ibis. To learn more about our approach, check out our Product page.

Photo by Eric Aiden