Jul 06, 2023

Use LangChain & Ibis to Chat with Data Stored Anywhere

Marlene Mhangami

TL;DR LangChain is an AI framework that allows users to leverage large language models (LLMs) on their own data. Ibis accelerates the adoption of LangChain by allowing developers to connect and query their data using AI in a few short lines of code as demonstrated in this blog.

If you’re a developer interested in AI, you’ve probably heard of LangChain. It’s one of the fastest-growing frameworks in the world and currently has almost 50k stars on GitHub. But what exactly is LangChain, and why should you be using it?

If you’ve ever tried ChatGPT, LLaMA, or LaMDA, you’ll know that these Large Language Models (LLMs) offer an impressive amount of general knowledge and can answer a wide range of questions. But what if you want to ask questions about your own data, whether that’s a book, a PDF, or a proprietary database? This is where LangChain comes into play, allowing LLMs to interact with your pre-existing data.

While this is a valuable proposition, to get the most out of this tool you need to be able to easily connect LangChain to your data, no matter where that data is stored. This can be quite difficult, particularly if you have data stored across multiple systems. To solve this problem, we’ll use Ibis, which gives you the flexibility to quickly connect LLMs to multiple data sources, across multiple systems, in a few lines of code.

What is LangChain?

According to their website: LangChain is a framework for developing applications powered by language models. It enables applications that are:
  • Data-aware: connect a language model to other sources of data
  • Agentic: allow a language model to interact with its environment

LangChain allows you to connect your own sources of data to LLMs like OpenAI’s GPT-4 or Meta’s open source Llama model. With it, you can reference an entire database filled with your own data and get answers about it from the LLM of your choice. LangChain is currently available as either a Python or a TypeScript package.

Why Use LangChain?

At a conference I recently attended, I heard Paul Nordstrom, one of the first lead engineers at Amazon, say ‘…the most powerful thing in the world are abstractions.’ I agree with this sentiment. Tools that help users interact with and build on cutting-edge systems with minimal onboarding time end up on top, and this has been the case with LangChain.

There are three main value propositions of LangChain: Components, Chains, and Agents. Let’s look at each of these.

Three Value Propositions of LangChain

Diagram inspired by Rabbitmetrics YouTube Video:
LangChain Explained in 13 Minutes | Quick Start Tutorial for Beginners

Components: These are abstractions or modules for working with language models and include LLM Wrappers, Prompt Templates, and Indexes for relevant information retrieval. The components are modular, making it easy to build a system that fits your task, and they can be linked into “chains” for tailored workflows.

Chains: Chains are groups of components assembled to complete a specific task. A chain is a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case. The most commonly used type of chain is an LLMChain, which combines a PromptTemplate, a Model, and Guardrails to take user input, format it accordingly, pass it to the model to get a response, and then validate and fix (if necessary) the model output. A minimal sketch is shown below.
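
As a quick illustration, here is a minimal LLMChain sketch, assuming an OpenAI API key is set in the environment; the prompt text and example input are illustrative placeholders rather than code from this post.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A reusable prompt template (one LangChain "component")
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)

# Chain the template and a chat model together
chain = LLMChain(llm=ChatOpenAI(temperature=0.0), prompt=prompt)

# Swapping in a different model later only changes the llm argument
print(chain.run("colorful socks"))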

Agents: An agent is a component that has access to a suite of tools and can decide which tool to use based on the user’s input. Agents allow LLMs to interact with their environment in a specific way. There are two main types of agents: “Action Agents” and “Plan-and-Execute Agents”. Action Agents decide on an action and execute it one step at a time; these are more conventional and work best for small tasks. Plan-and-Execute Agents first decide on a plan of actions, and then execute those actions one at a time; these are great for more complex or long-running tasks.

All of these abstractions are useful, and it’s LangChain’s modularity in particular that makes it exciting. The ability to switch between models and reuse existing code with the same prompt templates and chains frees users from having to commit to a particular brand of LLM or figure out how to work with foundation models from scratch. LangChain supports LLMs from OpenAI, Cohere, Hugging Face, and GPT4All, to name a few. In the past, companies have been reluctant to use commercial LLMs because of privacy concerns; LangChain provides alternatives. Having the option to choose which model to use on your data, and when to use it for a specific task, is a superpower LangChain provides. This is only one of several ways LangChain’s composability adds value to users.
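
Returning to agents for a moment, here is a minimal Action Agent sketch, again assuming an OpenAI API key is set; the llm-math tool and the sample question are illustrative choices, not part of the original example.

from langchain.agents import initialize_agent, load_tools, AgentType
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0.0)

# Give the agent a calculator tool; the agent decides when to use it
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
agent.run("What is 15% of 300?")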

Using Ibis With LangChain For More Composability

Ibis is a powerful companion to LangChain for similar reasons, but instead of enabling composability with LLMs, it does so with data. Ibis is a Python dataframe library that allows you to connect to and analyze data of any size, stored anywhere. It supports 17+ backends including MySQL, PostgreSQL, Snowflake, DuckDB, and Google BigQuery.

One might ask, why not just use pandas? Pandas certainly continues to be one of the first Python dataframe libraries data scientists reach for when working with data. However, as datasets have grown larger, it has been stretched to its limits and breaks down when data is too big. SQL-based systems handle larger data better, but learning a new flavor of SQL, depending on the system you’re using, can be time-consuming. With a similar ideology to LangChain, Ibis lets you reuse the same Python code when switching between databases, generating the SQL needed to query your database of choice. This lets developers work with data of any size stored anywhere, improves developer productivity, and ultimately leads to a lower total cost of ownership (TCO).
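
Here is a small sketch of that idea, using the DuckDB backend and a hypothetical orders.csv file with region and amount columns; only the connect call would change for another backend.

import ibis

# Connect to an in-memory DuckDB database (swap this line for
# ibis.postgres.connect(...) or ibis.bigquery.connect(...) as needed)
con = ibis.duckdb.connect()

# Hypothetical table of orders with 'region' and 'amount' columns
orders = con.read_csv("orders.csv")

# One expression, any backend: total order amount per region
expr = orders.group_by("region").aggregate(total=orders.amount.sum())

print(ibis.to_sql(expr))  # inspect the SQL Ibis generates for this backend
print(expr.execute())     # run it and get a pandas DataFrame back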

Code Example: Chatting to Google Ads Data Stored in the Cloud

Now that we have a better understanding of the tools we’re using, let’s look at some code. As an example, we’ll be accessing data stored in the cloud through Google BigQuery.

Connecting to the data

For this code to run smoothly, you’ll need to install the following dependencies.

!pip install 'ibis-framework[bigquery]'
!pip install langchain
!pip install openai

We’ll start by using Ibis to connect to a Google BigQuery public dataset with demo data from the Google Ads account of an e-commerce site.

import os
import ibis
ibis.options.interactive = True

# Add the credentials from your Google Cloud account
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/username/folder/project_name.json"

# Connect to BigQuery using your project_id and dataset_id
bqconn = ibis.bigquery.connect(
    project_id='userproject-abcdef',
    dataset_id='bigquery-public-data.google_analytics_sample'
)

# This grabs a table with data from a Google Analytics session for one day
ga = bqconn.table('ga_sessions_20160801', database='bigquery-public-data.google_analytics_sample')

If you need some guidance on how to get your Google Application Credentials, check out this post I wrote. After running this code, we have a table that contains ad data for August 1, 2016. The table should look like this:

[Table preview, abridged for readability. The table has the columns visitorId, visitNumber, visitId, visitStartTime, date, totals, trafficSource, device, geoNetwork, customDimensions, hits, fullVisitorId, userId, channelGrouping, and socialEngagementType. Note that totals, trafficSource, device, and geoNetwork are nested struct columns, while customDimensions and hits are arrays of structs.]

Connecting to the LLM

For this example, I’ll use OpenAI’s gpt-3.5-turbo model. To access it, we’ll need to set an OpenAI API key. Once that’s set, we can connect to the model by importing ChatOpenAI from LangChain’s chat models. I’ll set the temperature parameter (which determines how creative you want the model to be) to 0 since we want the answers to be as objective as possible.

# Replace 'apikey' with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = 'apikey'

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.0)

Currently, the main tabular data LangChain agent I’ve found is called create_pandas_dataframe_agent. We’ll need to import it and pass it a pandas dataframe for analysis. In the last line of the previous code block, we created a table with an Ibis connection to BigQuery, naming it ga. Conveniently, we can turn this table into a pandas dataframe by simply calling ga.execute().

This is why Ibis is so valuable in the process: we’re connecting a data system that currently does not have a wrapper in LangChain to the LLM of our choice. In this example it’s Google BigQuery, but you could easily replace that with Snowflake or PostgreSQL. Ibis makes these connections seamless and fast. Better still, using Ibis and LangChain together, we can change the data source or the LLM without having to rework any logic in the code.
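
For instance, here is a hedged sketch of how the connection step would change if the same data lived in PostgreSQL instead; the connection parameters are placeholders.

# Hypothetical: only the connection changes, the rest of the code stays the same
pgconn = ibis.postgres.connect(
    host='localhost',
    user='username',
    password='password',
    database='analytics',
)
ga = pgconn.table('ga_sessions_20160801')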

Here’s all the code we need to give the agent our data so it’s ready to answer our questions. I set the verbose parameter to True so we can see how the agent is working.

from langchain.agents import create_pandas_dataframe_agent

ga_pd = ga.execute()
ga_pd_agent = create_pandas_dataframe_agent(chat, ga_pd, verbose=True)

Great! Now we’re ready to ask our agent some questions about our table. The first thing I wanted to know was the total number of page views on the site that day. Here’s the code I used. (As a forewarning, you’ll probably get an error if you run this.)

# This gives us an error because the model struggles with the nested data in the totals column
ga_pd_agent.run('What was the total number of page views?')

If you ran the above code and got an 'Agent stopped due to iteration limit or time limit.' error as I did, it’s because our table is too complex. If you look back, you’ll see that several columns contain nested data (dictionaries or lists of dictionaries in a column). The page views are nested in a dictionary within the totals column. The model I used timed out while trying to access and analyze that data.
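
If you want to confirm which columns are nested, one small sketch (not from the original post) is to print the table’s schema with Ibis:

# struct<...> and array<...> types in the output mark the nested columns
print(ga.schema())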

To optimize your model’s chances of success and to save money, it’s best to wrangle your data so you provide the columns you need in a format that is as simple as possible. Ibis not only helps you access data from different sources, it’s also great for data wrangling. We’ll use it to select relevant columns and unpack the nested data. I’ll create one table called ga_totals that contains fields like the visitor ID, page views, and transaction revenue. I’ll also create another table called ga_source that tells us things like the traffic source, the ad content, and which continent or country the visitor came from.

# Isolate the columns and unnest the data using the unpack method.
# Calling execute immediately converts the results to a pandas dataframe.
ga_totals = ga['date','fullVisitorId','visitNumber','totals'].unpack('totals').execute()
ga_source = ga['date','fullVisitorId','trafficSource','channelGrouping', 'geoNetwork', 'socialEngagementType'].unpack('trafficSource')
ga_source = ga_source.unpack('geoNetwork').execute()

Let’s preview the first table to see the result.

[Table preview, abridged for readability. ga_totals now has flat columns: date, fullVisitorId, visitNumber, visits, hits, pageviews, timeOnSite, bounces, transactions, transactionRevenue, newVisits, screenviews, uniqueScreenviews, timeOnScreen, totalTransactionRevenue, and sessionQualityDim. The nested totals struct has been unpacked into ordinary columns, with NULLs where no value was recorded.]

Now we can create a new agent for each table and use these to get fast, reliable responses.

totals_agent = create_pandas_dataframe_agent(chat, ga_totals, verbose=True)
source_agent = create_pandas_dataframe_agent(chat, ga_source, verbose=True)

Let’s ask the totals_agent what the total page views are and see the response.

totals_agent.run('What was the total number of page views?')
Entering new  chain...
    Thought: I need to sum up the values in the 'pageviews' column of the dataframe.
    Action: python_repl_ast
    Action Input: df['pageviews'].sum()
    Observation: 9843
    Thought: I have the sum of all pageviews in the dataframe.
    Final Answer: The total number of page views is 9843.
    
    Finished chain.

    'The total number of page views is 9843.'

Excellent, that works quickly and as expected. Since we set the verbose parameter to True, we can see the agent chaining thoughts together to reach the final answer. It’s a little dystopian to read but useful for verifying that you’re getting the correct results.

Let’s follow this up by asking what the total transaction revenue was for the day.

totals_agent.run('What was the total transaction revenue?')
Entering new  chain...
    Thought: I need to sum up the values in the 'transactionRevenue' column.
    Action: python_repl_ast
    Action Input: df['transactionRevenue'].sum()
    Observation: 6116060000.0
    Thought:I now know the total transaction revenue is 6116060000.0.
    Final Answer: 6116060000.0
    
    Finished chain.

    '6116060000.0'

This is very helpful information that we’re able to get in seconds! (One caveat: in the Google Analytics BigQuery export, revenue fields like transactionRevenue are stored as the value multiplied by 10^6, so this figure corresponds to about 6,116 in the account’s currency.) Let’s do one more query, but now let’s ask the source_agent. We’ll find out the main source of our visitor traffic (for example, Google, direct visits to the site, social media, etc.).

source_agent.run('What was the main source of site visits?')
Entering new  chain...
    Thought: We need to look at the "source" column of the dataframe and find the most common value.
    Action: python_repl_ast
    Action Input:
    df['source'].value_counts().head(1)
    Observation: (direct)    872
    Name: source, dtype: int64
    Thought: The most common source of site visits is "(direct)".
    Final Answer: The main source of site visits is "(direct)".
    
    Finished chain.

    'The main source of site visits is "(direct)".'

Most visitors came directly to the site!

Wrapping Up

The ability to chat directly with data stored anywhere is powerful and can increase productivity dramatically. Agents like the ones we just created can be automated to send daily, weekly, or monthly digests with minimal effort. Using LangChain and Ibis together means anyone can go from raw data to meaningful insights in a few short sentences. If you’d like to try this out yourself, you can find the notebook here.
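
As a parting sketch of what that automation might look like, here is a minimal daily digest built on the agents above; the question list is illustrative, and scheduling and delivery (cron, email, Slack, etc.) are left out.

# A minimal daily-digest sketch, reusing totals_agent and source_agent from above
def daily_digest() -> str:
    tasks = [
        (totals_agent, 'What was the total number of page views?'),
        (totals_agent, 'What was the total transaction revenue?'),
        (source_agent, 'What was the main source of site visits?'),
    ]
    # Run each question through its agent and join the answers into one report
    return '\n'.join(agent.run(question) for agent, question in tasks)

print(daily_digest())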

Voltron Data designs and builds composable enterprise data systems using open source tools like Ibis. If you’re interested in learning more, check out our product page.

Photo by Ricardo Gomez Angel