Jul 06, 2023
Use LangChain & Ibis to Chat with Data Stored Anywhere
Marlene Mhangami
![Inside a building complex full of windows and beams](/assets/images/resources/2023-07-06/ricardo-gomez-angel-LsJpi7bH8L4-unsplash.jpg)
If you’re a developer interested in AI, you’ve probably heard of LangChain. It’s one of the fastest-growing frameworks in the world and currently has almost 50k stars on GitHub. But what exactly is LangChain, and why should you be using it?
If you’ve ever tried ChatGPT, LLaMA, or LaMDA, you’ll know that these Large Language Models (LLMs) offer an impressive amount of general knowledge and can answer a wide range of questions. But what if you want to ask questions about your own data, whether that’s a book, a PDF, or a proprietary database? This is where LangChain comes into play, by allowing LLMs to interact with your pre-existing data.
While this is a valuable proposition, to get the most out of this tool you need to be able to easily connect LangChain to your data, no matter where that data is stored. This can be quite difficult, particularly if you have data stored across multiple systems. To solve this problem we’ll use Ibis, which gives you the flexibility to quickly connect LLMs to multiple data sources, across multiple systems, in a few lines of code.
What is LangChain?
LangChain describes itself as a framework for developing applications powered by language models, enabling applications that are:
- Data-aware: connect a language model to other sources of data
- Agentic: allow a language model to interact with its environment
LangChain allows you to connect your own sources of data to LLMs, like OpenAI’s GPT-4, or Meta’s open source Llama model. With it, you can reference an entire database filled with your own data and get answers about it from the LLM of your choice. LangChain is currently available as either a Python or TypeScript package.
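As a quick taste, here is a minimal sketch of calling an LLM directly through LangChain’s Python package (it assumes you’ve installed langchain and openai and set an OpenAI API key; the prompt is made up for illustration):

# a minimal sketch: query an LLM directly through LangChain
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
print(llm("In one sentence, what is a dataframe?"))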
Why Use LangChain?
At a conference I recently attended, I heard Paul Nordstrom, one of the first lead engineers at Amazon, say ‘…the most powerful thing in the world are abstractions.’ I agree with this sentiment. Tools that help users interact with and build on cutting-edge systems with minimal onboarding time end up on top, and this has been the case with LangChain.
There are three main value propositions of LangChain: Components, Chains, and Agents. Let’s look at each of these.
Diagram inspired by the Rabbitmetrics YouTube video: LangChain Explained in 13 Minutes | Quick Start Tutorial for Beginners
Components: These are abstractions, or modules, for working with language models, and include LLM Wrappers, Prompt Templates, and Indexes for relevant information retrieval. The components are modular, making it easy to build a system that fits your task, and they can be linked into “chains” for tailored workflows.
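For instance, here is a hedged sketch of one such component, a reusable prompt template (the template text and placeholder values are made up for illustration):

# a reusable prompt template with two input variables
from langchain import PromptTemplate

template = "Answer this question about the {table} table: {question}"
prompt = PromptTemplate(input_variables=["table", "question"], template=template)
print(prompt.format(table="ga_sessions", question="How many rows are there?"))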
Chains: Chains are groups of components assembled to complete a specific task: a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case. The most commonly used type of chain is an LLMChain, which combines a PromptTemplate, a Model, and Guardrails to take user input, format it accordingly, pass it to the model for a response, and then validate and (if necessary) fix the model output.
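As a rough sketch of an LLMChain, reusing the prompt template from the previous example (the question is made up):

# chain a chat model and a prompt template together
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(table="ga_sessions", question="What might this table contain?"))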
Agents: An agent is a component that has access to a suite of tools and can decide which tool to use based on the user’s input. Agents allow LLMs to interact with their environment in a specific way. There are two main types of agents: “Action Agents” and “Plan-and-Execute Agents”. Action Agents decide on an action to take and execute that action one step at a time. These are more conventional and work best for small tasks. Plan-and-Execute Agents first decide on a plan of actions, then execute those actions one at a time. These are great for more complex or long-running tasks.
All of these are useful, and together they show that it’s LangChain’s modularity in particular that makes it exciting. The ability to switch between models and reuse existing code with the same prompt templates and chains frees users from committing to a particular brand of LLM, or from figuring out how to work with foundation models from scratch. LangChain supports LLMs from OpenAI, Cohere, Hugging Face, and GPT4All, to name a few. In the past, companies have been reluctant to use commercial LLMs because of privacy concerns; LangChain provides alternatives. Having the option to choose which model to use on your data, and when to use it for a specific task, is a superpower LangChain provides, and it’s only one of several ways LangChain’s composability adds value to users.
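Before moving on, here is a minimal sketch of an Action Agent that can reach for a calculator tool when it decides arithmetic is needed (the question is made up for illustration):

# an Action Agent with access to a math tool
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("What is 9843 divided by 24?")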
Using Ibis With LangChain For More Composability
Ibis is a powerful companion to LangChain for similar reasons, but instead of empowering composability with LLMs, it does so with data. Ibis is a Python dataframe library that allows you to connect to and analyze data of any size, stored anywhere. It supports 17+ backends including MySQL, PostgreSQL, Snowflake, DuckDB, and Google BigQuery.
One might ask: why not just use pandas? Pandas certainly continues to be one of the first Python dataframe libraries data scientists reach for when working with data. However, as datasets have grown larger, it gets stretched to its limits and breaks when data is too big. SQL-based systems handle larger data better, but learning a new flavor of SQL for each system you use can be time-consuming. With an ideology similar to LangChain’s, Ibis lets you reuse the same Python code when switching between databases, generating the SQL needed to query your database of choice. This lets developers work with data of any size, stored anywhere, improves developer productivity, and ultimately leads to a lower total cost of ownership (TCO).
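Here is a small sketch of that portability (the table and column names are made up; ibis.memtable runs on Ibis’s default DuckDB backend):

# the same Ibis expression runs locally or compiles to SQL for another backend
import ibis

t = ibis.memtable({"country": ["US", "DE", "US"], "pageviews": [3, 5, 2]})
expr = t.group_by("country").aggregate(total=t.pageviews.sum())

print(expr.execute())     # executes on the default (DuckDB) backend
print(ibis.to_sql(expr))  # the SQL Ibis generates; swap backends and the expression still works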
Code Example: Chatting to Google Ads Data Stored in the Cloud
Now that we have a better understanding of the tools we’re using, let’s look at some code. As an example, we’ll access data stored in the cloud through Google BigQuery.
Connecting to the data
For this code to run smoothly, you’ll need to install the following dependencies.
!pip install 'ibis-framework[bigquery]'
!pip install langchain
!pip install openai
We’ll start by using Ibis to connect to a Google BigQuery public dataset with demo data from the Google Ads account of an e-commerce site.
import os

import ibis

ibis.options.interactive = True

# add the credentials from your Google Cloud account
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/username/folder/project_name.json"

# connect to BigQuery using your project_id and dataset_id
bqconn = ibis.bigquery.connect(
    project_id='userproject-abcdef',
    dataset_id='bigquery-public-data.google_analytics_sample'
)

# grab a table with data from Google Analytics sessions for one day
ga = bqconn.table(
    'ga_sessions_20160801',
    database='bigquery-public-data.google_analytics_sample'
)
If you need some guidance on how to get your Google Application Credentials, check out this post I wrote. After running this code, we have a table containing ad data for August 1, 2016. The table should look like this:
| visitorId | visitNumber | visitId | visitStartTime | date | totals | trafficSource | device | geoNetwork | customDimensions | hits | fullVisitorId | userId | channelGrouping | socialEngagementType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| int64 | int64 | int64 | int64 | string | struct<visits: int64, hits: int64, pageviews: int64, timeOnSite: int64, bounces…> | struct<referralPath: string, campaign: string, source: string, medium: string, …> | struct<browser: string, browserVersion: string, browserSize: string, operatingS…> | struct<continent: string, subContinent: string, country: string, region: string…> | array<struct<index: int64, value: string>> | array<struct<hitNumber: int64, time: int64, hour: int64, minute: int64, isSecur…>> | string | string | string | string |
| NULL | 1 | 1470046245 | 1470046245 | 20160801 | {'visits': 1, 'hits': 24, ... +11} | {'referralPath': None, 'campaign': '(not set)', ... +7} | {'browser': 'Firefox', 'browserVersion': 'not available in demo dataset', ... +15} | {'continent': 'Europe', 'subContinent': 'Western Europe', ... +9} | [{...}] | [{...}, {...}, ... +22] | 895954260133011192 | NULL | Organic Search | Not Socially Engaged |
| NULL | 1 | 1470084717 | 1470084717 | 20160801 | {'visits': 1, 'hits': 24, ... +11} | {'referralPath': None, 'campaign': '(not set)', ... +7} | {'browser': 'Internet Explorer', 'browserVersion': 'not available in demo dataset', ... +15} | {'continent': 'Americas', 'subContinent': 'Northern America', ... +9} | [{...}] | [{...}, {...}, ... +22] | 0288478011259077136 | NULL | Direct | Not Socially Engaged |
| NULL | 3 | 1470078988 | 1470078988 | 20160801 | {'visits': 1, 'hits': 27, ... +11} | {'referralPath': None, 'campaign': '(not set)', ... +7} | {'browser': 'Safari', 'browserVersion': 'not available in demo dataset', ... +15} | {'continent': 'Americas', 'subContinent': 'Northern America', ... +9} | [{...}] | [{...}, {...}, ... +25] | 6440789996634275026 | NULL | Organic Search | Not Socially Engaged |
| NULL | 4 | 1470075581 | 1470075581 | 20160801 | {'visits': 1, 'hits': 27, ... +11} | {'referralPath': '/', 'campaign': '(not set)', ... +7} | {'browser': 'Chrome', 'browserVersion': 'not available in demo dataset', ... +15} | {'continent': 'Americas', 'subContinent': 'Northern America', ... +9} | [{...}] | [{...}, {...}, ... +25} | 8520115029387302083 | NULL | Referral | Not Socially Engaged |
| NULL | 30 | 1470099026 | 1470099026 | 20160801 | {'visits': 1, 'hits': 27, ... +11} | {'referralPath': None, 'campaign': '(not set)', ... +7} | {'browser': 'Chrome', 'browserVersion': 'not available in demo dataset', ... +15} | {'continent': 'Americas', 'subContinent': 'Northern America', ... +9} | [{...}] | [{...}, {...}, ... +25] | 6792260745822342947 | NULL | Organic Search | Not Socially Engaged |
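Before going further, a couple of optional sanity checks can confirm what we’ve connected to (illustrative one-liners, not part of the original walkthrough):

# list a few tables in the dataset and inspect the schema
bqconn.list_tables()[:5]
ga.columns   # column names only
ga.schema()  # full schema, including the nested struct and array columns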
Connecting to the LLM
For this example, I’ll use OpenAI’s gpt-3.5-turbo model. To access this model, we’ll need to provide and set an OpenAI API key. Once that’s set, we can connect to the model by importing ChatOpenAI from LangChain’s chat models. I’ll set the temperature attribute (which determines how creative you want the model to be) to 0, since we want the answers to be as objective as possible.
# replace with your own OpenAI API key
os.environ['OPENAI_API_KEY'] = 'apikey'

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.0)
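If you want to confirm the connection works before building anything on top of it, a quick throwaway check (the message is made up):

# send one message to verify the model responds
from langchain.schema import HumanMessage

chat([HumanMessage(content="Say hello in five words.")])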
Currently, the main tabular-data agent I’ve found in LangChain is called create_pandas_dataframe_agent. We’ll need to import it and pass it a pandas dataframe for analysis. In the last line of the previous code block, we created a table backed by an Ibis connection to BigQuery, naming it ga. Conveniently, we can turn this table into a pandas dataframe by simply calling ga.execute().
This is why Ibis is so valuable in the process: we’re connecting a data system that currently does not have a wrapper in LangChain to the LLM of our choice. In this example it’s Google BigQuery, but you could just as easily swap in Snowflake or PostgreSQL. Ibis makes these connections seamless and fast. Better still, using Ibis and LangChain together, we can change the data source or the LLM without having to rework any logic in the code.
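To see what that composability buys, here is a hypothetical helper (make_table_agent is my name for it, not part of either library) that turns any Ibis table expression, from any backend, into a LangChain dataframe agent:

from langchain.agents import create_pandas_dataframe_agent

def make_table_agent(llm, table, verbose=True):
    """Materialize an Ibis table expression as pandas and wrap it in an agent."""
    return create_pandas_dataframe_agent(llm, table.execute(), verbose=verbose)

Swap in a table from DuckDB, Snowflake, or PostgreSQL, and the function doesn’t change.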
Here’s all the code we need to give the agent our data so it’s ready to answer our questions. I set the verbose parameter to True so we can see how the agent is working.
from langchain.agents import create_pandas_dataframe_agent
ga_pd = ga.execute()
ga_pd_agent = create_pandas_dataframe_agent(chat, ga_pd, verbose=True)
Great! Now we’re ready to ask our agent some questions about our table. The first thing I wanted to know was the total number of page views on the site that day. Here’s the code I used. (As a forewarning, you’ll probably get an error if you run this.)
# this errors because the model struggles to get at the nested data in the totals column
ga_pd_agent.run('What was the total number of page views?')
If you ran the above code and got an 'Agent stopped due to iteration limit or time limit.' error, as I did, this is because our table is too complex. If you look back, you’ll see that several columns hold nested data (dictionaries, or lists of lists, in a single column). The page views are nested in a dictionary within the totals column, and the model I used timed out trying to access and analyze that data.
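One way to spot the offending columns up front (an illustrative check, not part of the original walkthrough) is to scan the schema for struct and array types:

# list the columns whose types are nested structs or arrays
import ibis.expr.datatypes as dt

nested = [
    name
    for name, typ in zip(ga.schema().names, ga.schema().types)
    if isinstance(typ, (dt.Struct, dt.Array))
]
print(nested)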
To optimize your model’s chances of success, and to save money, it’s best to wrangle your data so you provide only the columns you need, in as simple a format as possible. Ibis not only helps you access data from different sources, it’s also great for data wrangling. We’ll use it to select the relevant columns and unpack the nested data. I’ll create one table called ga_totals that contains things like visitor ID, page views, and transaction revenue. I’ll also create another table called ga_source that tells us things like the traffic source, ad content, and which continent or country each visitor came from.
# isolate the relevant columns and unnest the structs with unpack();
# calling execute() converts the result to a pandas dataframe
ga_totals = ga['date', 'fullVisitorId', 'visitNumber', 'totals'].unpack('totals').execute()

ga_source = ga['date', 'fullVisitorId', 'trafficSource', 'channelGrouping', 'geoNetwork', 'socialEngagementType'].unpack('trafficSource')
ga_source = ga_source.unpack('geoNetwork').execute()
Let’s preview the first table to see the result.
| date | fullVisitorId | visitNumber | visits | hits | pageviews | timeOnSite | bounces | transactions | transactionRevenue | newVisits | screenviews | uniqueScreenviews | timeOnScreen | totalTransactionRevenue | sessionQualityDim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| string | string | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 |
| 20160801 | 895954260133011192 | 1 | 1 | 24 | 17 | 287 | NULL | NULL | NULL | 1 | NULL | NULL | NULL | NULL | NULL |
| 20160801 | 0288478011259077136 | 1 | 1 | 24 | 18 | 962 | NULL | NULL | NULL | 1 | NULL | NULL | NULL | NULL | NULL |
| 20160801 | 6440789996634275026 | 3 | 1 | 27 | 17 | 283 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 20160801 | 8520115029387302083 | 4 | 1 | 27 | 19 | 371 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 20160801 | 6792260745822342947 | 30 | 1 | 27 | 17 | 210 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
Now we can create a new agent for each table and use them to get fast, reliable responses.
totals_agent = create_pandas_dataframe_agent(chat, ga_totals, verbose=True)
source_agent = create_pandas_dataframe_agent(chat, ga_source, verbose=True)
Let’s ask the totals_agent what the total page views are and see the response.
totals_agent.run('What was the total number of page views?')
Entering new chain...
Thought: I need to sum up the values in the 'pageviews' column of the dataframe.
Action: python_repl_ast
Action Input: df['pageviews'].sum()
Observation: 9843
Thought: I have the sum of all pageviews in the dataframe.
Final Answer: The total number of page views is 9843.
Finished chain.
'The total number of page views is 9843.'
Excellent, that works quickly and as expected. Since we set the verbose parameter to true, we can see the agent chaining thoughts together to get to the final answer. It’s a little dystopian to read, but useful for verifying that you’re getting the correct results.
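Since ga_totals is just a pandas dataframe, we can also cross-check the agent’s arithmetic directly:

# the agent's answer should match a plain pandas sum
ga_totals['pageviews'].sum()  # 9843 in the run above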
Let’s follow this up by asking what the total transaction revenue was for the day.
totals_agent.run('What was the total transaction revenue?')
Entering new chain...
Thought: I need to sum up the values in the 'transactionRevenue' column.
Action: python_repl_ast
Action Input: df['transactionRevenue'].sum()
Observation: 6116060000.0
Thought: I now know the total transaction revenue is 6116060000.0.
Final Answer: 6116060000.0
Finished chain.
'6116060000.0'
This is very helpful information that we’re able to get in seconds! (Note that the Google Analytics BigQuery export stores revenue multiplied by 10^6, so this figure corresponds to roughly $6,116.) Let’s do one more query, but this time let’s ask the source_agent. We’ll find out the main source of our visitor traffic (for example, Google, direct visits to the site, social media, etc.).
source_agent.run('What was the main source of site visits?')
Entering new chain...
Thought: We need to look at the "source" column of the dataframe and find the most common value.
Action: python_repl_ast
Action Input:
df['source'].value_counts().head(1)
Observation: (direct) 872
Name: source, dtype: int64
Thought: The most common source of site visits is "(direct)".
Final Answer: The main source of site visits is "(direct)".
Finished chain.
'The main source of site visits is "(direct)".'
Most visitors came directly to the site!
Wrapping Up
The ability to chat directly to data stored anywhere is powerful and can increase productivity dramatically. Agents like the ones we just created can be automated to send daily, weekly, or monthly digests with minimal effort. Using LangChain and Ibis together means anyone can go from raw data to meaningful insights in a few short sentences. If you’d like to try this out yourself you can find the notebook here.
Voltron Data designs and builds composable enterprise data systems using open source tools like Ibis. If you’re interested in learning more, check out our product page.
Photo by Ricardo Gomez Angel