Crafting Your Own AI: A Journey into Personalized ChatGPT Integration


Introduction

ChatGPT has rapidly become a global sensation. Its strength is broad general knowledge, but that knowledge is limited to publicly accessible internet data from before 2021. It cannot access your personal data or incorporate newer sources.

Imagine the possibilities if it could!

This blog offers a step-by-step guide to customizing your own ChatGPT model to focus on a particular data set, with an accompanying GitHub repository that provides the essential code references. The primary focus is on handling text data; for integrating natural language processing with other data sources, refer to the linked tutorials.

High Level Overview

Essentially, there are two main components in configuring ChatGPT for your specific dataset: (1) ingesting the data, and (2) running a chatbot over that data. We'll delve into the key steps required for each component.


Ingestion of data

Ingestion involves several steps:

  1. Load data sources to text: this involves loading your data from arbitrary sources into text in a form that can be used downstream.
  2. Chunk text: this involves chunking the loaded text into smaller chunks. This is necessary because language models generally have a limit to the amount of text (tokens) they can deal with. "Chunk size" is something to be tuned over time.
  3. Embed text: this involves creating a numerical embedding for each chunk of text. This is necessary because we only want to select the most relevant chunks of text for a given question, and we will do this by finding the most similar chunks in the embedding space.
  4. Load embeddings to vectorstore: this involves putting embeddings and documents into a vectorstore. Vectorstores help us find the most similar chunks in the embedding space quickly and efficiently.

LangChain strives to be modular, so each of these steps is straightforward to swap out for other components or approaches.
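As a preview, here is a minimal sketch of the four steps wired together, before we unpack each one below. It assumes langchain, faiss-cpu, and an OPENAI_API_KEY are set up; the file name is just an example.

import pickle

from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 1. Load data sources to text
raw_documents = UnstructuredFileLoader("state_of_the_union.txt").load()
# 2. Chunk text
documents = CharacterTextSplitter(chunk_size=600, chunk_overlap=100).split_documents(raw_documents)
# 3 & 4. Embed each chunk and load the embeddings into a vectorstore
vectorstore = FAISS.from_documents(documents, OpenAIEmbeddings())
with open("vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)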

Querying of Data

This can also be broken down into a few steps. The high level steps are:

  1. Get input from the user: we'll use a web interface and a CLI to receive the user's questions about the documents.
  2. Combine that input with chat history: we'll combine chat history and a new question into a single standalone question. This is often necessary because we want to allow for the ability to ask follow up questions (an important UX consideration).
  3. Lookup relevant documents: using the vectorstore created during ingestion, we will look up relevant documents for the answer.
  4. Generate a response: Given the standalone question and the relevant documents, we will use a language model to generate a response.

In this post, we'll explore some of the design decisions you have around history, prompts, and the chat experience. We won't touch on deployment; for more information, see the deployment guide.
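For orientation, here is a minimal sketch of that query loop over the vectorstore built during ingestion (a simplified stand-in for the repo's cli_app.py; the question is illustrative):

import pickle

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

with open("vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=vectorstore.as_retriever(),
)

chat_history = []  # steps 2-4 (condense, lookup, generate) happen inside the chain call
question = "What did the president say about Ketanji Brown Jackson?"
result = chain({"question": question, "chat_history": chat_history})
print(result["answer"])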

Step by Step Details

This section dives into more detail on the steps necessary to ingest data.


See the LangChain documentation (and its over 120 data loaders) for more information about document loaders.

Load data

First, we need to load data into a standard format. In LangChain, a Document consists of (1) the text itself and (2) any metadata associated with that text (where it came from, etc.). This metadata is often critical for understanding and communicating context, whether for testing or for the end user.
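A hand-built example of this structure (the content and metadata here are illustrative):

from langchain.schema import Document

doc = Document(
    page_content="The quick brown fox jumped over the lazy dog.",
    metadata={"source": "example.txt", "page": 1},
)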

  • Text files: the lines below load the relevant documents from a plain-text file.
from langchain.document_loaders import UnstructuredFileLoader

print("Loading data...")
loader = UnstructuredFileLoader("state_of_the_union.txt")
raw_documents = loader.load()
  • PDFs: Using the PyPDF loader to load PDFs, showcasing how each page of a PDF becomes a unique document.
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
page = pages[0]
print(page.page_content[0:500])

MachineLearning-Lecture01
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is ju st spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning.
By way of introduction, my name's Andrew Ng and I'll be instru ctor for this class. And so I personally work in machine learning, and I' ve worked on it for about 15 years now, and I actually think that machine learning i
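Each page also carries metadata you can surface later, for example in citations. A quick check (the output shown is illustrative of PyPDFLoader's source and page fields):

print(page.metadata)
# {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}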

  • YouTube Content: Use the YouTube audio loader and OpenAI's Whisper parser to convert YouTube audio into text format.
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=JL5fqJuklC8"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=JL5fqJuklC8
[youtube] JL5fqJuklC8: Downloading webpage
[youtube] JL5fqJuklC8: Downloading ios player API JSON
[youtube] JL5fqJuklC8: Downloading android player API JSON
[youtube] JL5fqJuklC8: Downloading m3u8 information
[info] JL5fqJuklC8: Downloading 1 format(s): 140
[download] Destination: docs/youtube//Sales Management - Các điểm nổi bật của phần mềm.m4a
[download] 100% of 17.70MiB in 00:00:05 at 3.51MiB/s
[FixupM4a] Correcting container of "docs/youtube//Sales Management - Các điểm nổi bật của phần mềm.m4a"
[ExtractAudio] Not converting audio docs/youtube//Sales Management - Các điểm nổi bật của phần mềm.m4a; file is already in target format m4a
Transcribing part 1!

docs[0].page_content[0:1500]

(Transcript translated from the original Vietnamese:) 'With the development of technology, every aspect of life has changed. Technology not only creates astonishing advances but also opens up new opportunities. Through smartphones, we can get work done with a single touch, handling things easily anytime, anywhere. The FMCG industry is no exception: competitors continuously improve and enhance their management capabilities. Our company also follows the retail execution model, using DMS software as the central hub and data entry point for everyone involved in operations. However, the current data is only in raw form and must go through many processing steps to produce reports that meet diverse, complex needs specific to each department. In addition, revenue targets are not aligned with reality and are still set by intuition. Managers lack tools to monitor and evaluate staff. Report processing is slow, and Excel skills are uneven. Sales routes and visit frequencies are not yet well planned. In that context, we need to build a system of tools that provides in-depth reports on market health, helping managers and related departments assess the actual situation and make timely decisions. This is especially true for the field sales force, who focus entirely on their work in the field. Therefore, researching and developing software and applications to serve the field force and all levels of management has become more essential and important than ever. On that idea, the software will build'

  • Web Content: loading internet URLs; the raw output usually needs post-processing to be workable (see the sketch below).
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/code-of-conduct.md")
docs = loader.load()
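A minimal clean-up sketch: page text scraped this way is full of whitespace runs and navigation text, so at the very least collapse the whitespace before chunking.

# Collapse runs of spaces and newlines left over from the HTML extraction.
cleaned = " ".join(docs[0].page_content.split())
print(cleaned[:200])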
  • Notion Data: Export data from Notion databases and load it using the Notion directory loader.
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

Split Text


Splitting documents into smaller chunks for input to the model is critical for getting relevant information back from our chatbot. When chunks are too big, you'll feed irrelevant information to the model. Conversely, when they're too small, you won't include enough information, and the model may be confused about what is actually relevant.

Choosing a chunk size isn't an exact science, so you'll have to experiment to see what gives good results.

print("Splitting text...")
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=600,
    chunk_overlap=100,
    length_function=len,
)
documents = text_splitter.split_documents(raw_documents)
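CharacterTextSplitter splits on a single separator; if your documents lack clean paragraph breaks, LangChain's RecursiveCharacterTextSplitter falls back through a list of separators instead. A sketch with the same sizes:

from langchain.text_splitter import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # try paragraphs, then lines, then words
    chunk_size=600,
    chunk_overlap=100,
)
documents = recursive_splitter.split_documents(raw_documents)  # drop-in alternative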
  • Markdown Header Text Splitter: This splitter is unique as it adds metadata from headers and subheaders to the chunks, enhancing the context and relevance of the data.
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = (
    "# Title\n\n"
    "## Chapter 1\n\n"
    "Hi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\n"
    "Hi this is Lance\n\n"
    "## Chapter 2\n\n"
    "Hi this is Molly"
)

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

md_header_splits[0]

Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

Create embeddings and store in vectorstore


Now that we have small chunks of text, we need to create an embedding for each chunk and store it in a vectorstore. Embeddings give us an efficient way to store the text and to query the store for documents relevant to a question.

Here we use OpenAI's embeddings and a FAISS vectorstore, saving the result as a Python pickle file for later use.

print("Creating vectorstore...")
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
with open("vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)
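Pickling is fine for quick experiments; FAISS vectorstores also expose save_local/load_local, which you may prefer for anything longer-lived (a sketch, with an illustrative directory name):

vectorstore.save_local("vectorstore_index")
# later, or in another process:
vectorstore = FAISS.load_local("vectorstore_index", OpenAIEmbeddings())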

Run this code to create the vectorstore. You'll need to re-run it whenever you change how you split the text, add documents, or load new data.

  • Utilizing Vector Stores: Chroma, for example, is a lightweight, in-memory vector store chosen here for its ease of use and its suitability for small datasets; LangChain integrates over 30 different vector stores for various needs.
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

persist_directory = 'docs/chroma/'

embedding = OpenAIEmbeddings()
# Opens a persistent on-disk store; the example below instead builds a
# fresh in-memory store from a few texts.
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

question = "Tell me about all-white mushrooms with large fruiting bodies"

smalldb.similarity_search(question, k=2)
[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]
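Note how plain similarity search returns two near-duplicate Amanita sentences. LangChain vectorstores also support maximal marginal relevance (MMR) search, which trades pure relevance for diversity among the results; a sketch:

# fetch_k candidates are retrieved, then k mutually diverse results are kept
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)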

Query data


Now that we've ingested the data, we can use it in a chatbot interface. To do this, we will use the ConversationalRetrievalChain.

There are several different options when it comes to querying the data. Do you allow follow-up questions? Do you want to include other user context? There are lots of design decisions, and below we'll discuss some of the most critical.

Do you want to have conversation history?

This is table stakes from a UX perspective because it allows for follow-up questions. Adding memory is simple: one option is a built-in memory module.

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
retriever = load_retriever()  # repo helper: loads vectorstore.pkl and returns a retriever
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True)
# If you don't want memory, use RetrievalQA instead; you'll also have to change
# app.py or cli_app.py to include `query` in the input instead of `question`:
# model = RetrievalQA.from_llm(llm=llm, retriever=retriever)
model = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory)

Alternatively, you can manage the chat history yourself and pass it in with each call (the citing-sources example below does exactly this). Run this example from the GitHub repo with the following, then read the code in query_data.py.

python cli_app.py

Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic):
Chat with your docs!
---------------
Your Question:  (what did the president say about ketanji brown?):
Answer: The President nominated Ketanji Brown Jackson to serve on the United States Supreme Court, describing her as one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence. He also mentioned that she
is a former top litigator in private practice, a former federal public defender, and comes from a family of public school educators and police officers. He referred to her as a consensus builder and noted that since her nomination, she
has received a broad range of support from various groups, including the Fraternal Order of Police and former judges appointed by both Democrats and Republicans.
---------------

Do you want to customize the QA prompt?

You can easily customize the QA prompt by passing in a prompt of your choice. The experience is similar across most chains in LangChain.

template = """You are an AI assistant for answering questions about the most recent state of the union address.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about the most recent state of the union, politely inform them that you are tuned to only answer questions about the most recent state of the union.
Lastly, answer the question as if you were a pirate from the south seas and are just coming back from a pirate expedition where you found a treasure chest full of gold doubloons.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""

from langchain.prompts import PromptTemplate

QA_PROMPT = PromptTemplate(template=template, input_variables=[
                           "question", "context"])
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
retriever = load_retriever()
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True)
model = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": QA_PROMPT})

Run this example from the GitHub repo with the following, then read the code in query_data.py.

python cli_app.py
Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic): custom_prompt
Chat with your docs!
---------------
Your Question:  (what did the president say about ketanji brown?):
Answer: Arr matey, the cap'n, I mean the President, he did speak of Ketanji Brown Jackson, he did. He nominated her to the United States Supreme Court, he did, just 4 days before his address. He spoke highly of her, he did, callin' her
one of the nation's top legal minds. He believes she'll continue Justice Breyer’s legacy of excellence, he does.

She's been a top litigator in private practice, a federal public defender, and comes from a family of public school educators and police officers. She's a consensus builder, she is. Since her nomination, she's received support from all
over, from the Fraternal Order of Police to former judges appointed by both Democrats and Republicans. So, that's what the President had to say about Ketanji Brown Jackson, it is.
---------------
Your Question:  (what did the president say about ketanji brown?): who did she succeed?
Answer: Arr matey, ye be askin' about who Judge Ketanji Brown Jackson be succeedin'. From the words of the President himself, she be takin' over from Justice Breyer, continuin' his legacy of excellence on the United States Supreme
Court. Now, let's get back to countin' me gold doubloons, aye?
---------------

Do you expect long conversations?


If so, you're going to want to condense previous questions and history to add context to the prompt. If you embed the whole chat history along with the new question when looking up relevant documents, you may pull in documents that are no longer relevant to the conversation (if the new question is not related at all). Therefore, this step of condensing the chat history and the new question into a standalone question is very important.

_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.
You can assume the question is about the most recent state of the union address.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)


llm = ChatOpenAI(model_name="gpt-4", temperature=0)
retriever = load_retriever()
memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True)
# see: https://github.com/langchain-ai/langchain/issues/5890
model = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    combine_docs_chain_kwargs={"prompt": QA_PROMPT}) # includes the custom prompt as well

Read the code in query_data.py for some example code to apply to your own projects.

Do you want the model to cite sources?


LangChain can return the source documents used to produce an answer. There's a lot you can do here: you can add your own metadata, your own sections, and other relevant information so the most useful context comes back with each query.

llm = ChatOpenAI(model_name="gpt-4", temperature=0)
retriever = load_retriever()
history = []
model = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True)

def model_func(question):
    # bug: this doesn't work with the built-in memory
    # see: https://github.com/langchain-ai/langchain/issues/5630
    new_input = {"question": question['question'], "chat_history": history}
    result = model(new_input)
    history.append((question['question'], result['answer']))
    return result

model_func({"question":"some question you have"})
# this is the same interface as all the other models.

Run this example from the GitHub repo with the following, then read the code in query_data.py.

python cli_app.py
Which QA model would you like to work with? [basic/with_sources/custom_prompt/condense_prompt] (basic): with_sources
Chat with your docs!
---------------
Your Question:  (what did the president say about ketanji brown?):
Answer: The President nominated Ketanji Brown Jackson to serve on the United States Supreme Court, describing her as one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence. He also mentioned that she
is a former top litigator in private practice, a former federal public defender, and comes from a family of public school educators and police officers. Since her nomination, she has received a broad range of support, including from the
Fraternal Order of Police and former judges appointed by both Democrats and Republicans.
Sources:
state_of_the_union.txt
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
state_of_the_union.txt
As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military
justice.
state_of_the_union.txt
But in my administration, the watchdogs have been welcomed back.

We’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.

By the end of this year, the deficit will be down to less than half what it was before I took office.

The only president ever to cut the deficit by more than one trillion dollars in a single year.

Lowering your costs also means demanding more competition.
state_of_the_union.txt
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of
support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.
---------------
Your Question:  (what did the president say about ketanji brown?): where did she work before?
Answer: Before her nomination to the United States Supreme Court, Ketanji Brown Jackson worked as a Circuit Court of Appeals Judge. She was also a former top litigator in private practice and a former federal public defender.
Sources:
state_of_the_union.txt
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
state_of_the_union.txt
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of
support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.
state_of_the_union.txt
We cannot let this happen.

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for
your service.
state_of_the_union.txt
Vice President Harris and I ran for office with a new economic vision for America.

Invest in America. Educate Americans. Grow the workforce. Build the economy from the bottom up and the middle out, not from the top down.

Because we know that when the middle class grows, the poor have a ladder up and the wealthy do very well.

America used to have the best roads, bridges, and airports on Earth.

Now our infrastructure is ranked 13th in the world.

We won’t be able to compete for the jobs of the 21st Century if we don’t fix that.
---------------

Language Model

The final lever to pull is which language model powers your chatbot. In my example I use the OpenAI LLM, but it can easily be swapped for any other language model that LangChain supports, or you can even write your own wrapper.
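For instance, a sketch swapping in a Hugging Face Hub model (this assumes a HUGGINGFACEHUB_API_TOKEN is set; the model id is purely illustrative):

from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="google/flan-t5-xl",          # illustrative model id
    model_kwargs={"temperature": 0.1},
)
# `llm` is then a drop-in replacement in the chains shown above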

Putting it all together

After making all the necessary customizations and running python ingest_data.py, you can interact with the chatbot.

I've exposed a really simple interface for this: just run python cli_app.py and you'll get a terminal prompt where you can ask questions and get back answers. Try it out!

Happy reading