Training a Large Language Model on your content.
TLDR: There are a number of approaches to getting Large Language Models to use your own private content. Increasingly large model contexts, together with embeddings and semantic search, allow content to be injected into each model query. Such solutions can be very effective in teaching an LLM about private data it hasn't previously been trained on.
NEWS FLASH: OpenAI have announced their "plugins", effectively adopting the very architecture described here. If you're interested in OpenAI's plugins, read on: the approach I describe here is the same one that plugins use. Both rely on embeddings, vector databases and semantic search.
A former colleague contacted me this week to ask this question. I thought it a good one and worthy of a longer and more public answer than a direct reply. You could say that this post is dedicated to you, Jeremy! I hope both you and others find it interesting and useful.
Large Language Models (LLMs) like ChatGPT are trained on vast sets of natural language text. The benefit of these vast training sets is that the resultant model is pretty good at a wide variety of tasks. But there will be topics that an off-the-shelf LLM isn't able to answer: maybe the data isn't publicly available and so couldn't have been ingested by the model, or maybe it's just a very niche topic that isn't well represented in the public data sources. Hence the desire to "teach" these models about such topics.
Prompt Engineering
When we think of teaching a model about particular topics, the first technique that I would think of is augmenting the prompt that's sent to the model.
Rather than just send a raw question to the model, we can augment that question with some additional context and instructions to encourage a more useful response.
Sometimes instructions in the prompt are all that's needed in order to focus the model on what we need. But we can also insert additional contextual information, for example:
You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:
CONTEXT:
The bank offers the following products:
- A current account with an interest rate of 1.2%.
- An instant-access savings account with an interest rate of 2.3%.
- A savings account with a 30 day notice period and with an interest rate of 3.4%.
QUESTION:
{question}
By injecting information about the bank's products into the prompt, we give the model the information needed to answer the product questions that the base model cannot.
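As a minimal sketch of how this might look in code, the template above can be filled in with ordinary Python string formatting. The names `BANK_PROMPT`, `PRODUCTS` and `build_prompt` are my own illustrative choices, not part of any library, and the resulting string would still need to be sent to whichever model API you use.

```python
from typing import Sequence

# Illustrative template; the wording mirrors the example prompt above.
BANK_PROMPT = """You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:

CONTEXT:
{context}

QUESTION:
{question}"""

# The product descriptions from the example above.
PRODUCTS = [
    "A current account with an interest rate of 1.2%.",
    "An instant-access savings account with an interest rate of 2.3%.",
    "A savings account with a 30 day notice period and with an interest rate of 3.4%.",
]


def build_prompt(question: str, products: Sequence[str]) -> str:
    """Insert the product list and the user's question into the template."""
    bullet_list = "\n".join(f"- {product}" for product in products)
    context = "The bank offers the following products:\n" + bullet_list
    return BANK_PROMPT.format(context=context, question=question)


print(build_prompt("What rate does the instant-access savings account pay?", PRODUCTS))
```

Keeping the template separate from the data means the product list can change without touching the instructions.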
With the original GPT-3 models the context size is about 2k tokens, and with ChatGPT (gpt-3.5-turbo) it is about 4k.
With the GPT-4 base model, that grows to 8k tokens.
With GPT-4 there's even an option for a much larger 32k context, with 32k tokens being roughly equivalent to 50 pages of text. That's a lot of context we can give it, probably more than our mythical bank has to describe its products.
Passing such a large context in each API call feels a bit profligate. But for many uses this will at least prove a feasible approach and so I introduce it as the first possible solution.
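To make the "pages" estimate concrete, here is a rough back-of-the-envelope calculation. It assumes the common rule of thumb of about 0.75 English words per token and about 500 words per page. Both figures are my assumptions rather than exact values, and real token counts depend on the tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75  # assumption: rule of thumb for English text
WORDS_PER_PAGE = 500    # assumption: a typical single-spaced page


def approx_pages(context_tokens: int) -> float:
    """Estimate how many pages of English text fit in a context window."""
    return context_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE


for tokens in (8_000, 32_000):
    print(f"{tokens} tokens is roughly {approx_pages(tokens):.0f} pages")
```

Under these assumptions a 32k context works out at about 48 pages, which is in line with the "roughly 50 pages" figure above.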
Applying guardrails
A big concern amongst many is the risk that an LLM might answer a wide variety of interesting, but ultimately brand-risky, questions. Not many banks would be comfortable with their system answering questions like "What do you think of Donald Trump?".
But no sweat, we can address this in the prompt as well:
You are a chatbot for a bank. Your job is to answer questions about the bank's products. Use the following context to answer questions. If the question is not related to the context, you should reply "I'm a large language model trained to answer questions about our bank's products, so am not able to answer that question".
CONTEXT ...
You may have heard people talk about applying "guardrails" to models, and prompt instructions like these are a common way of doing it.
Semantic Search
What if, instead of trying to pass our entire training set to each API call, we instead identify which parts of that training set are potentially useful and just pass those? Perhaps the user isn't asking about all of the bank's products, but instead savings accounts in particular. That means we can prune what context we send to the model. This approach is becoming known as the "Semantic Search" pattern.
With a Semantic Search approach, we first find the pieces of our training data that might help answer the user's question and pass only those to the model to formulate an answer.
Chunking
To implement a semantic search pattern we first need to chunk up our training data into small pieces (say, about 1,000 tokens for each chunk). There's a variety of different strategies for doing this, but broadly we're looking to create lots of pieces of text that hopefully have some coherence to them. If we're lucky, we might find that each paragraph in our source maps to a chunk. If we're unlucky, the paragraphs might be too large and so need splitting up. There are, of course, standard libraries for doing this.
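To give a feel for what such a library does, here is a minimal sketch of one possible chunking strategy: keep whole paragraphs together where they fit, and split any paragraph that is too large on its own. The function name `chunk_text` is hypothetical, and it counts words as a stand-in for tokens (an assumption; a real implementation would count tokens with the model's tokenizer).

```python
def chunk_text(text, max_words=750):
    """Split text into chunks of at most max_words words, preferring paragraph boundaries."""
    chunks = []
    current = []  # words collected for the chunk being built
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # A paragraph that is too large on its own is cut into fixed-size pieces.
        if len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
            continue
        # Start a new chunk if adding this paragraph would exceed the limit.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "First paragraph here.\n\nSecond one.\n\nA third, somewhat longer paragraph of text."
for chunk in chunk_text(sample, max_words=6):
    print(repr(chunk))
```

Production splitters usually also overlap neighbouring chunks slightly so that a sentence cut at a boundary still appears whole in one of them. This sketch leaves that out for brevity.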
Embeddings
How do we decide which chunks of our text are relevant to a given question? We do this by first of all using a machine learning model to create an "embedding" for each chunk.
An embedding is simply a mathematical representation of the statistical pattern of words and characters in each chunk. Machine learning models have a deep understanding of those patterns and use that understanding to create each embedding for us.
In mathematical terms an embedding is just a vector. A lot of us will have learnt about vector maths in secondary school and it's a relatively easy mathematical concept. But we only need to understand vectors if we're interested in the mechanics. For the most part we only need to receive and store the vectors; the maths is done for us by libraries and models.
There are lots of ways to create embeddings, but one of the simplest is to use the OpenAI embeddings API. This uses GPT-3 to create an embedding and you don't need to know much more than that! All the underlying maths is done for you: we pass our text and get an embedding back. Simple!
Vector Stores
Once we've got embeddings for our chunks of text, we then store them in a vector store.
Vector stores are highly specialised databases that store our embeddings and allow us to run algorithms against them (we'll come on to what those algorithms are in a moment). Chroma, used in the example later in this post, is one such store.
What differentiates a vector store from a traditional database is that it includes the ability to run a similarity match to compare an embedding that represents a user's question with those in our vector store. The similarity search identifies the top n embeddings, which represent the pieces of our original text that can likely answer the user's question.
Cosine Similarity
For the mathematically oriented, a similarity search most often uses a cosine similarity algorithm. This is a very common algorithm that any programmer could write in just a few lines of code; if you ask ChatGPT, it'll write it for you.
But why bother? A good vector store includes this logic and does the matching for you without you needing to write a single line of code.
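For the curious, here is roughly what that logic looks like: a sketch of cosine similarity plus a "top n" search over a handful of stored vectors. The three-dimensional vectors are hand-made stand-ins for real embeddings (which have hundreds or thousands of dimensions), and the names `cosine_similarity` and `top_n` are illustrative only. A real vector store would also use indexing to avoid comparing against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def top_n(query_vec, stored, n=2):
    """Return the n (similarity, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]


# Hand-made vectors standing in for real embeddings (an assumption for illustration).
stored = [
    ("savings account", [0.9, 0.1, 0.0]),
    ("current account", [0.7, 0.3, 0.1]),
    ("branch opening hours", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]

for score, text in top_n(query, stored):
    print(f"{score:.3f}  {text}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0, which is why the two account chunks rank above the opening hours chunk for this query.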
Similarity Search with Embeddings is better
It's important to understand that a similarity search that uses embeddings isn't looking at keywords or even doing a fuzzy match, both of which are poor ways of reliably comparing text. Instead, it's comparing the mathematical representation of the statistical patterns in the original text. All the comparison happens in the mathematical domain, not the language domain. This is super important because the result is that we get a highly efficient match that's not thrown off by misspellings, typos, or even the use of different words.
An embedding similarity search is likely to class the phrases "PC", "computer" and "Apple Mac" as being similar, even though there are no common words. That's the beauty of embeddings and the machine learning models used to create them! Models like GPT-3 have learnt that those words are similar and that knowledge is "embedded" in the embeddings they create.
Prompt Engineering
Once our vector store has found the chunks of our original document(s) that likely provide an answer to the user's question, it's a simple matter of adding that contextual information to a prompt and sending that to a generative AI completion endpoint, such as the ChatGPT API.
For example, we might have a prompt template something like this, into which we insert the user's question and the pieces of contextual text that we got from our vector store:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
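Filling that template might look something like the sketch below. It assumes the chunks arrive ordered most relevant first, and it stops adding chunks once a word budget is reached so the prompt stays within the model's context. The function name `build_qa_prompt` and the `max_context_words` budget are my own illustrative choices, not part of any library.

```python
QA_TEMPLATE = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""


def build_qa_prompt(chunks, question, max_context_words=2000):
    """Insert as many retrieved chunks as fit the word budget, plus the question."""
    selected = []
    used = 0
    for chunk in chunks:  # assumed to be ordered most relevant first
        words = len(chunk.split())
        if used + words > max_context_words:
            break  # the budget is exhausted, so drop the less relevant remainder
        selected.append(chunk.strip())
        used += words
    context = "\n\n".join(selected)
    return QA_TEMPLATE.format(context=context, question=question)


retrieved = [
    "An instant-access savings account with an interest rate of 2.3%.",
    "A savings account with a 30 day notice period and with an interest rate of 3.4%.",
]
print(build_qa_prompt(retrieved, "Which savings account pays the most interest?"))
```

The string this returns is what gets sent to the completion endpoint.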
The beauty of this approach is that ChatGPT takes the pieces of context and rewords them to create a coherent answer to the question. In previous generations of technology we'd probably have returned snippets of text from the source and hoped that the user could make sense of them. But that approach isn't at all natural and requires the user to piece together evidence from multiple places.
Using the approach I've described, we get a very natural conversation. Answers directly reference the question and different pieces of evidence are neatly summarised and integrated into a single coherent reply.
As a human, if you ask me a question I'm not going to just recite a set of references in reply. Instead, I will try to summarise those references and offer my interpretation of them and how they apply to the specifics of the question. That's just what ChatGPT does for us using the approach I've described.
Bing
Let's pause for a moment to think about how Microsoft Bing uses GPT-4.
Bing issues a search query, finds pieces of context and adds those to the GPT-4 prompt, from which an answer is constructed. In other words, Bing uses a very similar approach to the one I've described above.
If you want a version of Bing where the content is your content, rather than web searches, now you know how to build it.
An Example
Here's a screen-grab of an implementation of the architecture discussed in this post. The content the system was "trained on" is US President Biden's State of the Union speech. As you can see, I've asked it what was said about covid. The GPT-4 API has done a really nice job of glueing the context into a cohesive answer to the question. The wording it's provided isn't the wording of the original speech: it's taken those original words and created a natural reply to the specific question I asked.
Langchain
I built that example using a useful library called Langchain that's gaining a lot of popularity. Langchain does all the tedious work of making the API calls to the models, interacting with the vector store and injecting context into prompts.
import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
# OpenAI key
os.environ["OPENAI_API_KEY"] = "YOUR KEY"
# Load the source text and split into chunks
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
# Create embeddings and simple memory-only vector store
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)
# Create prompt and generate response
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)
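query = "what was said about covid?"
# Retrieve only the chunks most similar to the query, rather than passing every chunk
relevant_docs = db.similarity_search(query)
result = chain({"input_documents": relevant_docs, "question": query}, return_only_outputs=True)
print(result["output_text"])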
The above code is a very simple implementation of vector search using the ChatGPT API, Langchain and Chroma. As you can see, it's pretty easy: there's no complicated maths or obscure algorithms to juggle with.
Fine Tuning
I should mention that there's another way of influencing LLMs, which is known as fine tuning. This approach takes our data and uses it to implement an additional layer of training onto the core model. Instead of injecting information into the prompt, fine tuning embeds it in a copy of the original model that you then use. It's possible to fine tune open source models and OpenAI provides a fine tuning API for its earlier models. However, neither ChatGPT nor GPT-4 can currently be fine tuned.
Fine tuning is a very different approach to the one I've described here and merits a separate discussion, so I won't go into more detail in this post. I found this post a good description of how to go about fine tuning.
In my experience, injecting information into the prompt works extremely well and has the advantage of working across all the OpenAI models and even models from other providers like Cohere and AI21Labs. It might be worth experimenting with both approaches to see what works best for you.
Summary
We can build our own ChatGPT-like service that, instead of just using ChatGPT's knowledge, uses our own private source of information.
We can use GPT-3's Embeddings API to drive the search for pieces of text from a Vector Store that are likely to help answer our user's question.
Then, we can use the ChatGPT API to piece this all together and construct a coherent answer to the user's question.
We can apply "guardrails" to the model's behaviour through the prompt template. And we can do the same to our user's behaviour by checking their input for inappropriate content with the Moderation endpoint.
This approach is hugely AI centric, but it is an example of an AI application architecture that's very definitely not "throw every question at the model and pray". Instead, we use a variety of different models/endpoints, together with a Vector Store and prompt templates, to drive a more sophisticated use of the underlying technology. The end result is the ability to build something that exploits niche or private content that the public models don't know about.