Training a Large Language Model on your content.

Duncan Anderson
Barnacle Labs
Mar 20, 2023 · 10 min read

👨🏼‍🎓➡️👨🏼‍💼 TLDR: There are a number of approaches to getting Large Language Models to use your own private content. Increasingly large model contexts, together with the use of embedding and semantic search solutions, allow content to be injected into each model query. Such solutions can be very effective in teaching an LLM about private data it hasn't previously been trained on.

📣 NEWS FLASH: OpenAI have announced their "plugins", effectively adopting the very architecture described here. If you're interested in OpenAI's plugins, read on: the approach I describe here is the same as plugins. Both use embeddings, vector databases and semantic search.

A former colleague contacted me this week to ask this question. I thought it a good one and worthy of a longer and more public answer than a direct reply. You could say that this post is dedicated to you, Jeremy! I hope both you and others find it interesting and useful.

Large Language Models (LLMs) like ChatGPT are trained on vast sets of natural language text. The benefit of these vast training sets is that the resultant model is pretty good at a wide variety of tasks. But there will be topics that an off-the-shelf LLM isn't able to answer: maybe the data isn't publicly available and so couldn't have been ingested by the model, or maybe it's just a very niche topic that isn't well represented in the public data sources. Hence the desire to "teach" these models about such topics.

Prompt Engineering

When we think of teaching a model about particular topics, the first technique that comes to mind is augmenting the prompt that's sent to the model.

Rather than just send a raw question to the model, we can augment that question with some additional context and instructions to encourage a more useful response.

Sometimes instructions in the prompt are all that's needed in order to focus the model on what we need. But we can also insert additional contextual information, for example:

You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:

CONTEXT:
The bank offers the following products:
- A current account with an interest rate of 1.2%.
- An instant-access savings account with an interest rate of 2.3%.
- A savings account with a 30 day notice period and with an interest rate of 3.4%.

QUESTION:
{question}

By injecting information about the bank's products into the prompt, we give the model the information needed to answer the product questions that the base model cannot.
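
To make this concrete, here's a minimal sketch of how a prompt like the one above might be sent to the ChatGPT API using the openai Python package. The question and product details are illustrative placeholders, not part of any real system:

import openai

openai.api_key = "YOUR KEY"

# The instructions and product context are injected as the system message
system_prompt = """You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:

CONTEXT:
The bank offers the following products:
- A current account with an interest rate of 1.2%.
- An instant-access savings account with an interest rate of 2.3%.
- A savings account with a 30 day notice period and with an interest rate of 3.4%.
"""

question = "What savings accounts do you offer?"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)

print(response["choices"][0]["message"]["content"])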

With the original GPT-3 models the context size is roughly 2k tokens, while ChatGPT (gpt-3.5-turbo) allows around 4k.

With the GPT-4 base model, that grows to 8k tokens.

With GPT-4 there's even an option for a much larger 32k context... 32k being roughly equivalent to 50 pages of text. That's a lot of context we can give it, probably more than our mythical bank needs to describe its products.

Passing such a large context in each API call feels a bit profligate. But for many uses this will at least prove a feasible approach and so I introduce it as the first possible solution.

Applying guardrails

A big concern amongst many is the risk that an LLM might answer a wide variety of interesting, but ultimately brand-risky, questions. Not many banks would be comfortable with their system answering questions like "What do you think of Donald Trump?".

But no sweat, we can address this in the prompt as well:

You are a chatbot for a bank. Your job is to answer questions about the bank's products. Use the following context to answer questions. If the question is not related to the context, you should reply "I'm a large language model trained to answer questions about our bank's products, so am not able to answer that question".

CONTEXT: ...

You may have heard people talk about applying "guardrails" to models, and this is a common way of doing so.

Semantic Search

What if, instead of trying to pass our entire training set to each API call, we identify which parts of that training set are potentially useful and pass just those? Perhaps the user isn't asking about all of the bank's products, but about savings accounts in particular. That means we can prune the context we send to the model. This approach is becoming known as the "Semantic Search" pattern.

With a Semantic Search approach, we first find the pieces of our training data that might help answer the user's question and pass only those to the model to formulate an answer.

Chunking

To implement a semantic search pattern we first need to chunk up our training data into small pieces (say, about 1,000 tokens for each chunk). There's a variety of different strategies for doing this, but broadly we're looking to create lots of pieces of text that hopefully have some coherence to them. If we're lucky, we might find that each paragraph in our source maps to a chunk. If we're unlucky, the paragraphs might be too large and so need splitting up. There are, of course, standard libraries for doing this.
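
As a rough sketch, here's how the chunking step might look with Langchain's CharacterTextSplitter, the same splitter used in the full example later in this post. Note that chunk_size here is measured in characters, and state_of_the_union.txt is simply the example file used later:

from langchain.text_splitter import CharacterTextSplitter

# Read the source document as one big string
with open("state_of_the_union.txt") as f:
    source_text = f.read()

# Split into overlapping chunks so context isn't lost at chunk boundaries
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_text(source_text)

print(f"Created {len(chunks)} chunks")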

Embeddings

How do we decide which chunks of our text are relevant to a given question? We do this by first of all using a machine learning model to create an "embedding" for each chunk.

An embedding is simply a mathematical representation of the statistical pattern of words and characters in each chunk. Machine learning models have a deep understanding of those patterns and use that understanding to create each embedding for us.

In mathematical terms an embedding is just a vector. A lot of us will have learnt about vector maths in secondary school and it's a relatively easy mathematical concept. But we only need to understand vectors if we're interested in the mechanics: for the most part we only need to receive and store the vectors, and the maths is done for us by libraries and models.

There are lots of ways to create embeddings, but one of the simplest is to use the OpenAI embeddings API. This uses a GPT-3-family model (text-embedding-ada-002) to create an embedding and you don't need to know much more than that! All the underlying maths is done for you: we pass our text and get an embedding back. Simple!
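
For example, here's a minimal sketch of calling the OpenAI embeddings API directly, using text-embedding-ada-002 (the embedding model OpenAI recommended at the time of writing); the input text is just an illustrative snippet:

import openai

openai.api_key = "YOUR KEY"

# Request an embedding for a single chunk of text
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="A savings account with a 30 day notice period and an interest rate of 3.4%.",
)

# The embedding is just a long list of floats (1,536 of them for this model)
embedding = response["data"][0]["embedding"]
print(len(embedding))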

Vector Stores

Once we've got embeddings for our chunks of text, we then store them in a vector store.

Vector stores are highly specialised databases that store our embeddings and allow us to run algorithms against them (we'll come on to what those algorithms are in a moment). Likely candidates for a vector store include Chroma (used in the example later in this post), Pinecone, Weaviate, Milvus and FAISS.

What differentiates a vector store from a traditional database is that it includes the ability to run a similarity match, comparing an embedding that represents a user's question with those in our vector store. The similarity search identifies the top n embeddings, which represent the pieces of our original text that can most likely answer the user's question.

Cosine Similarity

For the mathematically oriented, a similarity search most often uses a cosine similarity algorithm. This is a very common algorithm that any programmer could write in just a few lines of code; if you ask ChatGPT, it'll write it for you.
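
For example, here's roughly what it looks like in Python, sketched using numpy:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means very similar,
    # close to 0.0 means unrelated
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))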

But why bother? A good vector store includes this logic and does the matching for you without you needing to write a single line of code.

Similarity Search with Embeddings is better

It's important to understand that a similarity search that uses embeddings isn't looking at keywords or even doing a fuzzy match, both of which are poor ways of reliably comparing text. Instead, it's comparing the mathematical representation of the statistical patterns in the original text. All the comparison happens in the mathematical domain, not the language domain. This is super important because the result is a highly effective match that's not thrown off by misspellings, typos, or even the use of different words.

An embedding similarity search is likely to class the phrases "PC", "computer" and "Apple Mac" as being similar, even though there are no common words. That's the beauty of embeddings and the machine learning models used to create them! Models like GPT-3 have learnt that those words are similar and that knowledge is 'embedded' in the embeddings they create.
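
As a hedged illustration, here's roughly how you could check that for yourself by combining the embeddings API with the cosine similarity function above. The exact scores will vary, but related terms should score noticeably higher than unrelated ones:

import numpy as np
import openai

openai.api_key = "YOUR KEY"

def embed(text):
    # Create an embedding for a single piece of text
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pc, mac, banana = embed("PC"), embed("Apple Mac"), embed("banana")

print(cosine_similarity(pc, mac))     # relatively high: related concepts
print(cosine_similarity(pc, banana))  # noticeably lower: unrelated concepts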

Prompt Engineering

Once our vector store has found the chunks of our original document(s) that likely provide an answer to the user's question, it's a simple matter of adding that contextual information to a prompt and sending that to a generative AI completion endpoint, such as the ChatGPT API.

For example, we might have a prompt template something like this, into which we insert the user's question and the pieces of contextual text that we got from our vector store:

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

The beauty of this approach is that ChatGPT takes the pieces of context and rewords them to create a coherent answer to the question. In previous generations of technology we'd probably have returned snippets of text from the source and hoped that the user could make sense of them. But that approach isn't at all natural and requires the user to piece together evidence from multiple places.

Using the approach I've described, we get a very natural conversation. Answers directly reference the question, and different pieces of evidence are neatly summarised and integrated into a single coherent reply.

As a human, if you ask me a question I'm not going to just recite a set of references in reply. Instead, I will try to summarise those references and offer my interpretation of them and how they apply to the specifics of the question. That's just what ChatGPT does for us using the approach I've described.

Bing

Let's pause for a moment to think about how Microsoft Bing uses GPT-4.

Bing issues a search query, finds pieces of context and adds those to the GPT-4 prompt, from which an answer is constructed. In other words, Bing uses a very similar approach to the one I've described above.

If you want a version of Bing where the content is your content, rather than web searches, now you know how to build it.

An Example

Here's a screen-grab of an implementation of the architecture discussed in this post. The content the system was 'trained on' is US President Biden's State of the Union speech. As you can see, I've asked it what was said about covid. The GPT-4 API has done a really nice job of glueing the context into a cohesive answer to the question. The wording it's provided isn't the wording of the original speech: it's taken those original words and created a natural reply to the specific question I asked.

Langchain

I built that example using a useful library called Langchain that's gaining a lot of popularity. Langchain does all the tedious work of making the API calls to the models, interacting with the vector store and injecting context into prompts.

import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

# OpenAI key
os.environ["OPENAI_API_KEY"] = "YOUR KEY"

# Load the source text and split into chunks
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

# Create embeddings and a simple in-memory vector store (Chroma)
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)

# Create the prompt template into which the retrieved context is injected
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# Retrieve only the chunks that are relevant to the question, then pass
# those (and only those) to the model to generate an answer
chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)
query = "what was said about covid?"
relevant_docs = db.similarity_search(query)
result = chain({"input_documents": relevant_docs, "question": query}, return_only_outputs=True)
print(result["output_text"])

The above code implements a very simple vector search with the ChatGPT API, Langchain and Chroma. As you can see, it's pretty easy: there's no complicated maths or obscure algorithms to juggle with.

Fine Tuning

I should mention that there's another way of influencing LLMs, known as fine tuning. This approach takes our data and uses it to apply an additional layer of training on top of the core model. Instead of injecting information into the prompt, fine tuning embeds it in a copy of the original model that you then use. It's possible to fine tune open source models, and OpenAI provides a fine tuning API for its earlier models. However, neither ChatGPT nor GPT-4 can currently be fine tuned.

Fine tuning is a very different approach to the one I've described here and merits a separate discussion, so I won't go into more detail in this post. I found this post a good description of how to go about fine tuning.

In my experience, injecting information into the prompt works extremely well and has the advantage of working across all the OpenAI models and even models from other providers like Cohere and AI21Labs. It might be worth experimenting with both approaches to see what works best for you.

Summary

We can build our own ChatGPT-like service that, instead of just using ChatGPT's knowledge, uses our own private source of information.

We can use GPT-3's Embeddings API to drive the search for pieces of text in a Vector Store that likely help to answer our user's question.

Then, we can use the ChatGPT API to piece this all together and construct a coherent answer to the user's question.

We can apply 'guardrails' to the model's behaviour through the prompt template. And we can do the same for our users' behaviour by checking their input for inappropriate content with the Moderation endpoint.
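
As a minimal sketch, checking a user's input with OpenAI's Moderation endpoint might look like this (answer_question is a hypothetical stand-in for the semantic search pipeline described above):

import openai

openai.api_key = "YOUR KEY"

user_question = "What savings accounts do you offer?"

# Ask the Moderation endpoint whether the input breaches the content policy
moderation = openai.Moderation.create(input=user_question)

if moderation["results"][0]["flagged"]:
    reply = "I'm sorry, I'm not able to help with that."
else:
    # answer_question is a hypothetical placeholder for the semantic search
    # pipeline described earlier in this post
    reply = answer_question(user_question)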

This approach is hugely AI centric, but is an example of an AI application architecture that's very definitely not "throw every question at the model and pray". Instead, we use a variety of different models/endpoints, together with a Vector Store and prompt templates, to drive a more sophisticated use of the underlying technology. The end result is the ability to build something that exploits niche or private content that the public models don't know about.

👉🏻 Please follow me on LinkedIn for updates on Generative AI 👈🏻
