Training a Large Language Model on your content.

Duncan Anderson
Barnacle Labs
Mar 20, 2023 · 10 min read

👨🏼‍🎓➡️👨🏼‍💼 TLDR: There are a number of approaches to getting Large Language Models to use your own private content. Increasingly large model contexts, together with the use of embedding and semantic search solutions, allow content to be injected into each model query. Such solutions can be very effective in teaching an LLM about private data it hasn't previously been trained on.

📣 NEWS FLASH: OpenAI have announced their "plugins", effectively adopting the very architecture described here. If you're interested in OpenAI's plugins, read on: the approach I describe here is the same as plugins. Both use embeddings, vector databases and semantic search.

A former colleague contacted me this week to ask this question. I thought it a good one and worthy of a longer and more public answer than a direct reply. You could say that this post is dedicated to you, Jeremy! I hope both you and others find it interesting and useful.

Large Language Models (LLMs) like ChatGPT are trained on vast sets of natural language text. The benefit of these vast training sets is that the resultant model is pretty good at a wide variety of tasks. But there will be topics that an off-the-shelf LLM isn't able to answer: maybe the data isn't publicly available and so couldn't have been ingested by the model, or maybe it's just a very niche topic that isn't well represented in the public data sources. Hence the desire to "teach" these models about such topics.

Prompt Engineering

When we think of teaching a model about particular topics, the first technique that comes to mind is augmenting the prompt that's sent to the model.

Rather than just send a raw question to the model, we can augment that question with some additional context and instructions to encourage a more useful response.

Sometimes instructions in the prompt are all that's needed in order to focus the model on what we need. But we can also insert additional contextual information, for example:

You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:

CONTEXT:
The bank offers the following products:
- A current account with an interest rate of 1.2%.
- An instant-access savings account with an interest rate of 2.3%.
- A savings account with a 30 day notice period and with an interest rate of 3.4%.

QUESTION:
{question}

By injecting information about the bank's products into the prompt, we give the model the information needed to answer the product questions that the base model cannot.
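
To make this concrete, here's a minimal sketch of how a prompt like the one above might be sent to the ChatGPT API using the openai Python package. The question and product details are illustrative placeholders, not part of any real system:

import openai

openai.api_key = "YOUR KEY"

# The instructions and product context are injected as the system message
system_prompt = """You are a chat system for a bank. Your job is to answer questions about the bank's products. You are cautious never to make a recommendation, instead providing information for the user to make their own assessments. Use the following context:

CONTEXT:
The bank offers the following products:
- A current account with an interest rate of 1.2%.
- An instant-access savings account with an interest rate of 2.3%.
- A savings account with a 30 day notice period and with an interest rate of 3.4%.
"""

question = "What savings accounts do you offer?"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)

print(response["choices"][0]["message"]["content"])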

With the original GPT-3 models the context size is roughly 2k tokens, while ChatGPT (gpt-3.5-turbo) allows around 4k.

With the GPT-4 base model, that grows to 8k tokens.

With GPT-4 there's even an option for a much larger 32k context... 32k being roughly equivalent to 50 pages of text. That's a lot of context we can give it, probably more than our mythical bank needs to describe its products.

Passing such a large context in each API call feels a bit profligate. But for many uses this will at least prove a feasible approach and so I introduce it as the first possible solution.

Applying guardrails

A big concern amongst many is the risk that an LLM might answer a wide variety of interesting, but ultimately brand-risky, questions. Not many banks would be comfortable with their system answering questions like "What do you think of Donald Trump?".

But no sweat, we can address this in the prompt as well:

You are a chatbot for a bank. Your job is to answer questions about the bank's products. Use the following context to answer questions. If the question is not related to the context, you should reply "I'm a large language model trained to answer questions about our bank's products, so am not able to answer that question".

CONTEXT: ...

You may have heard people talk about applying "guardrails" to models, and this is a common way of doing so.

Semantic Search

What if, instead of trying to pass our entire training set to each API call, we identify which parts of that training set are potentially useful and pass just those? Perhaps the user isn't asking about all of the bank's products, but about savings accounts in particular. That means we can prune the context we send to the model. This approach is becoming known as the "Semantic Search" pattern.

With a Semantic Search approach, we first find the pieces of our training data that might help answer the user's question and pass only those to the model to formulate an answer.

Chunking

To implement a semantic search pattern we first need to chunk up our training data into small pieces (say, about 1,000 tokens for each chunk). There's a variety of different strategies for doing this, but broadly we're looking to create lots of pieces of text that hopefully have some coherence to them. If we're lucky, we might find that each paragraph in our source maps to a chunk. If we're unlucky, the paragraphs might be too large and so need splitting up. There are, of course, standard libraries for doing this.
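
As a rough sketch, here's how the chunking step might look with Langchain's CharacterTextSplitter, the same splitter used in the full example later in this post. Note that chunk_size here is measured in characters, and state_of_the_union.txt is simply the example file used later:

from langchain.text_splitter import CharacterTextSplitter

# Read the source document as one big string
with open("state_of_the_union.txt") as f:
    source_text = f.read()

# Split into overlapping chunks so context isn't lost at chunk boundaries
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_text(source_text)

print(f"Created {len(chunks)} chunks")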

Embeddings

How do we decide which chunks of our text are relevant to a given question? We do this by first of all using a machine learning model to create an "embedding" for each chunk.

An embedding is simply a mathematical representation of the statistical pattern of words and characters in each chunk. Machine learning models have a deep understanding of those patterns and use that understanding to create each embedding for us.

In mathematical terms an embedding is just a vector. A lot of us will have learnt about vector maths in secondary school and it's a relatively easy mathematical concept. But we only need to understand vectors if we're interested in the mechanics: for the most part we only need to receive and store the vectors, and the maths is done for us by libraries and models.

There are lots of ways to create embeddings, but one of the simplest is to use the OpenAI embeddings API. This uses a GPT-3-family model (text-embedding-ada-002) to create an embedding and you don't need to know much more than that! All the underlying maths is done for you: we pass our text and get an embedding back. Simple!
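
For example, here's a minimal sketch of calling the OpenAI embeddings API directly, using text-embedding-ada-002 (the embedding model OpenAI recommended at the time of writing); the input text is just an illustrative snippet:

import openai

openai.api_key = "YOUR KEY"

# Request an embedding for a single chunk of text
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="A savings account with a 30 day notice period and an interest rate of 3.4%.",
)

# The embedding is just a long list of floats (1,536 of them for this model)
embedding = response["data"][0]["embedding"]
print(len(embedding))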

Vector Stores

Once we've got embeddings for our chunks of text, we then store them in a vector store.

Vector stores are highly specialised databases that store our embeddings and allow us to run algorithms against them (we'll come on to what those algorithms are in a moment). Likely candidates for a vector store include Chroma (used in the example later in this post), Pinecone, Weaviate, Milvus and FAISS.

What differentiates a vector store from a traditional database is that it includes the ability to run a similarity match, comparing an embedding that represents a user's question with those in our vector store. The similarity search identifies the top n embeddings, which represent the pieces of our original text that can most likely answer the user's question.

Cosine Similarity

For the mathematically oriented, a similarity search most often uses a cosine similarity algorithm. This is a very common algorithm that any programmer could write in just a few lines of code; if you ask ChatGPT, it'll write it for you.
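
For example, here's roughly what it looks like in Python, sketched using numpy:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means very similar,
    # close to 0.0 means unrelated
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))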

But why bother? A good vector store includes this logic and does the matching for you without you needing to write a single line of code.

Similarity Search with Embeddings is better

It's important to understand that a similarity search that uses embeddings isn't looking at keywords or even doing a fuzzy match, both of which are poor ways of reliably comparing text. Instead, it's comparing the mathematical representation of the statistical patterns in the original text. All the comparison happens in the mathematical domain, not the language domain. This is super important because the result is a highly effective match that's not thrown off by misspellings, typos, or even the use of different words.

An embedding similarity search is likely to class the phrases "PC", "computer" and "Apple Mac" as being similar, even though there are no common words. That's the beauty of embeddings and the machine learning models used to create them! Models like GPT-3 have learnt that those words are similar and that knowledge is 'embedded' in the embeddings they create.
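
As a hedged illustration, here's roughly how you could check that for yourself by combining the embeddings API with the cosine similarity function above. The exact scores will vary, but related terms should score noticeably higher than unrelated ones:

import numpy as np
import openai

openai.api_key = "YOUR KEY"

def embed(text):
    # Create an embedding for a single piece of text
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pc, mac, banana = embed("PC"), embed("Apple Mac"), embed("banana")

print(cosine_similarity(pc, mac))     # relatively high: related concepts
print(cosine_similarity(pc, banana))  # noticeably lower: unrelated concepts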

Prompt Engineering

Once our vector store has found the chunks of our original document(s) that likely provide an answer to the user's question, it's a simple matter of adding that contextual information to a prompt and sending that to a generative AI completion endpoint, such as the ChatGPT API.

For example, we might have a prompt template something like this, into which we insert the user's question and the pieces of contextual text that we got from our vector store:

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}

The beauty of this approach is that ChatGPT takes the pieces of context and rewords them to create a coherent answer to the question. In previous generations of technology we'd probably have returned snippets of text from the source and hoped that the user could make sense of them. But that approach isn't at all natural and requires the user to piece together evidence from multiple places.

Using the approach I've described, we get a very natural conversation. Answers directly reference the question, and different pieces of evidence are neatly summarised and integrated into a single coherent reply.

As a human, if you ask me a question I'm not going to just recite a set of references in reply. Instead, I will try to summarise those references and offer my interpretation of them and how they apply to the specifics of the question. That's just what ChatGPT does for us using the approach I've described.

Bing

Let's pause for a moment to think about how Microsoft Bing uses GPT-4.

Bing issues a search query, finds pieces of context and adds those to the GPT-4 prompt, from which an answer is constructed. In other words, Bing uses a very similar approach to the one I've described above.

If you want a version of Bing where the content is your content, rather than web searches, now you know how to build it.

An Example

Here's a screen-grab of an implementation of the architecture discussed in this post. The content the system was 'trained on' is US President Biden's State of the Union speech. As you can see, I've asked it what was said about covid. The GPT-4 API has done a really nice job of glueing the context into a cohesive answer to the question. The wording it's provided isn't the wording of the original speech: it's taken those original words and created a natural reply to the specific question I asked.

Langchain

I built that example using a useful library called Langchain that's gaining a lot of popularity. Langchain does all the tedious work of making the API calls to the models, interacting with the vector store and injecting context into prompts.

import os
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

# OpenAI key
os.environ["OPENAI_API_KEY"] = "YOUR KEY"

# Load the source text and split into chunks
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

# Create embeddings and a simple in-memory vector store (Chroma)
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(docs, embeddings)

# Create the prompt template into which the retrieved context is injected
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# Retrieve only the chunks that are relevant to the question, then pass
# those (and only those) to the model to generate an answer
chain = load_qa_chain(llm, chain_type="stuff", prompt=PROMPT)
query = "what was said about covid?"
relevant_docs = db.similarity_search(query)
result = chain({"input_documents": relevant_docs, "question": query}, return_only_outputs=True)
print(result["output_text"])

The above code implements a very simple vector search with the ChatGPT API, Langchain and Chroma. As you can see, it's pretty easy: there's no complicated maths or obscure algorithms to juggle with.

Fine Tuning

I should mention that there's another way of influencing LLMs, known as fine tuning. This approach takes our data and uses it to apply an additional layer of training on top of the core model. Instead of injecting information into the prompt, fine tuning embeds it in a copy of the original model that you then use. It's possible to fine tune open source models, and OpenAI provides a fine tuning API for its earlier models. However, neither ChatGPT nor GPT-4 can currently be fine tuned.

Fine tuning is a very different approach to the one I've described here and merits a separate discussion, so I won't go into more detail in this post. I found this post a good description of how to go about fine tuning.

In my experience, injecting information into the prompt works extremely well and has the advantage of working across all the OpenAI models and even models from other providers like Cohere and AI21Labs. It might be worth experimenting with both approaches to see what works best for you.

Summary

We can build our own ChatGPT-like service that, instead of just using ChatGPT's knowledge, uses our own private source of information.

We can use GPT-3's Embeddings API to drive the search for pieces of text in a Vector Store that likely help to answer our user's question.

Then, we can use the ChatGPT API to piece this all together and construct a coherent answer to the user's question.

We can apply 'guardrails' to the model's behaviour through the prompt template. And we can do the same for our users' behaviour by checking their input for inappropriate content with the Moderation endpoint.
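
As a minimal sketch, checking a user's input with OpenAI's Moderation endpoint might look like this (answer_question is a hypothetical stand-in for the semantic search pipeline described above):

import openai

openai.api_key = "YOUR KEY"

user_question = "What savings accounts do you offer?"

# Ask the Moderation endpoint whether the input breaches the content policy
moderation = openai.Moderation.create(input=user_question)

if moderation["results"][0]["flagged"]:
    reply = "I'm sorry, I'm not able to help with that."
else:
    # answer_question is a hypothetical placeholder for the semantic search
    # pipeline described earlier in this post
    reply = answer_question(user_question)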

This approach is hugely AI centric, but is an example of an AI application architecture that's very definitely not "throw every question at the model and pray". Instead, we use a variety of different models/endpoints, together with a Vector Store and prompt templates, to drive a more sophisticated use of the underlying technology. The end result is the ability to build something that exploits niche or private content that the public models don't know about.

👉🏻 Please follow me on LinkedIn for updates on Generative AI 👈🏻
