Of GPT-4 and reasoning

Duncan Anderson
Barnacle Labs
May 19, 2023

There are two quite remarkable things about GPT-4:

  1. It exhibits evidence of reasoning, and yet
  2. It was never programmed to have any explicit reasoning ability

It’s weird.

There’s been a lot of focus on the ability of Large Language Models to answer questions, but I believe their breakthrough ability is reasoning. Because reasoning gives them the power to do some remarkable things.

But before we talk about the implications of this, let’s first examine the evidence for this reasoning ability…

The Bar Exam

The Bar Exam is the examination that US lawyers take in order to become qualified.

“In most jurisdictions, the examination is two days long and consists of multiple-choice questions, essay questions, and “performance tests” that model certain kinds of legal writing. The National Conference of Bar Examiners (NCBE) creates several component examinations that are used in varying combinations by all but two jurisdictions, sometimes in combination with locally drafted examination components… Generally, earning a degree from a law school is a prerequisite for taking the bar exam. Most law school graduates engage in a regimen of study (called “bar review”) between graduating from law school and sitting for the bar.” Wikipedia.

I take from that description that the Bar Exam isn’t going to be especially easy.

But GPT-4 not only passed the exam, it scored around the 90th percentile of human test-takers.

Now this all sounds very impressive, but for me the penny only really dropped when I looked at what the questions in this exam are like.

I find it hard to describe the power to answer such questions as involving anything other than an ability to reason.

Neurosurgery Exams

As an experiment, GPT-4 was compared against GPT-3.5 and Google Bard in performing the 149-question Self-Assessment Neurosurgery Exam (SANS).

The results once again were encouraging for GPT-4:

“GPT-4 demonstrated improved performance in question categories for which GPT-3.5 exhibited lower accuracy, such as incorporating higher-order problem-solving…”

Again, the penny dropped for me when I looked at the nature of the questions.

How can you not describe this as a form of reasoning?

GPT-4 saved my dog’s life

Twitter user @peakcooper reported an incredible use of GPT-4. His vet was struggling to diagnose the cause of his dog’s sickness, so he copied two sets of blood test results into GPT-4 and asked for its diagnosis.

Astonishingly, GPT-4 correctly analysed the meaning of the blood work, the differences between the two tests and the possible causes.

Bottom line: GPT-4 suggested a cause the vet hadn’t considered, but which turned out to save the dog’s life. Again, it’s tough to argue this example doesn’t exhibit some form of reasoning.

GPT-4 builds apps for you

By coincidence I was at a Generative AI meetup in London last night. One of the demos was of a solution that creates simple apps for you from just a textual description.

The example given was an app that took a photo of a can of Coke, removed the background, placed the can in the jungle, generated a short catchy social media description and posted the result on Instagram. This was all generated automatically from a simple textual description of what the user wanted the app to do.

Again, we are miles away from simple questions and answers here. This solution could reason about the provided description, identify which services it needed to use (remove background, place in a new background, post to Instagram, etc), reason about the sequence those services needed to be called in and then execute that sequence.
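
For a flavour of how this kind of orchestration can work (this is my own sketch, not the meetup code, and the service names below are hypothetical stand-ins): ask GPT-4 to turn the description into an ordered plan over a set of known services, then execute that plan.

```python
import json
from openai import OpenAI  # assumes the openai Python package, v1+ client

client = OpenAI()

# Hypothetical stand-ins for the services the real demo orchestrated.
SERVICES = {
    "remove_background": lambda image: f"{image} (background removed)",
    "place_in_scene": lambda image, scene: f"{image} composited into {scene}",
    "write_caption": lambda subject: f"A catchy caption about {subject}",
    "post_to_instagram": lambda image, caption: f"Posted {image} with caption: {caption}",
}

def plan_steps(description: str) -> list[dict]:
    """Ask the model to decide which services to call, with what arguments, in what order."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You turn an app description into an ordered JSON list of steps. "
                    f"Available services: {list(SERVICES)}. "
                    'Reply with JSON only, e.g. [{"service": "...", "args": {"...": "..."}}].'
                ),
            },
            {"role": "user", "content": description},
        ],
    )
    return json.loads(response.choices[0].message.content)

def run(description: str) -> None:
    # Execute the plan the model produced, one service call at a time.
    for step in plan_steps(description):
        print(SERVICES[step["service"]](**step["args"]))
```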

What about business processes?

If GPT-4 can do all these things, surely automating a simple business process must be a walk in the park?

Conversational AI systems have loads of these examples — a series of simple steps that are required to get something done:

  • collect a series of data items
  • execute a series of troubleshooting tips
  • ask a series of questions to clarify a situation

The problem

As simple as a series of conversational steps ought to be, traditional technology has made them more complicated than you might expect.

Let me introduce you to the two technology villains:

  1. The intent model. This is where we train a system on the things a human might ask it. The problem is that it’s virtually impossible to guess all the many and varied ways that humans speak (or type)… so most intent models are average at best.
  2. A rules-based dialog system. Virtually all Conversational AI systems include a rules-based dialog engine. “Rules-based” means we need to lay out in excruciating detail every single possible conversational pathway. Again, it’s impossible (without herculean efforts) to do this in a way that reflects the variety and complexity of human behaviour. (A toy sketch of what this looks like follows this list.)
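
To make that second villain concrete, here’s a toy sketch, invented purely for illustration, of the kind of hand-built dialog tree involved. Every branch has to be spelled out, and anything off-script falls through to the familiar fallback:

```python
# A toy rules-based dialog tree. Every pathway has to be enumerated by hand,
# and anything the user says outside those pathways hits the generic fallback.
DIALOG = {
    "start": {"prompt": "Is your router powered on?",
              "yes": "check_lights", "no": "plug_in"},
    "plug_in": {"prompt": "Please plug the router in, then say 'done'.",
                "done": "check_lights"},
    "check_lights": {"prompt": "Is the broadband light green?",
                     "yes": "end", "no": "reboot"},
    "reboot": {"prompt": "Turn the router off and on again. Is it fixed?",
               "yes": "end", "no": "call_support"},
    "call_support": {"prompt": "Please call support."},
    "end": {"prompt": "Great, glad it's working!"},
}

def respond(state: str, user_input: str) -> tuple[str, str]:
    """Return (next_state, bot_reply) for one turn of the scripted dialog."""
    next_state = DIALOG[state].get(user_input.strip().lower())
    if next_state is None:
        # The rigidity in action: anything unexpected triggers the fallback.
        return state, "I'm sorry, I'm not sure what you mean. Try rephrasing your question."
    return next_state, DIALOG[next_state]["prompt"]
```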

I’ve built enough of these systems to have lost a few hair follicles to the frustrations involved. But not only are these things tedious and complicated, they also end up being very rigid — who hasn’t used a chatbot that replies with some variation of “I’m sorry, I’m not sure what you mean, try rephrasing your question.”?

Reasoning LLMs change everything

What if, instead of using a crystal ball to predict human behaviour and dream up an intent model and dialog tree (and consequently getting it wrong), we could:

  1. Just give some English language instructions for how we want the system to behave.
  2. Let the system itself figure out what to do and how to respond to a user, given the user’s context and the instructions provided.

That would be pretty cool, I think. And a lot simpler than intents and dialogs. Because such a system would react in real time, guided by a set of instructions, it ought to be able to adapt to any situation it’s presented with.

But that requires a system that can reason about those instructions and the user’s context. Hmm…

A GPT-4 experiment

Might it be that GPT-4’s reasoning abilities allow it to fulfil this role of a conversational system that just needs some natural language instructions to reason about?

The basic use case of questions and answers is already solved: GPT-4 can do that with style. Either with a bit of prompt engineering, or with the addition of a vector store, we can easily get it to respond intelligently to virtually any question, all without a single intent definition.
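
As a rough sketch of what the vector-store option looks like, assuming the openai Python package’s v1 client, a toy in-memory store and made-up document text rather than any particular product:

```python
import numpy as np
from openai import OpenAI  # assumes the openai Python package, v1+ client

client = OpenAI()

# Illustrative placeholder documents; in practice these would be your own content.
DOCS = [
    "Routers show a solid green light when the broadband connection is active.",
    "If the phone line is noisy, the broadband connection will often drop out.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

DOC_VECTORS = embed(DOCS)

def answer(question: str) -> str:
    # Find the document closest to the question (cosine similarity)...
    q = embed([question])[0]
    scores = DOC_VECTORS @ q / (np.linalg.norm(DOC_VECTORS, axis=1) * np.linalg.norm(q))
    context = DOCS[int(scores.argmax())]
    # ...then let GPT-4 answer with that document as context.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```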

But what about simple business processes, where the conversation needs to have a structure? Do this, then do this. If this is true, then do this instead. You get the idea.

Let’s try an experiment…

I copied the text of some broadband connection troubleshooting tips from the internet and created a system prompt from them. I preceded the troubleshooting tips with the following:

You are a broadband connection troubleshooting bot. Your job is to help users diagnose problems with their broadband connections. You must politely decline to answer any questions not directly related to broadband connection issues.

Troubleshooting should follow the following process. Never use text from more than one step to answer the question. You must always start at step 1. Move to other steps only if the user has confirmed that they have tried the previous steps. The troubleshooting process is:
1. ...
2. ...
3. ...
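
For anyone curious how a prompt like this gets “wired in”, here’s a minimal sketch, assuming the openai Python package’s v1 client; SYSTEM_PROMPT is a placeholder for the full instructions and troubleshooting tips above:

```python
# A minimal chat loop that wires the system prompt into GPT-4. The whole
# conversation is resent each turn so the model can track where we are in the process.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a broadband connection troubleshooting bot. ..."  # full prompt goes here

messages = [{"role": "system", "content": SYSTEM_PROMPT}]

while True:
    messages.append({"role": "user", "content": input("You: ")})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"Bot: {reply}")
```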

And here’s a video of me using this prompt, wired into GPT-4.

Notice several things:

  • The system isn’t fazed by my swearing. Sorry for that! But in NLP you have to deal with the realities of how people might choose to interact. Real life can’t be hidden. In the past I’ve built intents to detect swearing, but they are problematic: when is swearing a problem and when is it just a natural expression of someone’s frustration? GPT-4 seems to behave correctly. If I swear too much it gets a bit prudish, but it lets the odd blooper slip in the right context. Just like a human might handle the situation, really.
  • The system correctly interprets my prompt instructions and methodically follows the structured process that I’d outlined. One step at a time, just like I told it to.
  • The first step of the process instructs the system to confirm the user can make a phone call without noise on the line. Notice how I first confirm I could make a phone call, then the system prompts me to confirm if the call was noise-free.
  • Even when I try to throw it off the scent by making irrelevant comments, it very naturally acknowledges my comment and tries to gently usher me back to the process. Again, this is how a human would behave, but it’s absolutely not how a rules-based dialog system would ever behave — such systems are incapable of handling that kind of user behaviour.
  • It correctly reasons that no lights on my router probably means there’s no power, and it tells me so. It seemed to reason about my problem and create its own very natural response, informed by the troubleshooting tips. But it isn’t repeating those tips; it’s interpreting them and responding with something very different from the original text.

I’m very impressed by this performance. GPT-4’s reasoning ability allowed it to follow my natural language description of the troubleshooting process.

👉 I tried the same with GPT-3.5, but it wasn’t good enough to get me through the process. Lesson: high performance LLMs make new things possible.

Building this level of sophistication with a traditional intent/dialog model would be a complete nightmare! But I wrote that prompt in literally 5 minutes with text copied from the internet.

What does this mean?

Is this the end of the intent/dialog model for conversational systems? I think it is. That troubleshooting use case was massively superior to anything I could imagine building with more conventional technology.

However, some use cases or processes will be too complex for GPT-4’s reasoning abilities today. But these abilities are on a clear path of improvement, so the direction of travel seems obvious to me: if you can’t do it today, you probably will be able to in the near future.

This idea that we give guidance to a model and allow it to hold a conversation is troubling to some, because we can never be totally sure about how it will respond. In traditional IT we’ve been able to define a set of tests that exercise every path through a system and tick them off to prove the system behaves the way we wish it to. With LLMs we don’t know what all the possible paths are, so we can’t define an exhaustive set of tests. But, I’d argue that many conventional systems are sufficiently complex that a complete set of tests is an illusion — it’s often beyond reasonable effort to trace every possible path. That’s why conventional systems have bugs — because it’s too difficult to define an exhaustive set of tests. So, really, LLMs aren’t so different after all.

In summary

The benefits of GPT-like technology, in terms of simplifying the way that we encode and execute conversational processes, are enormous. If today we can entrust a troubleshooting process to GPT-4, who knows what we’ll entrust it with tomorrow. Reasoning in Large Language Models is one of the big things that’s improved from GPT-3 -> GPT-3.5 -> GPT-4. We should probably expect this to improve further.
