Hype or hope – are we witnessing a breakthrough in NLP technology?

A confusing array of deeply technical NLP announcements, coupled with a healthy degree of press hype, makes it tough to understand what’s happening. Here, I attempt to make sense of things.

We’ve seen a wide array of Natural Language Processing (NLP) algorithms and technologies announced over the past year – many of which sound incredibly exciting. But should we get excited, or not? This post outlines my findings in a way that, hopefully, anyone can follow. I will try to avoid the intricacies of machine learning models and focus instead on their implications and impact. For those interested in the nitty-gritty, I will point to other, more technical articles that I’ve found useful in developing an understanding of this confusing space.

🎬 Getting started

For NLP geeks, the past year has changed all of that — and been quite an exciting period. Transformer models, vast training sets, GPT-2, BERT, XLNet, Meena — we’ve never had such an exciting set of announcements!

🧵 The common threads

  • The use of larger and more ambitious models, trained on vast open source sets of training data. With a more extensive training set, the latest models achieve good results without always needing any additional domain-specific training. As domain-specific training datasets are not always easy to come by, this is quite a big deal.
  • The use of transformer models. Most machine learning models today are based upon recurrent or convolutional neural networks. A transformer model is a new approach that exploits the concept of self-attention — the ability to take into account a broader set of the input data when a model makes an evaluation. If you’re interested in the specifics of how transformer models work, I found this post very educational.
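To make the self-attention idea in the list above a little more concrete, here is a minimal sketch in plain Python. It is a deliberate simplification: real transformers apply learned query, key and value projections and use several attention heads, all of which are left out here, and the toy vectors are made-up numbers rather than real embeddings.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified scaled dot-product self-attention.

    Assumption: queries, keys and values are the input vectors themselves
    (no learned projections), purely for illustration.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position in the input.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted mix of all the input vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional "embeddings" for three tokens (made-up numbers).
tokens = ["the", "cat", "sat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token "attends" to every token in the input, including itself. The point is simply that every position gets to weigh up the whole input rather than just its neighbours.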

The new models we’re seeing emerge are typically multi-purpose: they can be applied to a variety of NLP tasks such as question answering, text classification, machine translation and sentiment analysis.

🏆 Measuring success

On both GLUE and SQuAD, we have transformer models exceeding human performance. And on RACE, the models exceed average human performance as defined by Amazon Mechanical Turk workers, but not yet the very best humans (human ceiling performance). Quite a good sign, I would say!

But a word of caution…

These leaderboards measure specific aspects of NLP tasks and do not fully reflect the level of additional reasoning that we as humans apply on top of basic reading/comprehension. It’s possible for a model to exceed the performance of humans at understanding text, but fail to impress us because it doesn’t include that additional reasoning that we take for granted.

For example, if I read the following passage of text as a human, I can easily infer that, as the current year is 2020, it’s the 35th anniversary of the foundation of NeXT.

Following his resignation from Apple in 1985, Jobs founded NeXT Inc. with $7 million.

Current NLP models do not make that inference and so are unable to answer the question “how long ago was NeXT founded?” because they can only provide answers that are extracts from the original text. It’s important to look behind the hype and understand what NLP advancements can and cannot do for us – because high performing NLP models on their own still don’t represent “intelligence” as most humans would perceive it.
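The gap between extracting and inferring can be sketched in a few lines of Python. This is only an illustration: `extract_year` is a hypothetical stand-in for an extractive model (a real model predicts the answer span, it doesn't use a regular expression), and `years_since` is the extra reasoning step we humans apply without thinking.

```python
import re

PASSAGE = ("Following his resignation from Apple in 1985, "
           "Jobs founded NeXT Inc. with $7 million.")

def extract_year(passage):
    """Stand-in for an extractive model: returns a span copied from the text.

    Hypothetical helper; a real model predicts the span rather than
    matching a pattern.
    """
    match = re.search(r"\b(19|20)\d{2}\b", passage)
    return match.group(0) if match else None

def years_since(year_text, current_year):
    """The additional reasoning step: arithmetic on the extracted fact."""
    return current_year - int(year_text)

span = extract_year(PASSAGE)
print(span)                     # a substring of the passage
print(years_since(span, 2020))  # a value that appears nowhere in the passage
```

The first result can be found by pointing at the text; the second can't, because "35" never appears in the passage. That second step is the part the leaderboards don't measure.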

🏋️‍♂️ A word on Pre-Training

Firstly, training occurs typically on a vast and diverse training set — the resultant models are referred to as “pre-trained” and are provided to implementors who are expected to undertake a “fine tuning” step that adds domain-specific knowledge to the model – if needed. Fine-tuning is optional and is frequently not required, as the base models have such broad training.

It turns out that the availability of pre-trained models is critical, as those models are very large and take significant amounts of compute resources to build. We’re talking days of training time and hundreds of thousands of dollars of cost – don’t try this at home!

Fine-tuning is a much cheaper task than training, but you probably still want to do this on cloud GPUs, rather than your laptop. Luckily, Google Colab (which has become a bit of a standard for such tasks) provides an easy way to do this, together with generous free allocations of resources. Here’s a tutorial on fine-tuning BERT on Google Colab.

One thing is for certain – this is no Dialogflow, Rasa or Watson Assistant. Exploiting the latest NLP technologies requires an investment in skills and resources. The software may be free, but you will pay in other ways.

I personally find it very hard to work out how a model might be used for a specific task, such as text classification – usage documentation is typically scant at best, if it even exists. Luckily the folks at HuggingFace are correcting this, providing easy-to-use solutions that exploit all of the underlying models I’m about to discuss. If you want to play with the technology, I highly encourage you to take a look at HuggingFace first.

🧮 The models


GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. Although this may seem simple, it’s actually a big innovation over previous approaches. Until GPT-2, NLP models typically relied solely on the last word in the sequence to make a prediction for the next word — meaning GPT-2 introduces vastly more context to the task and enables much more accurate predictions than we’ve seen before.
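A toy example may help to show why the amount of context matters for next-word prediction. The sketch below is not how GPT-2 works internally (GPT-2 is a large neural network, not a table of counts), and the tiny corpus is invented; it only illustrates the contrast drawn above between predicting from the last word alone and predicting from more of the preceding text.

```python
from collections import Counter, defaultdict

def train(words, context_size):
    """Count which word follows each context of `context_size` previous words."""
    table = defaultdict(Counter)
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        table[context][words[i]] += 1
    return table

def predict(table, context):
    """Return the most frequently seen next word for this context, if any."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Made-up training text, split into words.
corpus = ("the dog chased the ball . "
          "the cat chased the mouse . "
          "a cat ate the mouse .").split()

last_word_only = train(corpus, context_size=1)
three_words = train(corpus, context_size=3)

# Both calls try to continue the phrase "the dog chased the ...".
print(predict(last_word_only, ["the"]))
print(predict(three_words, ["dog", "chased", "the"]))
```

With only the last word to go on, the model falls back on whatever most often follows "the" in the corpus ("mouse"). With three words of context it can tell that a dog is doing the chasing, and predicts "ball" instead.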

It’s worth emphasising that GPT-2 uses a left-to-right training approach – i.e. it uses previous words to predict a missing word, but not subsequent words. This is a significant limitation, which we’ll come on to later.

GPT-2 has also been trained on a vast dataset – 8 million web pages, 40 gigabytes of data. In the past we’ve seen NLP models trained on more controlled sources, such as the contents of Wikipedia. GPT-2’s training sources are much more varied and eclectic. Given the right starting words, it can generate paragraphs in many different styles – adventure stories, news reports and even porn scripts!

The model’s fame came about because of its ability to generate seemingly plausible, but entirely fictitious paragraphs of text. Its party trick is to take a few starting words and generate a complete paragraph from them.

For example, I entered “Duncan Anderson is” and got the following (all completely made up, I hasten to add) from GPT-2:

Duncan Anderson is a PhD candidate in Human Geography in the Department of Geography, University of Calgary. His research focuses on public health, social justice, and globalization. Duncan has an interest in comparative spatial development, addressing the understanding of global migration from outside of North America and Europe. This work is funded in part by the Industrial Research Assistance Program (IRAP), Globalisation and Health.

Not bad, huh? You can generate your own fictitious life stories at Talk to Transformer.

GPT-2’s wider context for predicting words, together with its vast training set, means that it’s able to generate paragraphs of text that are coherent and plausible — a big advance over previous technologies.

OpenAI was initially reluctant to release the full GPT-2 model, preferring to expose more limited versions of it. They cited particular concern about the ability to fine-tune GPT-2 models on four ideological positions: white supremacy, Marxism, jihadist Islamism, and anarchism. This whole back-story, of course, proved far too tempting for the press – resulting in the inevitable dramatic headlines. My take: such headlines are very, very far from the truth and should be completely ignored!


Bearing in mind that no actual malicious uses emerged from the earlier, more limited releases, in November 2019 OpenAI released the full GPT-2 model to the world. They also developed and released a GPT-2 Output Detector, which is said to have a 95% accuracy rate. My personal stance is that GPT-2’s generated text, whilst plausible, is still too far removed from reality to represent a threat at this point.

Now generating plausible but fake text is all very interesting, but can GPT-2 do anything more useful? One intriguing example is the adventure game AI Dungeon, which uses GPT-2 to generate infinitely variable adventure game scenarios.

AI Dungeon

So, is GPT-2 intelligent? The Economist published an interview with GPT-2, implying it might be. However, when we dig a little further into that interview’s making, we find that for each question GPT-2 generated 5 different responses. The reporter chose to publish the ones that made the most sense and that fitted his narrative. If we had seen all of the responses, I doubt many of us would consider there to be much intelligence there.

The astute may have noticed a fatal flaw in GPT-2’s text generation abilities — that you prime it with some words, but what direction it chooses to take from there is entirely outside of your control. Whilst the results may be amusing, it’s difficult to see many practical applications for this. However, new work by Uber is providing an ability to influence and steer how the model behaves. Much more is required in this area to make models like GPT-2 truly useful, for example by enabling the generation of text that’s based on a corpus of known facts.


Like GPT-2, BERT is a transformer model. Pre-trained BERT models are becoming widely available and provide excellent performance across a range of common NLP tasks. If training data on a specific domain is available, fine-tuning of the model can be achieved in a few hours, using a single GPU. Not exactly the almost instant training we’re used to with production NLP solutions like IBM Watson, but at least something that’s economically possible.

The key innovation of BERT is its bidirectional nature – whilst GPT-2 learns only from the words that come before the one being processed, BERT learns from the words both before and after it. This broader “attention” is a big contributor to BERT’s accuracy.
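The difference between the two approaches can be pictured by asking which words a model is allowed to look at when it fills in a hidden word. The helper below, `visible_context`, is a hypothetical illustration rather than anything from GPT-2 or BERT themselves, and the example sentence is my own.

```python
def visible_context(tokens, position, bidirectional):
    """Return the tokens a model may look at when predicting `position`.

    Simplified illustration: a left-to-right model sees only earlier tokens,
    a bidirectional model sees tokens on both sides of the hidden one.
    """
    left = tokens[:position]
    right = tokens[position + 1:] if bidirectional else []
    return left, right

sentence = "I went to the bank to deposit money".split()
target = sentence.index("bank")

# Left-to-right (GPT-2 style): only the words before "bank" are visible.
print(visible_context(sentence, target, bidirectional=False))
# Bidirectional (BERT style): the words after "bank" are visible too.
print(visible_context(sentence, target, bidirectional=True))
```

In this sentence, the words that tell you which kind of "bank" is meant ("deposit money") come after it, so only the bidirectional view gets to use them.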

BERT’s party trick is in question answering — provide a body of text, ask questions of that text and watch BERT find the answer. You can try it out here.

What’s impressive here is BERT’s ability to find an answer to a question from an otherwise “unseen” body of text, with no training. It’s not surprising that Google now uses this ability to improve its core product, search.

But BERT can be used for a wide variety of NLP tasks – for example, here’s a tutorial on using it for text classification, one of the most useful of NLP tasks.


In text classification, arguably one of the most critical NLP tasks, XLNet reduces error rates by around 16% when compared to BERT.
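It’s worth being clear about what a figure like that means. Reading it as a relative reduction (my assumption), it cuts a slice off the errors the older model was already making, rather than adding 16 percentage points of accuracy. Here’s a small worked example; the 5% baseline error rate is hypothetical and not a figure from any benchmark.

```python
def error_after_relative_reduction(baseline_error, relative_reduction):
    """Error rate left after cutting `baseline_error` by a relative fraction."""
    return baseline_error * (1.0 - relative_reduction)

# Hypothetical baseline: a model that misclassifies 5% of documents.
baseline = 0.05
improved = error_after_relative_reduction(baseline, 0.16)
print(f"{baseline:.2%} -> {improved:.2%}")
```

Under that assumption, a model that got 5 in every 100 documents wrong would now get roughly 4.2 in every 100 wrong: a real improvement, but a more modest one than the headline number might suggest.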

For those interested in the details of how XLNet differs from BERT, I found this post very helpful.

🏎 Model optimisation

In response to concerns about cost and performance, we’ve seen a number of efforts at optimising transformer models, such as DistilBERT, which its authors report retains around 97% of BERT’s performance whilst running 60% faster. There is probably still a lot of work required to optimise these enormous models and make them more efficient.

💬 Meena, a next generation chatbot?

Human: I like how cute cows are

Meena: Ah, interesting. You like how they are cute, right?

Human: yeah, and they’re realy smart

Meena: I heard they go to college

Human: cows go to college?

Meena: I heard that a cow went to Harvard.

Human: what did the cow study?

Meena: Bovine sciences.

Human: do horses go to Harvard?

Meena: Horses go to Hayvard.

Human: that’s a pretty good joke, I feel like you led me into it

Meena: You were trying to steer it elsewhere, I can see it.

Pretty cool, huh? However, as impressive as that is, Meena still has some big drawbacks.

Firstly – whilst that example is great – from what Google states, Meena does not always produce meaningful conversations. Imagine having a chatbot for your company that is only coherent most of the time. Any client I’ve ever worked with would demand coherence all of the time.

Secondly, Meena’s training took 30 full days on 2048 compute cores – at an estimated $1.4m!

Lastly, that joke was quite cool, but where did it come from? The answer is that nobody knows. It could have just as easily been a joke that made no sense, or a joke that implied some undesirable characteristic like racism or sexism. What do we do when a client phones us up and says “I don’t like that joke, please make sure it doesn’t do that again”? The answer is that we can’t do anything. Hunting down the specific pieces of training data that influenced that joke’s creation is practically impossible. And even if we could, fixing that data and retraining might cost us over $1.4m in compute costs!

So Meena is exciting and impressive, but also perplexing. Whilst it points to the future, it’s tough to see many practical applications today. As a piece of research, I can’t help but be impressed – but it’s incomplete at this point. More to do!

📝 Summary

  1. Large models are computationally expensive to operate – and this can preclude them from being used in some commercial settings, simply because the cost-per-transaction is too high.
  2. We’re still in the early days of learning how to operate these models in a production setting and there’s a lot of work to do. Using BERT or XLNet today requires a level of skill and knowledge that, whilst not impossible to gain, is a big step up from your average chatbot tool.
  3. Perhaps most importantly, pre-trained models, by their very nature, are unpredictable. We can’t just tinker with the training data to adjust responses like we do with today’s technologies – doing so might cost us hundreds of thousands of pounds of compute resources to rebuild the model each time. And whilst the models are impressive, they sometimes provide strange responses that we can’t influence.
  4. We’re still missing the ability to augment these models with a corpus of known facts. Rather than allowing models to generate plausible text, we need to be able to get them to generate text that relates to a set of facts that we specify. Chit-chat is all well and good, but the core of a chatbot needs to be providing factual responses on the core subject, not plausible off-topic responses.

(1) and (2) feel like hygiene factors, whereas (3) and (4) are areas for further research. Transformer models are a big leap forward, but there’s still a lot more work to do.

So, to answer my opening question: are we seeing a breakthrough in NLP progress? I think the answer is that, yes, we are seeing a breakthrough… but mostly that breakthrough is a work in progress. There are significant barriers to the adoption of these technologies and much still to do. BERT, XLNet, GPT-2 and Meena aren’t, of themselves, going to lead to a new generation of chatbots that feel human. But they probably do point to an exciting future where more research builds on these underpinnings to deliver something amazing. It feels that we’re at the start of an exciting journey, not the end.

Watch this space!

Eclectic tastes, amateur at most things. Learning how to build a new startup. Former CTO for IBM Watson Europe.
