How BERT is Revolutionizing Natural Language Processing

By Daniele Moro

In February of 2019, the AI community was enraged. OpenAI, a nonprofit research company co-founded by Elon Musk for the purpose of “[ensuring] that artificial general intelligence benefits all of humanity,” decided to withhold their latest AI model, GPT2. This new machine learning model was a titan. With over 1.5 billion parameters and trained on 40GB of Internet text, GPT2 can generate text so coherent and intelligent that it’s nearly indistinguishable from text written by a human. This technology has huge applications: autocompleting emails and search queries, writing news articles for reporters, and even making intelligent and helpful robots. Yet OpenAI realized that GPT2 could be used by malicious actors to generate fake news and inflammatory text at an unprecedented scale. “Due to our concerns about malicious applications of the technology,” they decided not to release the trained model, violating their mission and their very name. I will explore why this model was so powerful, explain the context behind the BERT revolution that led to this event, and highlight the impact that this technology will have on our relationship with artificial intelligence.

The events that transpired in February of 2019 were born out of a revolution started by Google in a 2018 paper called “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. By understanding the essence of what makes the complex BERT model tick, one can begin to understand the scope of the BERT revolution. At its most basic level, BERT is a language model: a statistical algorithm that attempts to predict a missing word from the words around it. The simplest version of this task is predicting what word follows an input sequence of words. The human brain is a perfect example of a language model. For example, if I show you this sentence, can you predict what word follows? “Happy birthday to _”. If you guessed “you”, your inner language model has been successfully trained by years of life experience. Artificial intelligence scientists attempt to build artificial language models that output a probability distribution over possible words, which can then be used to pick the most likely word to follow an input sentence. But to work well, these models must not only understand the intricacies and ambiguity of human language, but also develop an ability to reason about how language abstractly represents the world we perceive. To predict the word “you” in the example above, a language model would need to understand what birthdays are, that they are happy occasions celebrated by humans, and that a particular human is celebrated at such events. Today, most language models cannot possess all the knowledge and reasoning this requires, and so they take shortcuts by finding statistical patterns in human language.
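
To make the idea of “a probability distribution over the next word” concrete, here is a minimal sketch of about the simplest language model possible: one that only counts which word tends to follow which. This is not how BERT works. The tiny corpus and the function names are made up for illustration, and the model looks only at the last word of the input.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_distribution(follow_counts, context):
    """Return P(next word | last word of context) as a dictionary."""
    last_word = context.lower().split()[-1]
    counts = follow_counts.get(last_word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the model has never seen this word, so it cannot guess
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus, invented for this example.
corpus = [
    "happy birthday to you",
    "happy birthday to you",
    "happy birthday to me",
    "welcome to the party",
]

model = train_bigram_model(corpus)
distribution = next_word_distribution(model, "Happy birthday to")
prediction = max(distribution, key=distribution.get)

print(distribution)  # probability assigned to each candidate next word
print(prediction)    # the single most likely next word
```

A model like this knows nothing about birthdays. It has only noticed that “you” tends to follow “to” in the text it was shown, which is exactly the kind of statistical shortcut described above.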

Relying on statistical patterns in human language can only make a language model so accurate. It may allow a simple model to learn that certain words occur more often than other words, or that certain sequences of words like “happy birthday” are more common than other sequences of words like “smelly dream”. But in order to achieve near human-level coherence, the BERT revolution popularized a new kind of machine learning model called a Transformer that allows language models to process semantic context, recognize long-distance semantic relationships, and efficiently pre-train on gigabytes of textual data.

In the recent past, many language models used a distributional semantics technique called word2vec to process textual data. Word2vec takes as input a word like “bank” and converts it to a sequence of numbers called a vector that represents the word’s meaning. Machine learning models like Long Short-Term Memory (LSTM) networks can then use the numbers created by word2vec to predict subsequent words. The rub of the word2vec approach is that the model doesn’t consider the context of words. For example, if I gave word2vec the sentences “I enjoyed the picnic on the bank” and “those people robbed a bank!”, word2vec would give me the exact same vector representation for both occurrences of the word “bank”, even though one refers to the bank of a river and the other refers to a financial institution. The meaning of a sentence is not created by words in isolation, but by a complex interaction between all of the words in the sentence. By failing to capture the semantic context surrounding words, the word2vec approach forces language models to miss crucial contextual patterns at the core of human language.
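
The “bank” problem is easy to see in a small sketch. Once training is finished, a static embedding behaves like a lookup table from words to vectors, so the surrounding sentence never enters the picture. The three-dimensional vectors and the `lookup_vector` helper below are invented for illustration. Real word2vec vectors are learned from text and usually have hundreds of dimensions.

```python
# Hypothetical, hand-written vectors standing in for trained word2vec output.
STATIC_VECTORS = {
    "bank": (0.21, -0.47, 0.83),
    "picnic": (0.65, 0.12, -0.30),
    "robbed": (-0.58, 0.74, 0.05),
}


def lookup_vector(word, sentence):
    """Return the vector for `word`.

    The sentence is accepted as an argument but never used, which is
    exactly the limitation of a static embedding.
    """
    return STATIC_VECTORS[word.lower()]


river_sentence = "I enjoyed the picnic on the bank"
money_sentence = "those people robbed a bank!"

river_bank = lookup_vector("bank", river_sentence)
money_bank = lookup_vector("bank", money_sentence)

# Both senses of "bank" collapse into one and the same vector.
print(river_bank == money_bank)
```

Any model built on top of these vectors receives identical input for the riverbank and the financial institution, and has to work out the difference on its own.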

To solve the problem of context, models such as ELMo rose to the occasion. ELMo employed the LSTM model to convert words to vectors in a way that preserved the context of the sentence. To obtain the vector meaning of a word, the language model cannot just look at the word in isolation, but must look at the entire sentence. The LSTM model allows ELMo to step through a sentence one word at a time and adjust the meaning of the current word based on the words that come both before and after it. The approach that ELMo introduced vastly improved the capabilities of language models, but this was not enough. A traditional LSTM language model can only look at one word at a time. As a result, by the time the language model sees the last word in a long input, it has often forgotten the first few words it read. This led to language models that understood the context of words but could only use the last part of the input text to predict what word comes next. ELMo-based LSTM models had a memory problem: they could not remember long-distance relationships between words.

In their seminal paper called “Attention is All You Need”, Google researchers solved the problem of long-distance semantic relationships by introducing the Transformer, a new machine learning model that captures the meaning of the entire sentence at once, without needing to read one word at a time. The key innovation was the use of the Attention Layer, a type of trainable algorithm that gives each word in the sentence the ability to selectively pay attention to only certain other words in the sentence. This innovation soon gave birth to BERT, which many consider to be the spark that proved how the Transformer can be used to create incredibly accurate language models that approach human-level coherence. Once BERT has been pre-trained on gigabytes of text, data scientists can take the pre-trained model and fine-tune it to perform any number of specific tasks. For example, in my research, I used BERT to accurately determine the emotional response to descriptions of robot actions. My work could be used by robots in the future to adjust their behavior to produce more favorable emotional reactions from the people who interact with them.
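
For readers who want to see what “paying attention” means numerically, here is a minimal sketch of scaled dot-product attention, the core operation of the Attention Layer, written in plain Python. It comes with some loud assumptions: the two-dimensional word vectors are made up, and the same vectors are reused as queries, keys, and values. A real Transformer learns separate projections for each of these and runs many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sentence at once.

    Each word's output is a weighted average of every word's value
    vector, so no word is processed in isolation.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for query in queries:
        scores = [dot(query, key) / scale for key in keys]
        weights = softmax(scores)
        output = [
            sum(weight * value[i] for weight, value in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors for a three-word "sentence".
words = ["river", "bank", "money"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Simplification: the same vectors act as queries, keys, and values.
outputs, weights = attention(vectors, vectors, vectors)

for word, row in zip(words, weights):
    print(word, [round(weight, 2) for weight in row])
```

Each printed row shows how much one word attends to every word in the sentence, including itself. Because the made-up vector for “bank” points in nearly the same direction as “river”, it puts more weight on “river” than on “money”. Nothing here depends on how far apart two words are, which is why attention copes with long-distance relationships better than reading one word at a time.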

In the two years since the BERT paper was released, the field of Natural Language Processing has drastically changed, largely abandoning LSTM models and word2vec in favor of BERT-like Transformer models. As described in the paper, BERT has allowed language models to vastly improve on a variety of tasks in the GLUE benchmark, a standard way to measure the effectiveness of language models. This success has inspired many successors, such as the famous GPT2, RoBERTa (Facebook’s optimized version of BERT), DistilBERT (a smaller and more efficient version of BERT), XLM-R (for multilingual classification tasks), and ALBERT (a smaller and significantly more accurate version of BERT).

Google has already started integrating BERT into its search engine, and in the near future, you are likely to find that artificial intelligence models are much more intelligent and able to understand more complex patterns found in human language. You can expect virtual assistants like Alexa and the Google Assistant to understand and respond to your requests much more coherently. Data scientists will be able to use these models to better analyze the vast amount of social media text produced during events like the coronavirus epidemic. You will even be able to play games like AI Dungeon, where the AI can generate engaging and responsive stories like never before. Although this technology can help us take one step closer to an AI of human-level intelligence, it has its dangers as well. Events such as the GPT2 incident show how this technology could be abused by malicious agents to more directly and intelligently target people on social media and increase toxicity on a massive scale. For this reason, it is crucial for all engaged citizens to understand the power and potential of the BERT revolution.
