The Attention Based Transformer Model And Its Recent Results

Negoiţă D. D. Felix
12 min read · Dec 25, 2020

Since 2017, a new architecture for Natural Language Processing (NLP) has succeeded in establishing itself as the state-of-the-art technology for language related tasks: the transformer. The present paper provides a summary of the context in which this new model appeared as well as some of the challenges that needed to be tackled. It then explains how the transformer helps solve them through the attention mechanism and mentions two of its more recent developments (2019 and 2020) and how they have surpassed previous benchmark scores.

Introduction

Even though the transformer model as it was introduced in 2017 can be applied to a wider range of tasks, the literature focuses on its applications in the field of NLP, hence it is this particular aspect that the present report shall follow as well. Language models in general are used for text generation, translation, creating summaries, or answering reading comprehension questions. They represent a probability distribution over sequences of tokens in a given language, e.g., if the target language is English, the model will output how likely a certain combination of tokens is to represent proper English structure. Hence, we can describe language models as follows:

Abstract description of any language model

Where d is the variable dimensionality of the document and y is our prediction.
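One compact way of writing this mapping (a sketch only, with t_1, …, t_d standing for the tokens of the document) is:

f : (t_1, t_2, \dots, t_d) \longmapsto y, \qquad y \approx P(t_1, t_2, \dots, t_d)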

The terms “token” and “word” are used interchangeably throughout this report for simplification. However, we must note that “word” is a technical term in linguistics and, as such, has a very specific meaning. “Token” is used in computer science and the field of NLP as a rough approximation of the linguistic term “morpheme”. When the distinction is relevant, the latter term will be used.

N.B.: the document eventually has to be encoded in a fixed-size vector.

Work with the Convolutional and Recurrent Neural Networks that were state of the art before the introduction of the transformer architecture established quite early on that, even though we can treat the input tokens as an unordered collection of words (more colloquially referred to as a “bag of words”), the predictions in such a case would rarely be useful. The reason for that is easily understood by humans through their intuition of natural language and can be illustrated by the following generative grammar analysis of the sentence “John claimed that the rain caused accidents”:

X-Bar Theory example of inflection phrase — figure 1

Figure 1 helps illustrate that the ordering of morphemes in sentences carries meaning in itself. The syntax of a language influences its semantics through dependencies. Hence, handling long-term dependencies, id est, a model’s ability to “remember” tokens in earlier positions, is one of the challenges that NLP architectures had to overcome.

Context For The Development Of The Transformer: RNNs and LSTMs

Before 2017, the literature and field were dominated by Recurrent Neural Networks, because of the need to take previous outputs as inputs in order to simulate context.

Recurrent Neural Network using a memory vector — figure 2

However, passing all previous outputs becomes extremely computationally expensive as we deal with larger and larger sequences. A much more efficient way is to use a memory vector that is modified at each step of the RNN. Figure 2 exemplifies the high-level concepts of such an architecture.

The memory vector is an attempt to tackle the dependency problem. It contains information about previous tokens. However, because of vanishing and exploding gradients, the more iterations the network goes through, the more the memory vector is modified, with information either disappearing or being overwritten. Hence, performance drops with long sentences.
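To make the memory-vector idea concrete, here is a minimal sketch in Python (toy dimensions, random weights in place of trained ones, and a plain tanh update chosen for illustration rather than taken from any specific published cell). Because the vector is rewritten at every step, information about early tokens gradually fades, which is exactly the problem described above.

```python
import numpy as np

# Minimal sketch of the idea in figure 2: a single memory vector h is updated
# at every step instead of carrying all previous outputs along with the input.
d_in, d_mem = 8, 16
W_x = np.random.randn(d_mem, d_in) * 0.1   # input-to-memory weights
W_h = np.random.randn(d_mem, d_mem) * 0.1  # memory-to-memory weights

h = np.zeros(d_mem)                        # the memory vector
for x_t in np.random.randn(20, d_in):      # a toy sequence of 20 token embeddings
    h = np.tanh(W_x @ x_t + W_h @ h)       # each step rewrites the memory vector
```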

A variant of this RNN architecture, called the Long Short-Term Memory (LSTM) network, approaches this problem by introducing a new type of neural cell. While its exact implementation is beyond the scope of the present paper, the new cell has the ability to, among other things, allow past information to skip a great deal of the processing done in the current step, thus dealing much better with longer sequences. However, a few issues still remained, among the most prominent of which were the long training times, long gradient paths, incompatibility with transfer learning, the need for specific labelled datasets for each task, and, most importantly, their inherently serial nature. Since part of the input is the output of the previous cycle, such RNN models are unable to take advantage of the parallel computation capabilities of modern GPUs. While the transformer model addresses this need for parallelization, let us first take a look at the attention mechanism.

Attention

The attention mechanism can be viewed as letting the system decide which parts of the input to focus on. It was first studied in the context of computer vision (2014) and then applied to NLP in 2015 under the term “soft-search”. In Bahdanau et al. (2015) an RNN (encoder-decoder) is still used as the architecture; however, instead of passing the same vector generated from all the hidden states when computing the target probability p(y), a different vector is passed for each target word, and those vectors depend on annotations created from the input sequence by the encoder. Although this approach was created with the intention of dealing with vanishing and exploding gradients, and although it made it easier for humans to understand what the network is “thinking”, it suffered from a couple of limitations and still relied on the sequential nature of the RNN.

In general, attention of the type the transformer architecture employs uses three vectors, Q, K, and V, because it mimics the retrieval of a value v_i for a query q based on a key k_i in a database system. Needless to say, since we are dealing with probability distributions, this is done in a fuzzy way, unlike in a database. We can imagine a relationship of the following form.
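One plausible way of writing it (referred to below as formula (1)):

\mathrm{attention}(q, K, V) = \sum_{i} \mathrm{softmax}\big(\mathrm{similarity}(q, k_i)\big)\, v_i \qquad (1)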

Here, the similarity function returns scores that are then passed to a softmax function to produce weights which add up to 1. The latter are then multiplied with the values to produce attention vectors. If we pick a scaled dot product as our similarity function, formula (1) takes a particularly convenient form.
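In matrix form, with all queries, keys, and values stacked into the matrices Q, K, and V (an equivalent way of writing the per-token sums above), this becomes formula (2):

\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \qquad (2)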

Where d represents the dimensionality of the keys.

The Transformer Architecture

The transformer proposed in 2017 is a sequence-to-sequence encoder-decoder network that abandons the recurrence entirely and, as the title of the original paper suggests, relies on the attention mechanism alone. Having, thus, no need for previous outputs to compute the target at a given state, the model can take advantage of modern GPUs and the parallelization they offer. The entire input sequence is passed into the transformer at once.

The encoder part’s main components are the positional embedding and multi-head self-attention. Through the first one, the transformer tackles the idea exemplified in Figure 1, id est, the same word’s ability to have different meanings depending on its position in different sentences. Alongside the traditional embedding of the token, the paper uses the sin and cos functions to create an additional positional encoding vector, based on the token’s position scalar pos. This results in a vector of the same dimensionality as the vector embedding the word, which means it can then be added to the token’s embedding.

Thus, the new vector will represent an embedding of the word and its context. There are, of course, multiple ways of achieving the same effect.

Example of adding positional information to word embeddings — figure 3
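As a rough illustration of the sinusoidal variant described above (toy dimensions; the helper name positional_encoding is mine, not from the paper), the sin/cos pair at each position yields a vector of the same size as the token embedding, so the two can simply be added:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                 # token positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]              # dimension-pair index
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cos
    return pe

embeddings = np.random.randn(6, 32)                   # 6 toy token embeddings
embeddings = embeddings + positional_encoding(6, 32)  # add positional information
```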

The concept of self-attention simply refers to the attention mechanism being applied to the tokens of the same sequence. Each token embedding is multiplied with the weight matrices WQ, WK, and WV to generate the query, key, and value vectors respectively. Just as described in formulas (1) and (2), by using a scaled dot product, similarity scores are calculated by multiplying the query vector of one word with the key vectors of every token in the sequence (including itself). Each (normalized) score is multiplied with the corresponding value vector and the results are summed to obtain the final attention vector for the token. In more technical terms, the difference from generalized attention is best described by the fact that the query, key, and value vectors all come from the same sequence (the input).

Exemplification of self-attention — figure 4
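A compact sketch of single-head self-attention following formula (2), with random matrices standing in for the learned WQ, WK, and WV (all sizes are toy values chosen for illustration):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, d_k = 6, 32, 16
X = np.random.randn(seq_len, d_model)       # token embeddings (positions already added)
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # queries, keys, values for every token
scores = Q @ K.T / np.sqrt(d_k)             # scaled dot-product similarities
attention = softmax(scores) @ V             # weighted sum of values, one row per token
```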

Through multi-head attention, the process described above is done N times in parallel, where N represents the number of heads. The important aspect here is that we generate different W weight matrices for each head, leading to different Q, K, and V vectors. This allows the model to learn to focus differently by experimenting with different representational sub-spaces. The resulting vectors are concatenated and normalized before being passed into the next layer.
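In the notation of the original paper, each head applies the attention of formula (2) to its own projections (one set of WQ, WK, WV per head), and the results are concatenated and projected once more:

\mathrm{head}_i = \mathrm{attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_N)\, W^{O}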

The decoder takes the vectors from the encoder and generates output, while also being fed the previous outputs, until the “end of sentence” token is reached. Although this does look like it creates a recurrence, the method of ‘teacher forcing’ is used (in case of a mistake, the correct output is fed in instead of propagating the mistake) to decouple the output of the decoder from its own input.

The decoder contains two attention layers. One is an “encoder-decoder” multi-head attention. Its K and V vectors come from the output of the encoder, while the Q vector is generated from the decoder’s second attention layer, the “masked multi-head attention”, which uses solely the predicted tokens. Since this earlier, masked, layer only uses output words and the target sequence is known during training, a mechanism was needed to ensure that the current token does not form a dependency on future ones. Otherwise, there would effectively be no learning.

Example of masking attention to future tokens — figure 5

This is achieved by employing a masking matrix which uses −infinity to ensure that the relationship with future tokens is marked by 0 in the resulting vector. Formula (2) can thus be adapted as follows.
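With M denoting such a masking matrix (0 for the current and earlier positions, −∞ for future ones), the adaptation can be written as:

\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}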

After all of the discussed attention-related layers, the transformer uses normalization and feed-forward networks to ensure that each attention vector is passed to the next block in the form that block expects.

Results And Further Developments

The architecture published in 2017 achieved better or close-to-state-of-the-art results while reducing training costs. Needless to say, this produced a real paradigm shift in the field and its literature, with multiple transformer variants proposed soon after.

Proposed in 2019, the BERT model uses a series of transformer encoders to achieve much higher performance for language tasks that, unlike translation, are not necessarily sequence-to-sequence. Its training essentially consists of two main parts. In the first one, the model is pre-trained to gain a general understanding of language by using Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). In MLM the goal is to predict tokens that were removed (masked) from a sentence, while NSP takes two different sentences as input and aims to predict whether or not the second one can follow the first in a logical manner. These two tasks are used to pre-train in an unsupervised manner and simultaneously, id est, two sentences are given as input and they also contain masked tokens. The word vectors are constructed from the traditional token embeddings and the positional embeddings we have seen in the transformer architecture, to which a segment embedding is added, containing information about which sentence the word comes from. Needless to say, the loss is minimized by taking into account the predictions for the masked tokens. What differs greatly from the original transformer is that Masked Language Modelling allows the architecture to train in a bi-directional manner when constructing token embeddings, not just with the left-to-right context we have seen in the decoder above.
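As a rough illustration (plain Python, invented toy sentences, and ignoring BERT's finer replacement rules for masked positions), one pre-training example combines both tasks like this:

```python
import random

# Toy construction of one BERT pre-training example combining NSP and MLM.
sentence_a = ["the", "rain", "caused", "accidents"]
sentence_b = ["john", "claimed", "it", "did"]
is_next = True  # NSP label: does sentence_b logically follow sentence_a?

# Special tokens delimit the pair; segment ids feed the segment embedding.
tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
segments = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)

# MLM: hide roughly 15% of the real tokens; the model must predict the originals.
labels = list(tokens)
for i, tok in enumerate(tokens):
    if tok not in ("[CLS]", "[SEP]") and random.random() < 0.15:
        tokens[i] = "[MASK]"

print(tokens)    # masked input fed to the encoder stack
print(segments)  # which sentence each position belongs to
print(is_next)   # target for the NSP head
```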

After the pre-training, BERT can be fine-tuned for specific NLP tasks with supervised training. Only the output parameters are learnt from scratch while the rest of the model parameters are slightly modified, making this step much faster. The model obtains state-of-the-art performance on eleven natural language tasks.

In addition, the 2019 paper shows that scaling the model up to extreme sizes improves performance by a considerable margin even for very granular tasks, confirming an intuition that had existed in the literature for quite some time. It represents an undeniable stepping stone for OpenAI’s third iteration of their GPT model, briefly presented below.

The recently developed (2020) GPT-3 places a unique emphasis on rephrasing language learning as a “few-shot” problem, that is, using only a few examples, and on creating a massive model of 175 billion parameters. Previous research had focused on fine-tuning the model for a specific task. However, with GPT-3, the pre-training is massive enough that the model performs extremely well on a wide variety of tasks with no examples, one example, or a few examples, and without updating its weights. The paper shows that, because of this, even though the model does get better in proportion to the size of the training corpus, the larger variants of GPT-3 make much better use of the information about the context of tokens. Human accuracy in detecting which piece of text was constructed by the model and which one was written by a human is about 52%. Since OpenAI exposed an API to GPT-3, it has been successfully employed in tasks that were not traditionally the domain of language models per se, such as generating code or solving arithmetic problems formulated in natural language. As of the time of this paper, it still remains to be seen what magnitude of impact it will have on the field.
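To give a flavour of the few-shot setting (the example below is purely illustrative; only the format of the prompt matters), the task is demonstrated a handful of times inside the prompt itself and the model is left to complete the last line, with no weight updates involved:

```python
# A few-shot prompt: the task is shown through examples in the prompt itself,
# and the model is expected to continue the final line.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "library =>"
)
```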

Conclusion

The field of Natural Language Processing has seen a real paradigm shift with the introduction of the transformer architecture. By renouncing the recurrent nature of previous models, the transformer has given scientists the ability to take advantage of modern GPU parallel processing for faster training. The attention mechanism has helped solve the issue of context in language, opening the way for much better predictions. The consequence has been, as one might expect, the development and training of ever more powerful architectures that have constantly pushed the boundaries of what state-of-the-art results can be. The actual improvements of each type of model on the different benchmarks for NLP tasks can, of course, be consulted in the papers referenced below; however, judging by the potential uses of tools as powerful as GPT-3, it may not be an exaggeration to state that the attention-based transformer architecture has brought a much wider range of possibilities for language models.

References

A. Vaswani et al., “Attention Is All You Need,” 2017.

N. Chomsky, “Remarks on Nominalization,” in Readings in English Transformational Grammar, pp. 184–221, 1970.

S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, 1997.

D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2015.

J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding,” 2019.

T. B. Brown et al., “Language Models Are Few-Shot Learners,” 2020.

I have written this article on the workings of the Transformer architecture in NLP as an introduction for an essay on how said models are applied to higher order semantics and what we can learn from the results. I realise that this present post is far beyond the short texts that social media and our obsessive context-switching online behaviour have bestowed upon us. However, I believe that one simply cannot appease trends endlessly if one is serious about real and complex topics and education. There are no doubt many more explanations out there for what I have written and surely also better ones. However, I have found them either too brief or too technical and I have striven to obtain a balance between these two worlds. At the late hour of editing this, I am unsure of my success in that.

Thank you for your interest and making it this far! I could write endlessly about language, however I will stop here for now. You can also reach me at: https://www.linkedin.com/in/negoiță-felix-76858453 and I write book reviews in English and Romanian on www.negofelix.com

