Question Answering (Part 4): Transformers
In the previous article in this series (Part 3) we discussed the different datasets that can be used to build and benchmark Question Answering models. Each of these datasets maintains a leaderboard recording the top-performing models on that dataset. If you dive into the top entries on these leaderboards, you will find transformers and transformer-based architectures.
In this article we present a high-level overview of attention and transformers, which form the building blocks of most modern Question Answering models.
Recurrent Neural Network (RNN)
An RNN is a type of neural network that can be used for sequence learning. An RNN processes text sequentially: at each step it takes as input the current element of the sequence and the previous hidden state, and generates the current hidden state (hₜ). For the RNN shown below, hₜ is a function of the previous hidden state hₜ₋₁ and the current input Xₜ, and is used to generate the output oₜ for the current step.
hₜ = f(hₜ₋₁, Xₜ)
oₜ = g(hₜ)
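To make the recurrence concrete, here is a minimal NumPy sketch of a single RNN step. The weight matrices, dimensions, and tanh activation are illustrative assumptions, not anything prescribed by the equations above.

```python
import numpy as np

# Illustrative sizes and randomly initialised weights (assumptions).
hidden_size, input_size, output_size = 4, 3, 2
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_size, hidden_size))  # previous hidden state -> hidden
W_x = rng.normal(size=(hidden_size, input_size))   # current input -> hidden
W_o = rng.normal(size=(output_size, hidden_size))  # hidden -> output

def rnn_step(h_prev, x_t):
    """One recurrence: hₜ = f(hₜ₋₁, Xₜ), oₜ = g(hₜ)."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)  # f: combine memory and input
    o_t = W_o @ h_t                          # g: read the output from hₜ
    return h_t, o_t

# Process a sequence of 5 input vectors one step at a time.
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, o = rnn_step(h, x)
```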

Sequence to Sequence (Seq2Seq) Models Using Recurrent Neural Networks (RNN)
Sequence to Sequence (Seq2Seq) models, popularised by Google in 2014, have been the main workhorse of Natural Language Processing tasks such as machine translation and summarisation. A Seq2Seq model consists of an encoder layer and a decoder layer, both implemented using Recurrent Neural Networks (RNN) or a variant such as Long Short Term Memory (LSTM). The encoder converts the input sequence of text into a context vector, which the decoder uses to generate the output for the task at hand.
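The following is a minimal sketch of this encoder-decoder idea, assuming plain tanh RNN cells and randomly initialised weights (both illustrative assumptions). Note how only the encoder's final hidden state reaches the decoder.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # hidden size for both encoder and decoder (assumption)

enc_Wh, enc_Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
dec_Wh, dec_Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def encode(inputs):
    """Run the encoder RNN; the final hidden state is the context vector."""
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(enc_Wh @ h + enc_Wx @ x)
    return h  # context vector summarising the whole input sequence

def decode(context, outputs_so_far):
    """Run the decoder RNN, seeded with the encoder's context vector."""
    h = context
    for y in outputs_so_far:
        h = np.tanh(dec_Wh @ h + dec_Wx @ y)
    return h

context = encode(rng.normal(size=(6, d)))         # e.g. source token vectors
h_dec = decode(context, rng.normal(size=(2, d)))  # tokens generated so far
```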

The Move Towards Attention
The image below unfolds the encoder and decoder RNNs for a typical machine translation task. Because only the hidden state generated at the last step of the encoder is passed as input to the decoder, information is lost along the way. The aim of the attention mechanism is to address this challenge.

Attention takes as its input all the hidden states generated by the encoder. At each step, the decoder RNN uses all the encoder hidden states to decide which part of the input sequence to focus on. Let cₜ be a context vector computed from the encoder hidden states h₀, h₁, …, hₙ at decoder step t. With attention, the decoder equation becomes
hₜ = f(hₜ₋₁, Xₜ, cₜ)
In this new function, a weighted sum of all the encoder hidden states is typically passed to the decoder RNN. The weights are obtained by applying a softmax over alignment scores, which can be computed using different techniques; a short sketch follows the list:
- A shallow neural network can take as its input each encoder hidden state together with the current decoder hidden state, and output a score for that encoder state. This network is trained jointly with the rest of the Seq2Seq architecture.
- Alternatively, the score for each encoder hidden state can be its cosine similarity with the current hidden state of the decoder.
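As a minimal sketch, assuming simple dot-product scores (closely related to the cosine-similarity option above), the context vector for one decoder step could be computed as follows:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(decoder_h, encoder_hs):
    """Weighted sum of encoder hidden states for one decoder step."""
    scores = encoder_hs @ decoder_h  # one alignment score per encoder state
    weights = softmax(scores)        # attention distribution over the input
    return weights @ encoder_hs      # context vector cₜ

rng = np.random.default_rng(2)
encoder_hs = rng.normal(size=(6, 4))  # h₀ … h₅ from the encoder
decoder_h = rng.normal(size=4)        # current decoder hidden state
c_t = attention_context(decoder_h, encoder_hs)
```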
Transformers: Attention Is All You Need
Recurrent Neural Networks suffer from the drawback that they process text sequentially, which makes them hard to parallelise, particularly on modern hardware accelerators such as Graphics Processing Units (GPUs). In their seminal work published in 2017, Vaswani et al. proposed that instead of a sequential network, an ordinary feed-forward neural network can be used in the encoder-decoder architecture, combined with a mechanism called self-attention. The resulting architecture is called a transformer.

Transformers use neither a Recurrent Neural Network (RNN) nor a Convolutional Neural Network (CNN). To retain information about the sequential nature of text, transformers add information about each token's position to its embedding. This is called positional encoding.
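A minimal NumPy sketch of the sinusoidal positional encoding described in the paper follows; the sequence length and model dimension below are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from Vaswani et al. (2017):

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]      # token positions
    i = np.arange(0, d_model, 2)[None, :]  # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

# Added to the token embeddings before the first encoder layer.
embeddings = np.random.default_rng(3).normal(size=(10, 16))
encoder_input = embeddings + positional_encoding(10, 16)
```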
In the encoder, the positionally encoded input is passed through a multi-head attention mechanism. The result of the multi-head attention is added to the original input of the attention layer, normalised, and then passed through a standard feed-forward neural network. In the decoder, the input is shifted right by one position, combined with positional information, and passed through one layer of multi-head attention. The result is combined with the encoder output, passed through another layer of multi-head attention, and finally through a feed-forward neural network.
The attention mechanism maps a Query and a set of Key-Value pairs to an output: the weight assigned to each Value is determined by how well the Query matches the corresponding Key. In the first multi-head attention layer of both the encoder and the decoder, the Query, Key, and Value are all derived from the layer's input. In the second multi-head attention layer of the decoder, the Query comes from the decoder input (after passing through the first multi-head attention layer), while the Keys and Values are taken from the encoder output.
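The scaled dot-product attention from the paper can be sketched in a few lines of NumPy; the shapes below are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ / √d_k) V  (Vaswani et al., 2017)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V               # weighted sum of the Values

rng = np.random.default_rng(4)
Q = rng.normal(size=(5, 8))  # 5 queries of dimension d_k = 8
K = rng.normal(size=(7, 8))  # 7 keys
V = rng.normal(size=(7, 8))  # 7 values, one per key
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 8)
```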
In the next part of this series we will look at Bidirectional Encoder Representations from Transformers (BERT), a popular transformer-based model for contextual word embeddings, and how it can be used to develop Question Answering models.