
Attention

This is one of the most celebrated ideas in the field of deep learning. What attention does is mimic what our brain does: the human brain does not process a whole visual snapshot of what it sees; instead it looks at one particular portion at a time and processes the scene sequentially. Applying the same principle to computer vision can give really good results. For example, if we use eye tracking together with another camera to record what the eye sees, we can show the machine something close to how our brain is learning.

Sequence to Sequence

Application

As mentioned above, sequence to sequence is one of the most celebrated applications of deep learning. To get a better sense of what we can do with it: if we train it on English phrases and their French translations, we get an English-to-French translator; if we train it on text conversations, we get a chatbot; if we train it on images and captions, we get an image captioning model; and so on.

Encoders and Decoders


At a very basic level, a sequence to sequence model consists of an encoder and a decoder. The encoder can be a CNN or an RNN. It takes in the input, processes the information, and passes a tensor of what it understood, called the state, to the decoder, which in turn produces the output.
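
To make that hand-off concrete, here is a minimal sketch in PyTorch. All names and dimensions are illustrative, and a GRU stands in for whichever RNN variant is used:

```python
import torch
import torch.nn as nn

# Illustrative sizes; a GRU stands in for a generic RNN.
embed_dim, hidden_dim = 32, 64

encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

src = torch.randn(1, 10, embed_dim)   # 10 embedded input tokens
_, state = encoder(src)               # the "tensor of what it understood"

tgt = torch.randn(1, 7, embed_dim)    # 7 embedded output tokens
out, _ = decoder(tgt, state)          # decoder starts from the encoder's state
print(out.shape)                      # torch.Size([1, 7, 64])
```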


If we consider the encoder and decoder as both being RNNs, we can unfold them to see what happens at each time step. RNNs naturally have loops: the output of one step is fed into the next, until the last step, where the encoder hands its state as a tensor to the decoder, which is again an RNN with its own network.

Limitation of the Sequence to Sequence model

As we can see above, the encoder RNN produces a hidden state of a fixed size at each step, but at the end it sends only a single vector to the decoder. This means the model cannot summarize a long sequence in a context vector of such small dimensions. This is the problem attention solves.

Attention Overview

Encoder

Similar to a sequence to sequence model without attention, the encoder accepts a sequence of words, generates a hidden state for each element of the sequence, and then passes a context vector to the decoder. Here, however, the context vector comprises all of the hidden states generated by the encoder, which means a longer sequence produces a larger context vector.

Decoder

The attention decoder receives this data from the encoder and focuses on the hidden vector of the corresponding word; the decoder knows how to do this focusing because it is trained to do so. For example, say we are building a French-to-English translator: there are instances where the order of the words differs between the two languages, and since the decoder is trained on this pattern, it knows exactly when to skip words or go back to a word so that it can order the output correctly, and then it processes the rest sequentially.


Embedding

Encoder

Similar to how we would use a normal RNN with text data, we can use an embedding layer here too. It accepts a word and converts it into a vector we can feed to the RNN; the encoder can then process the sequence of these vectors as usual.
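
As a rough sketch in PyTorch (the vocabulary size and dimensions below are made up for illustration), the embedding step might look like this:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of 1000 words, embedded into 32-dim vectors.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=32)

word_ids = torch.tensor([[4, 17, 962]])  # a 3-word sentence as word ids
vectors = embedding(word_ids)            # shape: (1, 3, 32), ready for the RNN
```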

Decoder

The attention decoder takes in the context vector and uses its own hidden state, then does the following: it applies a scoring mechanism that gives a score to each of the hidden states in the context vector, and applies a softmax function to those scores. Each hidden state is then multiplied by its softmax score, and all the vectors are summed to give a context vector for the decoder. At a high level, the decoder looks at this context vector and its own hidden state and produces an output word and a new hidden state; this continues until all the time steps are over and our output sequence is complete.
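
Here is a sketch of one decoder time step in NumPy, assuming dot-product scoring; the sizes and random values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, src_len = 4, 5
enc_states = rng.normal(size=(src_len, hidden_dim))  # all encoder hidden states
dec_hidden = rng.normal(size=(hidden_dim,))          # decoder's current hidden state

scores = enc_states @ dec_hidden                     # one score per encoder state
e = np.exp(scores - scores.max())
weights = e / e.sum()                                # softmax over the scores
context = weights @ enc_states                       # weighted sum -> decoder context
```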

Bahdanau Attention and Luong Attention

Multiplicative attention

As we have discussed earlier, we use a scoring mechanism to figure out which vector we should be focusing on. For multiplicative attention, this scoring is basically a dot product.

The dot product between a vector A and another vector B is |A||B|cos(θ), where θ is the angle between A and B.

This works as a score because cosine ranges from -1 to 1 depending on the angle between the vectors, so the dot product measures how similar their directions are.

These scores can then be passed through a softmax function, which gives the weights we use to build the context vector.

The inputs to this scoring are the decoder's hidden state at the current time step and the context vector (the stack of hidden states) from the encoder. To do the scoring we take the transpose of the decoder hidden state and dot it with each encoder hidden state, which keeps the dimensionality consistent.

This works fine for models such as a summarization bot, where input and output are in the same language and share the same word embedding space. For a translation bot it does not work out, since the embedding spaces are different, so the scoring function here incorporates a weight matrix to reconcile the dimensions.
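
A sketch of this "general" multiplicative scoring in NumPy; the weight matrix W is what would be learned during training, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
dec_dim, enc_dim, src_len = 4, 6, 5

h_dec = rng.normal(size=(dec_dim,))          # decoder hidden state
H_enc = rng.normal(size=(src_len, enc_dim))  # encoder hidden states
W = rng.normal(size=(dec_dim, enc_dim))      # learned weight matrix

# score(h_dec, h_enc) = h_dec^T W h_enc, computed for every encoder state;
# W reconciles the two different embedding spaces.
scores = H_enc @ (W.T @ h_dec)
```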

Additive attention

This method uses a feed-forward neural network to do the scoring, and is called the concat scoring method. It works by concatenating the decoder hidden state with an encoder hidden state and passing the result through a simple neural network with a single hidden layer, which outputs the score. The parameters of this network are learned during training. The basic calculation is: multiply the network's weight matrix by the concatenated vector, apply tanh to this, and take the dot product with another weight vector of matching dimension, giving us the score.
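
A sketch of the concat scoring computation in NumPy; W_a and v_a stand in for the weights learned during training, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
dec_dim, enc_dim, attn_dim, src_len = 4, 6, 8, 5

h_dec = rng.normal(size=(dec_dim,))
H_enc = rng.normal(size=(src_len, enc_dim))
W_a = rng.normal(size=(attn_dim, dec_dim + enc_dim))  # hidden-layer weights
v_a = rng.normal(size=(attn_dim,))                    # output weights

# score = v_a^T tanh(W_a [h_dec; h_enc]) for each encoder state
scores = np.array([
    v_a @ np.tanh(W_a @ np.concatenate([h_dec, h_enc]))
    for h_enc in H_enc
])
```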

The Transformer

This model simplifies the architecture by using only attention and no RNNs.

Paper: "Attention Is All You Need" (Vaswani et al., 2017)

These models outperformed prior machine translation models in quality while requiring significantly less time to train.

The transformer takes in a sequence of inputs and generates a sequence of outputs, similar to the previous sequence to sequence models we have seen, but with one difference: it can process the whole input sequence in parallel, all at once, instead of one step at a time.

The transformer also includes an encoder and a decoder, but instead of RNNs they use feed-forward networks and a concept called self-attention.

The advantage of this is parallelization, which is not possible when using an RNN.


The transformer uses a stack of Encoders and Decoders.

N=6 is what is proposed in the paper linked above.

Each of these encoders consists of two layers:

  • A multi headed self-attention layer.
  • A feed-forward layer.

The advantage of this is that while focusing on one part of the sentence, the encoder can also attend to other parts that are relevant to it.

This idea comes from prior research on self-attention.

The decoder contains two attention components:

The encoder-decoder attention, which allows it to focus on relevant parts of the input.

The self-attention layer, which focuses on previous outputs.

In addition to these, the decoder also has a feed-forward network to generate the output sequence.

How all this works under the hood

The first step as usual would be to embed the words.

Then we take the word we are focusing on and score it against the other relevant words.

The next step is to scale the scores by dividing each by √d_k, the square root of the dimension of the vectors being compared.

Then we apply a softmax to these scaled scores, multiply each word's vector by its softmax score, and add the resulting vectors up, which produces the self-attention context vector.
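
A sketch of this simplified version in NumPy, scoring each embedding directly against every other embedding (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))            # embedded words

scores = X @ X.T / np.sqrt(d)                # embedding-vs-embedding scores
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # softmax per word
context = weights @ X                        # weighted sum of the embeddings
```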

If we judge the words based only on their embeddings, the scoring will simply pick out other similar words; to improve on this we need to make a modification.

We create a query for each embedding. This can be done by multiplying each embedding by a query matrix, or equivalently by passing it through a query feed-forward layer.

Then we also create keys (and, in the full model, values) in the same way.

The scoring is done by comparing the queries with the keys.

Then we scale the scores by dividing by √d_k.

We do a softmax and multiply each softmax score by the corresponding value vector.

Then we add all the resultant vectors up, giving the self-attention output for that word.
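
Putting the steps together, here is a sketch of scaled dot-product self-attention with query, key, and value matrices; random weights stand in for trained ones, and real models also use multiple heads:

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))      # embedded input words
W_q = rng.normal(size=(d_model, d_k))        # query matrix
W_k = rng.normal(size=(d_model, d_k))        # key matrix
W_v = rng.normal(size=(d_model, d_k))        # value matrix

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)              # compare queries with keys, scale
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # row-wise softmax
Z = weights @ V                              # weighted sum of the values
print(Z.shape)                               # (4, 8): one output vector per word
```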
