Transformer using PyTorch
- Positional information
- First word: 0, last word: 1 (normalized position)
- Or assign a linearly increasing number to each position
- Desirable: a constant distance between adjacent positions
- Desirable: a unique value for every position
# Sinusoidal positional encoding
w_k = 1 / 10000 ** (2k / d)
f(t)^i = sin(w_k * t), if i = 2k
         cos(w_k * t), if i = 2k + 1
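The formula above can be sketched as a lookup table of encodings, assuming an even model dimension `d` and positions `t = 0 .. max_len - 1` (the function name and sizes are illustrative):

```python
import torch

def sinusoidal_encoding(max_len, d):
    # w_k = 1 / 10000^(2k/d) for k = 0 .. d/2 - 1
    k = torch.arange(d // 2, dtype=torch.float32)
    w = 1.0 / 10000 ** (2 * k / d)
    t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(t * w)  # even indices i = 2k
    pe[:, 1::2] = torch.cos(t * w)  # odd indices i = 2k + 1
    return pe

pe = sinusoidal_encoding(50, 16)  # one d-dimensional encoding per position
```

Each row is added to the token embedding at that position, giving every position a unique pattern while keeping distances between neighboring positions consistent.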
- Attention looks at words at different locations in the input sentence.
- Query: the word at the current location
- Key: a word at a different location
- Value: the content retrieved for the related words
x = query @ key.T          # similarity between query and every key
x = x / sqrt(d_k)          # scale by key size for stable gradients
x = softmax(x)             # attention weights
x = x @ value              # weighted sum of values
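The four steps above are scaled dot-product attention; a minimal runnable sketch in PyTorch (single sequence, no batch dimension, names are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key: (seq_len, d_k); value: (seq_len, d_v)
    x = query @ key.T                # similarity scores, (seq_len, seq_len)
    x = x / math.sqrt(key.size(-1))  # scale by sqrt(d_k) for stable gradients
    x = F.softmax(x, dim=-1)         # each row sums to 1
    return x @ value                 # weighted sum of values

q = k = v = torch.randn(3, 4)  # self-attention: query = key = value
out = scaled_dot_product_attention(q, k, v)
```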
- Multi-head attention: several attentions computed in parallel
- Each head sees the sentence from a different point of view
- Masked attention: the decoder can't see the future
- The decoder attends only to the present word and the words before it
       I  Love  you
I      1  0     0
Love   1  1     0
you    1  1     1
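The 1/0 table above is a lower-triangular (causal) mask. A minimal sketch of how it is applied: positions marked 0 get their scores set to -inf before the softmax, so they receive zero attention weight:

```python
import torch

seq_len = 3  # "I Love you"
mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = visible, 0 = future

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(mask == 0, float("-inf"))    # hide future positions
weights = torch.softmax(scores, dim=-1)                  # future weights become 0
```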
- Attention mixes (shuffles) information across positions; the feed-forward network then transforms each position independently
x = linear(x)
x = relu(x)
x = linear(x)
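The three lines above are the position-wise feed-forward network; a minimal module sketch (the hidden size `d_ff` here is illustrative; the original paper uses 4 * d_model):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # position-wise FFN: linear -> relu -> linear, applied to each position independently
    def __init__(self, d_model=16, d_ff=64):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear1(x)   # expand to the hidden size
        x = torch.relu(x)     # non-linearity
        return self.linear2(x)  # project back to d_model

ffn = FeedForward()
y = ffn(torch.randn(2, 5, 16))  # same shape in and out
```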