Transformer using PyTorch
- Positional information
- First word: 0, last word: 1 (normalized position)
- Or assign a linearly increasing number to each position
- Desirable: a constant distance between adjacent positions
- Desirable: a unique value for every position
# Sinusoidal positional encoding
w_k = 1 / 10000 ** (2k / d)
f(t)^i = sin(w_k * t), if i = 2k
         cos(w_k * t), if i = 2k + 1
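The formula above can be sketched as a lookup table of encodings, assuming an even model dimension `d` and positions `t = 0 .. max_len - 1` (the function name and sizes are illustrative):

```python
import torch

def sinusoidal_encoding(max_len, d):
    # w_k = 1 / 10000^(2k/d) for k = 0 .. d/2 - 1
    k = torch.arange(d // 2, dtype=torch.float32)
    w = 1.0 / 10000 ** (2 * k / d)
    t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(t * w)  # even indices i = 2k
    pe[:, 1::2] = torch.cos(t * w)  # odd indices i = 2k + 1
    return pe

pe = sinusoidal_encoding(50, 16)  # one d-dimensional encoding per position
```

Each row is added to the token embedding at that position, giving every position a unique pattern while keeping distances between neighboring positions consistent.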
- Attention looks at words at different locations in the input sentence.
- Query: the word at the current location
- Key: a word at a different location
- Value: the content retrieved for the related words
x = query @ key.T          # similarity between query and every key
x = x / sqrt(d_k)          # scale by key size for stable gradients
x = softmax(x)             # attention weights
x = x @ value              # weighted sum of values
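The four steps above are scaled dot-product attention; a minimal runnable sketch in PyTorch (single sequence, no batch dimension, names are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key: (seq_len, d_k); value: (seq_len, d_v)
    x = query @ key.T                # similarity scores, (seq_len, seq_len)
    x = x / math.sqrt(key.size(-1))  # scale by sqrt(d_k) for stable gradients
    x = F.softmax(x, dim=-1)         # each row sums to 1
    return x @ value                 # weighted sum of values

q = k = v = torch.randn(3, 4)  # self-attention: query = key = value
out = scaled_dot_product_attention(q, k, v)
```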
- Multi-head attention: several attentions computed in parallel
- Each head sees the sentence from a different point of view
- Masked attention: the decoder can't see the future
- The decoder attends only to the present word and the words before it
       I  Love  you
I      1  0     0
Love   1  1     0
you    1  1     1
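The 1/0 table above is a lower-triangular (causal) mask. A minimal sketch of how it is applied: positions marked 0 get their scores set to -inf before the softmax, so they receive zero attention weight:

```python
import torch

seq_len = 3  # "I Love you"
mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = visible, 0 = future

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(mask == 0, float("-inf"))    # hide future positions
weights = torch.softmax(scores, dim=-1)                  # future weights become 0
```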
- Attention mixes (shuffles) information across positions; the feed-forward network then transforms each position independently
x = linear(x)
x = relu(x)
x = linear(x)
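The three lines above are the position-wise feed-forward network; a minimal module sketch (the hidden size `d_ff` here is illustrative; the original paper uses 4 * d_model):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # position-wise FFN: linear -> relu -> linear, applied to each position independently
    def __init__(self, d_model=16, d_ff=64):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear1(x)   # expand to the hidden size
        x = torch.relu(x)     # non-linearity
        return self.linear2(x)  # project back to d_model

ffn = FeedForward()
y = ffn(torch.randn(2, 5, 16))  # same shape in and out
```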