Begin: I still don't have an intuitive understanding of attention. I have watched countless videos, but I still can't explain it without fumbling at some point and getting confused.
Idea: Implement common attention mechanisms to understand.
Process: I started with the most intuitive video series about attention that I have ever watched: "Rasa Algorithm Whiteboard - Transformers & Attention" on YouTube. I created the arrays as they explained them and used for loops and list comprehensions to construct the matrices and operate on them. This is how I came up with the initial, inefficient implementation. It soon got more complicated and seemed like the wrong approach once the calculations for the Q, K and V matrices were involved. I looked for ways to solve this with linear algebra; after a few trial-and-error experiments and some help from this guide on Medium, I vectorized the process and came up with KVQ_selfattention. Writing down shapes wherever possible helped me keep track of what's going where.
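For reference, the vectorized version can be sketched like this in NumPy. This is a minimal sketch of scaled dot-product self-attention with learned projections, not the code from KVQ_selfattention.py itself; the function and variable names here are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kvq_self_attention(X, W_q, W_k, W_v):
    """Vectorized self-attention with trainable projections.

    Shapes (writing them down really does help):
      X:              (seq_len, d_model)  input embeddings
      W_q, W_k, W_v:  (d_model, d_k)      projection matrices
      returns:        (seq_len, d_k)      contextualized embeddings
    """
    Q = X @ W_q                                  # (seq_len, d_k)
    K = X @ W_k                                  # (seq_len, d_k)
    V = X @ W_v                                  # (seq_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # (seq_len, d_k)
```

Each row of `weights` says how much every other token contributes to that token's new representation, which is the whole trick.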
End: I now understand attention at an intuitive level. I had to brush up on matrix multiplication, vectorization, etc. to implement it at a relatively low level. I'd highly recommend that anyone try this exercise. If your intention is to learn how attention works, please don't look at my code unless you've spent a couple of hours trying to implement it yourself. If you just want a layer that can contextualize your embeddings, use the SelfAttention module from SelfAttention.py; if you want trainable parameters in your attention block, use KVQ_selfattention from KVQ_selfattention.py.
You can also look at self_attention_forloop.py to see an inefficient (but easier to read and comprehend) implementation of self-attention.