Output of CausalSelfAttention
whchan05 opened this issue
It seems that the output of this block is simply the reshaped concatenation of the multiple heads. In the original "Attention Is All You Need" paper there is another linear layer, W^O, applied after the concatenation. May I ask whether omitting it is intentional or an error? Thank you
This does have W^O; it's here:
Line 42 in 37baab7
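To make the shape of the answer concrete, here is a minimal NumPy sketch of a causal self-attention forward pass, not the repo's exact code: the fused qkv projection plays the role of c_attn, and the final output projection is the paper's W^O (the attribute name c_proj, the bias terms, and dropout are assumed/omitted here).

```python
import numpy as np

def causal_self_attention(x, W_attn, W_out, n_head):
    """Single-sequence sketch: fused qkv projection (the c_attn role),
    causal softmax attention per head, concat, then the output
    projection W^O (c_proj in the repo, name assumed)."""
    T, d = x.shape
    d_k = d // n_head
    q, k, v = np.split(x @ W_attn, 3, axis=-1)            # each (T, d)
    # reshape into heads: (n_head, T, d_k)
    q, k, v = (a.reshape(T, n_head, d_k).transpose(1, 0, 2) for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)      # (n_head, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # future positions
    scores = np.where(mask, -np.inf, scores)              # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # softmax per row
    y = (w @ v).transpose(1, 0, 2).reshape(T, d)          # concat heads
    return y @ W_out                                      # this is W^O

rng = np.random.default_rng(0)
T, d, n_head = 4, 8, 2
x = rng.normal(size=(T, d))
out = causal_self_attention(x, rng.normal(size=(d, 3 * d)),
                            rng.normal(size=(d, d)), n_head)
```

The output has the same shape as the input, so the block can be stacked; without the final `y @ W_out` the heads would never mix after concatenation.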
But I had a similar question. It turns out that this is equivalent to the formulation in the Transformer paper, but the equivalence is a bit tricky to see.
The paper does the following:
- for each of the n_head heads, applies three separate linear projections, each reducing the embedding from d_model down to d_k = d_model / n_head
- that yields the q, k, v for each head
You can think of the paper as doing 3 * n_head linear projections.
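The paper's view can be sketched as follows (a toy illustration with assumed sizes, not code from either the paper or the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_head = 4, 8, 2
d_k = d_model // n_head
x = rng.normal(size=(T, d_model))

# Paper-style: one (W^Q_i, W^K_i, W^V_i) triple per head, each mapping
# d_model -> d_k, i.e. 3 * n_head separate linear projections in total.
heads = []
for _ in range(n_head):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append((x @ Wq, x @ Wk, x @ Wv))    # each projection is (T, d_k)
```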
This repo instead does it all with a single linear layer, c_attn:
- one matrix multiply produces q, k, v for all heads at once
- the result is then split into q, k, v and reshaped so that each head gets its own slice of the embedding
The paper, by contrast, does not slice up the embeddings: each head has its own linear layer that maps the full embedding down to a smaller size, attention runs on those smaller projections, the head outputs are concatenated, and W^O maps the concatenation back to the model dimension. Because a single large projection matrix is just the per-head projection matrices laid side by side, the two formulations compute the same thing.
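The equivalence can be checked numerically: slicing the output of one fused projection (the c_attn style) gives exactly the same per-head tensors as applying the corresponding column slices of that matrix head by head (the paper style). A small NumPy check, with assumed toy sizes and the q projection only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_head = 4, 8, 2
d_k = d_model // n_head
x = rng.normal(size=(T, d_model))

# Fused projection in the style of c_attn (bias omitted):
# one (d_model, 3*d_model) matrix producing q, k, v at once.
W = rng.normal(size=(d_model, 3 * d_model))
q, k, v = np.split(x @ W, 3, axis=-1)                    # each (T, d_model)
q_heads = q.reshape(T, n_head, d_k).transpose(1, 0, 2)   # (n_head, T, d_k)

# Paper-style: per-head W^Q_i taken as column slices of the fused matrix.
Wq = W[:, :d_model]
per_head = np.stack([x @ Wq[:, h * d_k:(h + 1) * d_k] for h in range(n_head)])

assert np.allclose(q_heads, per_head)   # identical per-head projections
```

So "project everything then slice" and "slice the weights then project per head" are the same computation; the repo's version just does it in one matmul.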
Fwiw, you can see him talking about this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919