Output of CausalSelfAttention
whchan05 opened this issue
It seems that the output of this block is simply the reshaped concatenation of the multiple heads. In the original "Attention Is All You Need" paper there is another linear layer, W^O, applied after the concatenation. May I ask whether omitting it is intentional or an error? Thank you
This does have W^O; it's here:
Line 42 in 37baab7
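To make the shape of the answer concrete, here is a minimal NumPy sketch of a causal self-attention forward pass, not the repo's exact code: the fused qkv projection plays the role of c_attn, and the final output projection is the paper's W^O (the attribute name c_proj, the bias terms, and dropout are assumed/omitted here).

```python
import numpy as np

def causal_self_attention(x, W_attn, W_out, n_head):
    """Single-sequence sketch: fused qkv projection (the c_attn role),
    causal softmax attention per head, concat, then the output
    projection W^O (c_proj in the repo, name assumed)."""
    T, d = x.shape
    d_k = d // n_head
    q, k, v = np.split(x @ W_attn, 3, axis=-1)            # each (T, d)
    # reshape into heads: (n_head, T, d_k)
    q, k, v = (a.reshape(T, n_head, d_k).transpose(1, 0, 2) for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)      # (n_head, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # future positions
    scores = np.where(mask, -np.inf, scores)              # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # softmax per row
    y = (w @ v).transpose(1, 0, 2).reshape(T, d)          # concat heads
    return y @ W_out                                      # this is W^O

rng = np.random.default_rng(0)
T, d, n_head = 4, 8, 2
x = rng.normal(size=(T, d))
out = causal_self_attention(x, rng.normal(size=(d, 3 * d)),
                            rng.normal(size=(d, d)), n_head)
```

The output has the same shape as the input, so the block can be stacked; without the final `y @ W_out` the heads would never mix after concatenation.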
But I had a similar question. It turns out that this is equivalent to the formulation in the Transformer paper, but the equivalence is a bit tricky to see.
The paper does the following:
- for each of the n_head heads, applies three separate linear projections, each reducing the embedding from d_model down to d_k = d_model / n_head
- that yields the q, k, v for each head
You can think of the paper as doing 3 * n_head linear projections.
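The paper's view can be sketched as follows (a toy illustration with assumed sizes, not code from either the paper or the repo):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_head = 4, 8, 2
d_k = d_model // n_head
x = rng.normal(size=(T, d_model))

# Paper-style: one (W^Q_i, W^K_i, W^V_i) triple per head, each mapping
# d_model -> d_k, i.e. 3 * n_head separate linear projections in total.
heads = []
for _ in range(n_head):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append((x @ Wq, x @ Wk, x @ Wv))    # each projection is (T, d_k)
```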
This repo instead does it all with a single linear layer, c_attn:
- one matrix multiply produces q, k, v for all heads at once
- the result is then split into q, k, v and reshaped so that each head gets its own slice of the embedding
The paper, by contrast, does not slice up the embeddings: each head has its own linear layer that maps the full embedding down to a smaller size, attention runs on those smaller projections, the head outputs are concatenated, and W^O maps the concatenation back to the model dimension. Because a single large projection matrix is just the per-head projection matrices laid side by side, the two formulations compute the same thing.
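The equivalence can be checked numerically: slicing the output of one fused projection (the c_attn style) gives exactly the same per-head tensors as applying the corresponding column slices of that matrix head by head (the paper style). A small NumPy check, with assumed toy sizes and the q projection only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_head = 4, 8, 2
d_k = d_model // n_head
x = rng.normal(size=(T, d_model))

# Fused projection in the style of c_attn (bias omitted):
# one (d_model, 3*d_model) matrix producing q, k, v at once.
W = rng.normal(size=(d_model, 3 * d_model))
q, k, v = np.split(x @ W, 3, axis=-1)                    # each (T, d_model)
q_heads = q.reshape(T, n_head, d_k).transpose(1, 0, 2)   # (n_head, T, d_k)

# Paper-style: per-head W^Q_i taken as column slices of the fused matrix.
Wq = W[:, :d_model]
per_head = np.stack([x @ Wq[:, h * d_k:(h + 1) * d_k] for h in range(n_head)])

assert np.allclose(q_heads, per_head)   # identical per-head projections
```

So "project everything then slice" and "slice the weights then project per head" are the same computation; the repo's version just does it in one matmul.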
Fwiw, you can see him talking about this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919