karpathy / minGPT

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training


Is it more reasonable to only use causal attention in the first block of GPT

charlesxu90 opened this issue

Dear @karpathy,

Thanks for this nice GPT implementation. Really helps a lot!

When comparing this GPT with other Transformers, I found that all of the attention layers here use causal self-attention. I'm wondering whether that is really needed, or whether we could use causal self-attention only in the first block, so as to avoid using future tokens in prediction.

I'm not sure if my idea is correct, but ideally, using causal attention only in the first block should already avoid using future tokens, as there are no residual connections between blocks. For reference, a sketch of the masking I'm referring to is included below.
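For context, here is a minimal sketch of where the causal mask enters each block's attention. It is not the exact minGPT code; names such as `n_embd`, `n_head`, and `block_size` mirror the repo's config, but the class itself is illustrative:

```python
# Simplified sketch of a causal self-attention layer (assumption: follows the
# general shape of minGPT's CausalSelfAttention, not copied verbatim).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # combined q, k, v projection
        self.proj = nn.Linear(n_embd, n_embd)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # the causal mask is applied here, inside every block's attention layer
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

My question is essentially whether this masking step is necessary in every block, or only in the first one.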

Thanks for your attention.