karpathy / minGPT

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training


Is it more reasonable to only use causal attention in the first block of GPT

charlesxu90 opened this issue

Dear @karpathy,

Thanks for this nice GPT implementation. Really helps a lot!

When comparing this GPT with other Transformers, I found that all of the attention layers here use causal self-attention. I'm wondering whether that is really needed, or whether we could use causal self-attention only in the first block, so as to avoid using future tokens in prediction.

I'm not sure if my idea is correct, but ideally, using causal attention only in the first block should already avoid using future tokens, as there are no residual connections between blocks. For reference, a sketch of the masking I'm referring to is included below.
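For context, here is a minimal sketch of where the causal mask enters each block's attention. It is not the exact minGPT code; names such as `n_embd`, `n_head`, and `block_size` mirror the repo's config, but the class itself is illustrative:

```python
# Simplified sketch of a causal self-attention layer (assumption: follows the
# general shape of minGPT's CausalSelfAttention, not copied verbatim).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionSketch(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # combined q, k, v projection
        self.proj = nn.Linear(n_embd, n_embd)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # the causal mask is applied here, inside every block's attention layer
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

My question is essentially whether this masking step is necessary in every block, or only in the first one.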

Thanks for your attention.