Why do we need to account for residual connections when visualizing attention maps?
JamenceTom opened this issue
I'm curious too! Why do we need this?
Same question here. Hi @jeonsworld, could you please elaborate on the specific reason for adding this identity matrix? Much appreciated.
In my opinion, it's because each transformer block in ViT has a residual connection: a layer's output is x + Attention(x), so the effective token-mixing matrix is A + I (re-normalized), not A alone. So from layer 1 to 12, the attention map the model actually applies is this residual attention map, which is why the identity matrix is added before visualization.
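To illustrate the point above, here is a minimal sketch (my own, not code from this repo) of combining per-layer attention maps while accounting for the residual path, in the style of attention rollout. The `rollout` helper and the toy uniform-attention matrices are hypothetical, purely for illustration:

```python
import numpy as np

def rollout(attentions):
    """Combine per-layer attention maps into one map, accounting for
    the residual connection by mixing in the identity matrix.

    attentions: list of (tokens, tokens) row-stochastic attention
    matrices, already averaged over heads.
    """
    joint = np.eye(attentions[0].shape[0])
    for attn in attentions:
        # Residual connection: output = x + Attention(x), so the
        # effective mixing matrix blends A with the identity.
        attn = 0.5 * attn + 0.5 * np.eye(attn.shape[0])
        # Re-normalize rows so each still sums to 1.
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Accumulate mixing across layers.
        joint = attn @ joint
    return joint

# Toy example: 12 layers of uniform attention over 4 tokens.
layers = [np.full((4, 4), 0.25) for _ in range(12)]
out = rollout(layers)
print(out.sum(axis=-1))  # each row still sums to 1
```

Without the identity term, multiplying the raw attention matrices across 12 layers would ignore the information that flows straight through the skip connection, so the visualized map would understate each token's contribution to itself.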