jeonsworld / ViT-pytorch

PyTorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)


Why do we need to calculate residual connections when visualizing attention maps?

JamenceTom opened this issue

Thanks for your great work!
I am curious: why do we need to calculate the residual connections when visualizing attention maps?

I'm curious too! Why do we need this?

Same question here. Hi @jeonsworld, could you please elaborate on the specific reason for adding this identity matrix? Much appreciated.

In my opinion, this is because ViT's Transformer blocks use residual connections: each block outputs x + Attention(x), so the information a token receives is not described by the attention matrix A alone but by something closer to A + I, where the identity accounts for the token's own input passing straight through the skip connection. Accumulated from layer 1 to 12, the attention map the model effectively uses is this residual-augmented attention map.
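
For what it's worth, here is a minimal sketch of that idea, in the spirit of attention rollout (Abnar & Zuidema, "Quantifying Attention Flow in Transformers"). The function name, tensor shapes, and equal weighting of the attention matrix and the identity are my own assumptions for illustration, not necessarily the exact code in this repo's notebook:

```python
import torch

def attention_rollout(per_layer_attn):
    """Combine attention maps across layers, accounting for residual connections.

    per_layer_attn: list of tensors, one per Transformer layer,
                    each of shape (num_heads, num_tokens, num_tokens).
    Returns a (num_tokens, num_tokens) matrix: row i is the effective
    attention of output token i over the input tokens.
    """
    num_tokens = per_layer_attn[0].size(-1)
    rollout = torch.eye(num_tokens)
    for attn in per_layer_attn:
        # Fuse heads by averaging.
        attn_fused = attn.mean(dim=0)
        # The skip connection passes each token through unchanged,
        # so augment the attention matrix with the identity ...
        attn_aug = attn_fused + torch.eye(num_tokens)
        # ... and re-normalize so every row is still a distribution.
        attn_aug = attn_aug / attn_aug.sum(dim=-1, keepdim=True)
        # Chain layers by matrix multiplication to propagate attention
        # from the input tokens up through the current layer.
        rollout = attn_aug @ rollout
    return rollout
```

Without the identity term, multiplying the raw attention matrices across 12 layers would ignore the information each token keeps about itself through the skip connections, and the resulting visualization tends to wash out.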