jeonsworld / ViT-pytorch

PyTorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)


Why do we need to calculate residual connections when visualizing attention maps?

JamenceTom opened this issue

Thanks for your great work!
I am curious: why do we need to calculate the residual connections when visualizing attention maps?

I'm curious too! Why do we need this?

Same question here. Hi @jeonsworld, could you please elaborate on the specific reason for adding this identity matrix? Much appreciated.

In my opinion, this is because ViT's Transformer blocks use residual connections: each block outputs x + Attention(x), so the information a token receives is not described by the attention matrix A alone but by something closer to A + I, where the identity accounts for the token's own input passing straight through the skip connection. Accumulated from layer 1 to 12, the attention map the model effectively uses is this residual-augmented attention map.
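
For what it's worth, here is a minimal sketch of that idea, in the spirit of attention rollout (Abnar & Zuidema, "Quantifying Attention Flow in Transformers"). The function name, tensor shapes, and equal weighting of the attention matrix and the identity are my own assumptions for illustration, not necessarily the exact code in this repo's notebook:

```python
import torch

def attention_rollout(per_layer_attn):
    """Combine attention maps across layers, accounting for residual connections.

    per_layer_attn: list of tensors, one per Transformer layer,
                    each of shape (num_heads, num_tokens, num_tokens).
    Returns a (num_tokens, num_tokens) matrix: row i is the effective
    attention of output token i over the input tokens.
    """
    num_tokens = per_layer_attn[0].size(-1)
    rollout = torch.eye(num_tokens)
    for attn in per_layer_attn:
        # Fuse heads by averaging.
        attn_fused = attn.mean(dim=0)
        # The skip connection passes each token through unchanged,
        # so augment the attention matrix with the identity ...
        attn_aug = attn_fused + torch.eye(num_tokens)
        # ... and re-normalize so every row is still a distribution.
        attn_aug = attn_aug / attn_aug.sum(dim=-1, keepdim=True)
        # Chain layers by matrix multiplication to propagate attention
        # from the input tokens up through the current layer.
        rollout = attn_aug @ rollout
    return rollout
```

Without the identity term, multiplying the raw attention matrices across 12 layers would ignore the information each token keeps about itself through the skip connections, and the resulting visualization tends to wash out.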