jacobgil / vit-explain

Explainability for Vision Transformers

Repository from Github https://github.com/jacobgil/vit-explain

What if the visual transformer does not have a class token?

Tgaaly opened this issue

I see that the code in VITAttentionGradRollout requires a class token. What if the model architecture does not have one?

If, for example, my attention matrices are 196x196 (corresponding to a 14x14 spatial resolution), can one take the mean of the attention each patch receives from all other patches, as in mask = result[0].mean(0)? I've tried this but didn't get very meaningful results. Is there another way to handle transformers without class tokens?
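
For concreteness, here is roughly what I mean as a minimal sketch. It assumes per-layer attention tensors of shape [batch, heads, 196, 196]; `rollout_without_cls` is a hypothetical helper, not part of this repo, that mirrors the head fusion and rollout steps of VITAttentionRollout but replaces the class-token row with a column mean over all patches:

```python
import torch

def rollout_without_cls(attentions, discard_ratio=0.9):
    """Hypothetical attention rollout for a ViT with no class token.

    `attentions`: list of per-layer attention tensors of shape
    [batch, heads, N, N] (here N = 196 for a 14x14 patch grid).
    Returns a [14, 14] saliency mask.
    """
    result = torch.eye(attentions[0].size(-1))
    with torch.no_grad():
        for attention in attentions:
            # Fuse the heads by averaging (mean fusion).
            fused = attention.mean(dim=1)[0]
            # Drop the weakest links so noise does not dominate
            # the rollout product.
            flat = fused.view(-1)
            _, idx = flat.topk(int(flat.size(0) * discard_ratio),
                               largest=False)
            flat[idx] = 0
            # Add the identity to account for the residual
            # connection, then re-normalize rows.
            I = torch.eye(fused.size(-1))
            a = (fused + I) / 2
            a = a / a.sum(dim=-1, keepdim=True)
            result = a @ result

    # With no class token, score each patch by the average attention
    # it receives from every patch (column mean) instead of reading
    # off the class-token row.
    mask = result.mean(dim=0)
    grid = int(mask.size(0) ** 0.5)   # 14 for N = 196
    mask = mask.reshape(grid, grid)
    return mask / mask.max()          # normalize to [0, 1]
```

The final `mask = result.mean(dim=0)` is the `result[0].mean(0)` step described above, applied after the layer-by-layer rollout product.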