jacobgil / vit-explain

Explainability for Vision Transformers

Repository from Github https://github.com/jacobgil/vit-explain

What if the visual transformer does not have a class token?

Tgaaly opened this issue

I see that the code in VITAttentionGradRollout requires a class token. What if the model architecture does not have one?

If, for example, my attention matrices are 196x196 (corresponding to a 14x14 spatial resolution), can one take the mean of the attention each patch receives from all other patches, as in mask = result[0].mean(0)? I've tried this but didn't get very meaningful results. Is there another way to handle transformers without class tokens?
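
For concreteness, here is roughly what I mean as a minimal sketch. It assumes per-layer attention tensors of shape [batch, heads, 196, 196]; `rollout_without_cls` is a hypothetical helper, not part of this repo, that mirrors the head fusion and rollout steps of VITAttentionRollout but replaces the class-token row with a column mean over all patches:

```python
import torch

def rollout_without_cls(attentions, discard_ratio=0.9):
    """Hypothetical attention rollout for a ViT with no class token.

    `attentions`: list of per-layer attention tensors of shape
    [batch, heads, N, N] (here N = 196 for a 14x14 patch grid).
    Returns a [14, 14] saliency mask.
    """
    result = torch.eye(attentions[0].size(-1))
    with torch.no_grad():
        for attention in attentions:
            # Fuse the heads by averaging (mean fusion).
            fused = attention.mean(dim=1)[0]
            # Drop the weakest links so noise does not dominate
            # the rollout product.
            flat = fused.view(-1)
            _, idx = flat.topk(int(flat.size(0) * discard_ratio),
                               largest=False)
            flat[idx] = 0
            # Add the identity to account for the residual
            # connection, then re-normalize rows.
            I = torch.eye(fused.size(-1))
            a = (fused + I) / 2
            a = a / a.sum(dim=-1, keepdim=True)
            result = a @ result

    # With no class token, score each patch by the average attention
    # it receives from every patch (column mean) instead of reading
    # off the class-token row.
    mask = result.mean(dim=0)
    grid = int(mask.size(0) ** 0.5)   # 14 for N = 196
    mask = mask.reshape(grid, grid)
    return mask / mask.max()          # normalize to [0, 1]
```

The final `mask = result.mean(dim=0)` is the `result[0].mean(0)` step described above, applied after the layer-by-layer rollout product.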