lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

Masked Auto Encoder, class token and linear probing

Gasp34 opened this issue

Hello,

If I understand correctly, when doing linear probing, you only train the last FC layer.
But the ViT's classification head feeds the class token into that FC layer, and the class token is not trained during the MAE self-supervised task.
How can we expect the class token to contain good features if it was never trained?

Thanks

I ran into the same question, but I believe it shouldn't be a problem: the original MAE paper reports similar linear-probing performance when average pooling over the patch tokens is used instead of the class token. So, since the class token is indeed not usable here, I would suggest trying mean pooling — a rough sketch of how that could look with this library is below.
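A minimal sketch of that workflow, assuming the `pool = 'mean'` option of the `ViT` constructor and a head attribute named `mlp_head` (as in recent versions of vit-pytorch; adjust if your version differs). The hyperparameters are illustrative only:

```python
import torch
from vit_pytorch import ViT
from vit_pytorch.mae import MAE

# Build the encoder with mean pooling, so the classification head
# averages the patch tokens instead of reading the (untrained) class token.
vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048,
    pool = 'mean'      # key change: 'mean' instead of the default 'cls'
)

# Self-supervised pretraining with the masked autoencoder wrapper
mae = MAE(
    encoder = vit,
    masking_ratio = 0.75,
    decoder_dim = 512,
    decoder_depth = 6
)

images = torch.randn(8, 3, 256, 256)
loss = mae(images)
loss.backward()        # ...run your full pretraining loop here

# Linear probing: freeze everything except the classification head
# (attribute name `mlp_head` is an assumption — check your version).
for p in vit.parameters():
    p.requires_grad = False
for p in vit.mlp_head.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(vit.mlp_head.parameters(), lr = 1e-3)
logits = vit(images)   # (8, 1000) — train with cross-entropy on your labels
```

With `pool = 'mean'`, the frozen encoder's output fed to the probe is the average of the patch representations that were actually trained by the MAE objective, so the untrained class token no longer matters.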

Good luck!