Pre-training head of ViT
cissoidx opened this issue
In the ViT paper, it says:
The classification head is implemented by a MLP with one hidden layer at pre-training
time and by a single linear layer at fine-tuning time
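To make the distinction concrete, here is a minimal sketch of the two head variants the paper describes. The function name and signature are mine, not from the official implementation; the tanh nonlinearity follows the original ViT code's pre-logits layer.

```python
from typing import Optional

import torch.nn as nn


def make_head(embed_dim: int, num_classes: int,
              hidden_dim: Optional[int] = None) -> nn.Module:
    """Hypothetical helper: build either head variant from the paper."""
    if hidden_dim is not None:
        # Pre-training: MLP with one hidden layer.
        return nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )
    # Fine-tuning: a single linear layer.
    return nn.Linear(embed_dim, num_classes)
```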
So if you are using the timm package, the head is defined like this:
https://github.com/rwightman/pytorch-image-models/blob/d3f744065088ca9b6b3a0f968c70e90ed37de75b/timm/models/vision_transformer.py#L293
Did you reach the stats in your paper using a single-layer head or a head with one hidden layer?
I am not sure I understand the difference.
Anyway, I have reproduction code for fine-tuning Mixer:
https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/Transfer_learning.md
I believe it will also work for ViT.
Thanks for the reply.
What I meant is that, according to the paper, pre-training and fine-tuning use different heads. But your implementation seems to use only the fine-tuning head even while you are pre-training.
Anyway, it might not be an issue since you reached SOTA. I am just asking whether you had noticed.
If you have a suggestion or code for further improvement, feel welcome to share :-)
google-research/vision_transformer#124
Will see what Google replies.
google-research/vision_transformer#124 (comment)
`representation_size` is the answer!
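For anyone landing here later: a minimal sketch of the mechanism, assuming the behavior of the timm version linked above. When `representation_size` is set, a hidden tanh layer (pre-logits) sits before the classifier, giving the pre-training MLP head; when it is `None`, pre-logits is an identity and the head is a single linear layer, i.e. the fine-tuning variant. The class below is illustrative, not timm's actual code.

```python
import torch
import torch.nn as nn


class TinyViTHead(nn.Module):
    """Illustrative stand-in for timm's head logic (not the real class)."""

    def __init__(self, embed_dim: int, num_classes: int,
                 representation_size=None):
        super().__init__()
        if representation_size:
            # Pre-training style: hidden layer + tanh before the classifier.
            self.pre_logits = nn.Sequential(
                nn.Linear(embed_dim, representation_size),
                nn.Tanh(),
            )
            head_dim = representation_size
        else:
            # Fine-tuning style: no hidden layer, single linear classifier.
            self.pre_logits = nn.Identity()
            head_dim = embed_dim
        self.head = nn.Linear(head_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.pre_logits(x))
```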