Alibaba-MIIL / ImageNet21K

Official PyTorch implementation of the paper "ImageNet-21K Pretraining for the Masses" (NeurIPS 2021)

Pre-training head of ViT

cissoidx opened this issue

The ViT paper says:

The classification head is implemented by a MLP with one hidden layer at pre-training
time and by a single linear layer at fine-tuning time

So if you are using the timm package, the head is defined like this:
https://github.com/rwightman/pytorch-image-models/blob/d3f744065088ca9b6b3a0f968c70e90ed37de75b/timm/models/vision_transformer.py#L293

Did you reach the results in your paper using a single linear head or a head with one hidden layer?
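
For concreteness, here is a minimal PyTorch sketch (not from this repo; the dimensions and class count are hypothetical) of the two head variants the ViT paper describes: a single linear layer for fine-tuning versus an MLP with one hidden layer for pre-training.

```python
import torch
import torch.nn as nn

embed_dim = 768      # hypothetical ViT-B embedding size
num_classes = 1000   # hypothetical number of classes

# Fine-tuning head: a single linear layer (what timm builds as `self.head`).
finetune_head = nn.Linear(embed_dim, num_classes)

# Pre-training head per the ViT paper: one hidden layer (with tanh) before the classifier.
hidden_dim = 768     # hypothetical; corresponds to the "representation size"
pretrain_head = nn.Sequential(
    nn.Linear(embed_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, num_classes),
)

x = torch.randn(2, embed_dim)    # fake pooled [CLS] token features
print(finetune_head(x).shape)    # torch.Size([2, 1000])
print(pretrain_head(x).shape)    # torch.Size([2, 1000])
```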

commented

I am not sure I understand the difference.

Anyway, I have reproduction code for fine-tuning Mixer:
https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/Transfer_learning.md

I believe it will also work for ViT.

Thanks for the reply.
What I meant is that, according to the paper, pre-training and fine-tuning use different heads, but your implementation seems to use the fine-tuning head (a single linear layer) even during pre-training.
Anyway, it might not be an issue since you reached SOTA; I am just asking whether you had noticed.

commented

If you have a suggestion or code for further improvement, feel free to share :-)

google-research/vision_transformer#124
Will see what Google replies.

google-research/vision_transformer#124 (comment)
representation_size is the answer!
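
For reference, a hedged sketch of how representation_size behaved in the timm version linked above (around 0.4.x; newer timm releases may have changed or removed this argument): setting it adds a pre_logits block (Linear + Tanh) between the transformer output and the linear head, which matches the pre-training MLP head from the paper.

```python
from timm.models.vision_transformer import VisionTransformer

# Hedged sketch, assuming a timm version from around the time of this thread (~0.4.x).
# `representation_size` inserts a `pre_logits` block (Linear + Tanh) before the
# final linear classification head.
model = VisionTransformer(
    img_size=224,
    patch_size=16,
    embed_dim=768,
    depth=12,
    num_heads=12,
    num_classes=1000,          # hypothetical class count
    representation_size=768,   # enables the hidden pre-logits layer
)

print(model.pre_logits)  # Linear(768 -> 768) followed by Tanh
print(model.head)        # Linear(768 -> 1000), the plain fine-tuning head
```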