Pre-training head of ViT
cissoidx opened this issue
In the ViT paper, it says:
The classification head is implemented by a MLP with one hidden layer at pre-training
time and by a single linear layer at fine-tuning time
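To make the distinction concrete, here is a minimal sketch of the two head variants the paper describes. The function name and signature are mine, not from the official implementation; the tanh nonlinearity follows the original ViT code's pre-logits layer.

```python
from typing import Optional

import torch.nn as nn


def make_head(embed_dim: int, num_classes: int,
              hidden_dim: Optional[int] = None) -> nn.Module:
    """Hypothetical helper: build either head variant from the paper."""
    if hidden_dim is not None:
        # Pre-training: MLP with one hidden layer.
        return nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )
    # Fine-tuning: a single linear layer.
    return nn.Linear(embed_dim, num_classes)
```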
So if you are using the timm package, the head is defined like this:
https://github.com/rwightman/pytorch-image-models/blob/d3f744065088ca9b6b3a0f968c70e90ed37de75b/timm/models/vision_transformer.py#L293
Did you reach the stats in your paper using a single-layer head or a head with one hidden layer?
I am not sure I understand the difference.
Anyway, I have reproduction code for fine-tuning Mixer:
https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/Transfer_learning.md
I believe it will also work for ViT.
Thanks for the reply.
What I meant is that, according to the paper, pre-training and fine-tuning use different heads. But your implementation seems to use only the fine-tuning head even while you are pre-training.
Anyway, it might not be an issue since you reached SOTA. I am just asking whether you had noticed.
If you have a suggestion or code for further improvement, feel welcome to share :-)
google-research/vision_transformer#124
Will see what Google replies.
google-research/vision_transformer#124 (comment)
`representation_size` is the answer!
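For anyone landing here later: a minimal sketch of the mechanism, assuming the behavior of the timm version linked above. When `representation_size` is set, a hidden tanh layer (pre-logits) sits before the classifier, giving the pre-training MLP head; when it is `None`, pre-logits is an identity and the head is a single linear layer, i.e. the fine-tuning variant. The class below is illustrative, not timm's actual code.

```python
import torch
import torch.nn as nn


class TinyViTHead(nn.Module):
    """Illustrative stand-in for timm's head logic (not the real class)."""

    def __init__(self, embed_dim: int, num_classes: int,
                 representation_size=None):
        super().__init__()
        if representation_size:
            # Pre-training style: hidden layer + tanh before the classifier.
            self.pre_logits = nn.Sequential(
                nn.Linear(embed_dim, representation_size),
                nn.Tanh(),
            )
            head_dim = representation_size
        else:
            # Fine-tuning style: no hidden layer, single linear classifier.
            self.pre_logits = nn.Identity()
            head_dim = embed_dim
        self.head = nn.Linear(head_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.pre_logits(x))
```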