ShoufaChen / AdaptFormer

[NeurIPS 2022] Implementation of "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"

Home Page: https://arxiv.org/abs/2205.13535

Two questions about the experimental results in Table 1 of the paper.

yangzhen1997 opened this issue · comments

Hi, I would like to ask you two questions about the experimental results in the paper's Table 1.
First, where was the 53.97 accuracy for full tuning on SSv2 obtained?
(Screenshot 2022-10-10 17:56:13)
When I read VideoMAE, I found that pre-training on SSv2 and then fine-tuning on SSv2 reaches 69.3. I know your paper uses the K400 pre-trained parameters, but I also ran experiments and can reach 65+ with 50 epochs of fine-tuning on SSv2:
(Screenshot: my fine-tuning results)

  1. So my first question is: where does the 53.97 come from?
  2. The second question: I also could not find the data shown in the screenshot below in the table; is it a typo?

(Screenshot 2022-10-10 18:10:30)

Hi,

Thanks for your raised question.

  1. May I know your detailed configuration, including the command and the pre-trained weights?

  2. Good catch. We are sorry for that typo: we updated the table but missed the main text. Thanks again for pointing it out; we have fixed it in the camera-ready version.

Thanks for your reply. Did you experiment with the VideoMAE codebase?

I guess you experimented with strong data augmentation and an optimizer such as AdamW. For a fair comparison with linear probing, we use the same setting as the linear probe, which uses SGD and no strong data augmentation.
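Roughly, the difference between the two protocols looks like the sketch below (PyTorch; the transforms and hyperparameters are placeholders for illustration, not our exact training configuration):

```python
import torch
import torchvision.transforms as T

# Stand-in for the classification head of a video ViT; 174 = number of SSv2 classes.
model = torch.nn.Linear(768, 174)

# Linear-probe-style protocol (the setting used for the full-tuning baseline):
# plain SGD and only basic preprocessing.
basic_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)

# VideoMAE-style fine-tuning protocol: AdamW plus strong augmentation.
strong_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandAugment(),
    T.ToTensor(),
    T.RandomErasing(p=0.25),
])
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
```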

Please let me know if I miss something.

Thanks for your reply. I will rerun the experiment to verify it!

@ShoufaChen @yangzhen1997 I ran into the same problem. Even after removing those augmentations and the AdamW optimizer, would your method still improve the results? In my experiments, adding augmentations and the AdamW optimizer did not improve (and sometimes degraded) performance. In full fine-tuning they are used to reduce overfitting, since many parameters are being tuned; with VPT and your method, however, only a small fraction of the parameters is tuned, so they do not help. Would it therefore be fair to report the full fine-tuning results without any augmentations or sophisticated optimizers?
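To make concrete what I mean by tuning only a small fraction of parameters, here is a rough sketch (hypothetical module names and sizes, not the actual AdaptFormer or VPT implementation):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Bottleneck adapter inserted alongside a frozen block (illustrative only)."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual bottleneck: only these few parameters are trained.
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for one pre-trained ViT block (in practice, the whole backbone is frozen).
backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
adapter = TinyAdapter()

# Freeze the backbone; only the adapter (and classifier head) parameters remain trainable.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"trainable: {trainable} / total: {total} ({100 * trainable / total:.1f}%)")
```

With so little trainable capacity, strong augmentation and a heavier optimizer have much less overfitting to counteract, which matches what I observed.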