rstrudel / segmenter

[ICCV2021] Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation


How much does ImageNet pre-training affect model performance?

fbragman opened this issue · comments


Hi,

I am trying to use the baseline model (Linear decoder) described in the paper as a baseline for some of my own work. However, I do not have access to pre-trained ImageNet weights, and my model is not able to learn, converging at around 0.25 mDICE on the Cityscapes training set. This is after hyperparameter optimisation across SGD, Adam, and different learning rate schedulers.
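For reference, here is a minimal sketch of the kind of ViT + Linear head I mean, assuming the backbone emits patch tokens of shape `(B, N, D)`; the class and argument names below are my own illustration, not the repo's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder(nn.Module):
    """Illustrative ViT + Linear baseline head: one linear layer maps each
    patch embedding to class logits, which are reshaped into a coarse map
    and bilinearly upsampled to the input resolution."""

    def __init__(self, embed_dim: int, n_classes: int, patch_size: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, n_classes)
        self.patch_size = patch_size

    def forward(self, tokens: torch.Tensor, im_size: tuple) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings (class token already removed)
        H, W = im_size
        h, w = H // self.patch_size, W // self.patch_size
        x = self.head(tokens)                              # (B, N, n_classes)
        x = x.transpose(1, 2).reshape(x.size(0), -1, h, w) # (B, n_classes, h, w)
        return F.interpolate(x, size=(H, W), mode="bilinear", align_corners=False)

# Toy usage with hypothetical shapes (ViT-Tiny-ish dims, 19 Cityscapes classes)
dec = LinearDecoder(embed_dim=192, n_classes=19, patch_size=16)
tokens = torch.randn(2, (224 // 16) ** 2, 192)
logits = dec(tokens, (224, 224))   # (2, 19, 224, 224)
```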

I was wondering whether you saw similar levels of performance in your experiments when you did not initialise your transformer backbones with pre-trained weights. Was this tested for both the baseline (ViT + Linear) and your proposed method (ViT + Mask)?

Thank you

Hi @fbragman ,

Thanks for your question. We did check the performance when training from scratch on ADE-20k; you can find the results in the appendix of our paper. The ablation is for ViT + Linear.

Pre-training is key in general for detection and localization tasks such as segmentation.
The main reason is that downstream task datasets (such as ADE-20k, Cityscapes or Pascal) are simply too small compared to a classification dataset such as ImageNet or ImageNet-21k. Current deep learning models need more data to reach good performance, and this holds whether the backbone is a CNN or a Transformer.
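To give a sense of the scale gap, here is a rough comparison using commonly cited training-set sizes (approximate figures, for illustration only):

```python
# Approximate, commonly cited numbers of training images; illustrative only.
sizes = {
    "Cityscapes (fine)": 2_975,
    "Pascal Context": 4_998,
    "ADE20K": 20_210,
    "ImageNet-1k": 1_281_167,
    "ImageNet-21k": 14_197_122,
}

# ImageNet-1k alone is roughly 60x larger than ADE20K,
# and ImageNet-21k is larger still by another order of magnitude.
ratio = sizes["ImageNet-1k"] / sizes["ADE20K"]
print(f"ImageNet-1k / ADE20K ~ {ratio:.0f}x")
```

This is why pre-training on the large classification dataset, then fine-tuning on the small segmentation dataset, dominates training from scratch.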