yhlleo / VTs-Drloc

NeurIPS 2021, Official codes for "Efficient Training of Visual Transformers with Small Datasets".


compare to CvT

liyunsheng13 opened this issue

Hi

Thanks for sharing this good work. I'm curious why the proposed loss function can outperform CvT, which contains a depthwise convolution that is capable of learning local features.

Hi @liyunsheng13,

Good question. We see something similar in the ResNet results, where the network has only convolutional layers. Our guess is that the proposed loss acts as a regularizer, which helps both VTs and CNNs learn local features better, especially in the earlier epochs. You're right that convolutional layers are capable of learning local features. In our experiments, with longer training we see only a marginal improvement, or the same performance.

Besides, if you check our supplementary materials, we also show a figure on training different VTs on CIFAR-100.
[Figure: training of different VTs on CIFAR-100; x-axis: epochs]

It shows that the gain from our loss is clearly smaller for the architecture with depthwise convolution.
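For readers following the thread, here is a minimal sketch of what "the loss acts as a regularizer" could look like in code: an auxiliary localization term is added to the task loss with a weight. This is an illustration under stated assumptions, not the repository's implementation. The names `localization_loss`, `total_loss`, `predict_offset`, and `lam` are hypothetical, and the target (offset between two grid positions divided by the grid size) is an assumed form that may differ from the paper's exact definition.

```python
import random

def localization_loss(embeddings, predict_offset, num_pairs, rng):
    """Mean L1 error between predicted and true normalized offsets.

    embeddings: k x k grid (list of lists) of token embeddings.
    predict_offset: hypothetical head mapping two embeddings to (du, dv).
    """
    k = len(embeddings)
    total = 0.0
    for _ in range(num_pairs):
        # Sample two random positions on the token grid.
        i, j = rng.randrange(k), rng.randrange(k)
        p, h = rng.randrange(k), rng.randrange(k)
        # Assumed target: grid offset normalized by the grid size.
        target_u, target_v = (i - p) / k, (j - h) / k
        pred_u, pred_v = predict_offset(embeddings[i][j], embeddings[p][h])
        total += abs(pred_u - target_u) + abs(pred_v - target_v)
    return total / num_pairs

def total_loss(task_loss, aux_loss, lam=0.1):
    """Task loss plus the weighted auxiliary term (lam is an assumed weight)."""
    return task_loss + lam * aux_loss

# Toy usage: embeddings that simply store their own grid position.
rng = random.Random(0)
grid = [[(i, j) for j in range(4)] for i in range(4)]
guess_zero = lambda a, b: (0.0, 0.0)  # a head that has learned nothing
aux = localization_loss(grid, guess_zero, num_pairs=32, rng=rng)
print(total_loss(task_loss=2.0, aux_loss=aux))
```

Because the auxiliary term only penalizes the network when its embeddings fail to encode relative position, it can shape the features of any backbone that produces a token or feature grid, which is consistent with the observation above that architectures with convolutions benefit less.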

Got it. Thanks for your response.