compare to CvT
liyunsheng13 opened this issue · comments
Hi
Thanks for sharing this good work. I'm curious why the proposed loss function can outperform CvT, which contains a depthwise convolution that is capable of learning local features.
Hi @liyunsheng13,
Good question. We see similar results with ResNet, which consists only of convolutional layers. We guess that the proposed loss acts as a regularizer, which helps both VTs and CNNs learn local features better, especially in the earlier epochs. You're right that convolutional layers are capable of learning local features. In our experiments, with longer training the improvement becomes marginal or the performance is the same.
Got it. Thanks for your response.