valeoai / obow

Asymmetry between the student and the teacher networks

GoGoDuck912 opened this issue

Great work and thanks for sharing!
Here I have two questions about the method design:

  1. I notice that in the teacher network the codes are computed with the L2 distance, while in the student network they are computed with the inner product (cosine). Any special insight into this choice?
  2. In the student network, the dynamically generated words are l2-normalized, whereas the features from the backbone, S(x), are not. May I ask why?

Hi @PeikeLi

Thank you very much for your interest in our work!

Indeed, for the teacher network we use the L2 distance for assigning the local features to visual words. We also tried the cosine distance but did not observe any significant difference. Since the L2 distance was also what we used in our previous work, BoWNet (https://arxiv.org/abs/2002.12247), on which OBoW builds, we stuck with that option.
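
For concreteness, here is a minimal sketch (not the exact code from this repo) of how a teacher network could soft-assign local features to visual words using the L2 distance. The function name, the softmax-over-negative-distances form, and the temperature value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def teacher_codes_l2(feats, words, temperature=0.1):
    """Soft-assign local features to visual words via L2 distance.

    feats: (N, D) local feature vectors from the teacher backbone.
    words: (K, D) visual-word embeddings (the vocabulary).
    Returns: (N, K) soft assignment codes (rows sum to 1).
    """
    # Squared L2 distance between every feature and every word: (N, K).
    dists = torch.cdist(feats, words, p=2).pow(2)
    # Smaller distance => higher assignment weight; the temperature is
    # an illustrative hyperparameter, not a value taken from the paper.
    return F.softmax(-dists / temperature, dim=1)
```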

Regarding the student: again in our previous work BoWNet, we noticed that it is important to l2-normalize the weights, because the distribution of the visual words over the dataset tends to be unbalanced (see section 2.3 of BoWNet). We observed similar behaviour in OBoW, i.e., l2-normalization leads to better results. On the other hand, when we experimented with l2-normalizing the feature vectors generated by the backbone, we noticed a small drop in performance on the downstream task of linear classification. We suspect this is because it alters the way the backbone is used: the backbone would be pre-trained with an l2-normalization that is then absent on the downstream task.
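
A minimal sketch of this asymmetry on the student side, assuming a simple linear prediction head over the vocabulary; the `scale` factor and all names are illustrative assumptions rather than the repository's exact API:

```python
import torch
import torch.nn.functional as F

def student_prediction(feats, gen_words, scale=5.0):
    """Predict a distribution over visual words from student features.

    feats:     (N, D) student backbone features S(x), left UNnormalized.
    gen_words: (K, D) dynamically generated word embeddings.
    Returns:   (N, K) predicted soft BoW over the vocabulary.
    """
    # Only the generated word vectors are l2-normalized (cf. section 2.3
    # of BoWNet); the backbone features themselves are not.
    gen_words = F.normalize(gen_words, p=2, dim=1)
    # Inner product with unit-norm words: cosine similarity up to ||feats||.
    # The scale value here is an assumed hyperparameter for illustration.
    logits = scale * feats @ gen_words.t()
    return F.softmax(logits, dim=1)
```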