ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Home Page: http://ajabri.github.io/videowalk


More details about the experiments to avoid the trivial shortcut solution

jiayao6 opened this issue · comments

Thank you so much for your great work and for sharing the code.

In the Supplementary, Section C: Using a Single Feature Map for Training, you designed four experiments to avoid the trivial solution, in which the network learns a shortcut that relies on boundary artifacts.
I want to know more details about the first two experiments, i.e.

  1. removing padding altogether;
  2. reducing the receptive field of the network to the extent that entries in the center crop of the spatial feature map do not see the boundary; we then cropped the feature map to only see this region.

My questions are:
a) How do you remove the padding? Does it mean setting the values in the padded region to zero, or setting the padding size to zero?
b) If you set the padding size to zero, what is the shape of the network's output features? I think it will be much smaller than the original; how do you compare the feature maps?
c) Also, if you set the padding size to zero, what is the difference between experiments 1) and 2)?

Looking forward to your reply, thank you!

@ajabri I've got the same question! How did you do the "patch feature extraction" from the feature map for the failing "shortcut" case? What was the feature map resolution? Something like 14 x 14? 28 x 28?

Did you then consider 196 / 784 patches per image directly, or did you pool the feature maps prior to this extraction?

Thanks!

Thanks for the questions, apologies for the delay in my reply. You can check out issue #6 for related discussion.

a/b) I tried both. Using reflection padding or randomized padding did not seem to make a difference. In the case of removing padding altogether (i.e. 'valid' convolution), one has to be careful with network structures that use skip connections, or resize the feature maps before applying the skip connections; it is also harder to use deeper nets, since each convolution reduces the size of the feature map. Comparing the feature maps across time is not an issue, though, since the feature maps of each time step have the same shape. It is also possible to resize the feature map after each convolution.
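To make the shape bookkeeping concrete, here is a minimal NumPy sketch (not code from this repo; the helper names and sizes are illustrative assumptions) of how padding-free ('valid') 3x3 convolutions shrink the feature map, and how a nearest-neighbour resize can bring it back to a common shape before a skip connection or comparison:

```python
import numpy as np

def valid_conv2d_output_shape(h, w, kernel, stride=1):
    """Spatial shape after a 'valid' (padding-free) convolution."""
    return (h - kernel) // stride + 1, (w - kernel) // stride + 1

def nearest_resize(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map, so that
    'valid' layers can be brought back to a common shape (e.g. before
    a skip connection)."""
    c, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h   # source row index per output row
    cols = np.arange(out_w) * w // out_w   # source col index per output col
    return feat[:, rows][:, :, cols]

# A 256x256 input through three padding-free 3x3 convs loses 2 pixels
# per layer on each spatial dimension: 256 -> 254 -> 252 -> 250.
h, w = 256, 256
for _ in range(3):
    h, w = valid_conv2d_output_shape(h, w, kernel=3)
print(h, w)  # 250 250

feat = np.zeros((64, h, w))
print(nearest_resize(feat, 256, 256).shape)  # (64, 256, 256)
```

Since every frame passes through the same network, the shrunken maps at each time step still share a shape, which is why the comparison itself stays straightforward.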

c) In this case, there is no effective difference between the two settings; the no-padding case is simply more efficient. However, for deeper networks, the effective receptive field becomes quite large (i.e. it can cover the whole image), so at a certain point it is impossible to select a point without connections to the border.
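For reference, the receptive field can be computed with the standard recurrence over a stack of conv/pool layers, which also tells you how large a border-safe center crop of the feature map is. A small sketch (the layer configuration below is only an illustrative assumption, not the paper's exact architecture):

```python
def receptive_field(layers):
    """Effective receptive field and overall stride ('jump') of a stack
    of conv/pool layers, each given as (kernel_size, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * jump
        jump *= s              # stride compounds across layers
    return rf, jump

# Illustrative stack: a 7x7/2 stem, a 3x3/2 pool, then eight 3x3/1 convs.
layers = [(7, 2), (3, 2)] + [(3, 1)] * 8
rf, jump = receptive_field(layers)
print(rf, jump)  # 75 4

# Entries whose receptive field touches the border are roughly the outer
# ceil((rf // 2) / jump) rows/columns of the feature map, so the safe
# center crop drops that many entries on each side.
crop = -(-(rf // 2) // jump)
print(crop)  # 10
```

Once rf approaches the image size, the safe crop shrinks to nothing, which is the regime described above where no entry is free of border influence.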

I should say that in the regime where the image resolution is larger (e.g. 480x480) and the receptive field is not that large (e.g. 100-200 pixels), the suboptimal solution does not always involve solely relying on the border; from my observations, one failure mode is relying on high-frequency features to discriminate between neighbouring points, such that these features dominate. A possible solution is to modify the objective to weight direct neighbours less aggressively than more distant points, or to use an approach like BYOL to compute the space-time walk loss, so that we needn't necessarily make neighbouring points different (while the transition matrices can still be computed in the same way).
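As a rough illustration of the "weight direct neighbours less aggressively" idea (purely one possible interpretation, not the authors' implementation; the function names, the Gaussian soft target, and the hyperparameters are all assumptions), one could replace the identity target of the cycle-consistency loss with a soft spatial target that tolerates the walk returning to a nearby patch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_target_cycle_loss(feats_t, feats_t1, coords, sigma=1.0, temp=0.07):
    """Two-step walk t -> t+1 -> t scored against a spatially soft target.

    feats_t, feats_t1: (N, D) L2-normalised patch features at times t, t+1.
    coords: (N, 2) patch-centre coordinates on the feature grid.
    Instead of forcing the walk to return exactly to its start node
    (identity target), a Gaussian over grid distance makes returning to a
    direct neighbour cost less than returning to a far-away node, so the
    features need not discriminate adjacent patches as aggressively.
    """
    sim = feats_t @ feats_t1.T                       # (N, N) similarities
    A = softmax(sim / temp, axis=1)                  # transitions t -> t+1
    B = softmax(sim.T / temp, axis=1)                # transitions t+1 -> t
    P = A @ B                                        # round-trip probabilities
    d2 = ((coords[:, None] - coords[None]) ** 2).sum(-1)
    target = np.exp(-d2 / (2 * sigma ** 2))          # soft labels, peaked at self
    target /= target.sum(axis=1, keepdims=True)
    return float(-(target * np.log(P + 1e-8)).sum(axis=1).mean())
```

As sigma approaches 0 this reduces to the usual cross-entropy against the identity; a larger sigma relaxes the pressure to make neighbouring patches' features distinct, while the transition matrices A and B are still computed exactly as in the original walk.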