ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Home Page: http://ajabri.github.io/videowalk

Questions about single feature map training

BoPang1996 opened this issue

Thanks for the great work and for sharing the code.

I have some questions about single-feature-map training. I would appreciate it if you could share your answers.

  1. To get rid of the boundary artifacts, did you try the "reflect" or "circular" padding modes? Were they helpful?
  2. From my point of view, the shortcut is caused not only by the boundary artifacts but also by the shared computation among the output features. What do you think is the primary cause of the shortcut?
commented

For the first question: the default padding mode is reflect if args.model_type is set to scratch (in utils/__init__.py#L291). And the author notes "zeros or reflect, doesn't change much; zeros requires lower temperature" in resnet.py#L28. I am not sure whether this applies to the single-feature-map training case, but I think the author has tried different padding modes and found similar performance.
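For readers unfamiliar with why the padding mode matters here, a minimal numpy sketch (illustrative only, not the repo's code) of how zero vs. reflect padding changes what a filter sees at the boundary:

```python
import numpy as np

# Zero vs. reflect padding at an image boundary (illustrative, not repo code).
x = np.arange(1.0, 5.0)                      # a 1-D "row" of pixels: [1, 2, 3, 4]
zero_pad    = np.pad(x, 1, mode="constant")  # [0, 1, 2, 3, 4, 0]
reflect_pad = np.pad(x, 1, mode="reflect")   # [2, 1, 2, 3, 4, 3]

# Apply a 3-tap averaging filter at the left edge:
kernel = np.ones(3) / 3
print(kernel @ zero_pad[:3])     # 1.0   -- edge response pulled toward 0
print(kernel @ reflect_pad[:3])  # ~1.67 -- consistent with interior content
```

With zero padding, every border position sits next to a constant band of zeros; that constant signature gives the network an absolute position cue (and, per the resnet.py comment, zeros requires a lower temperature).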

I am also curious about the author's response to the second question.

Hi, thanks for your interest!

Regarding training with a single feature map:

I think there is more than one way the network can find a shortcut solution. In general, different approaches to preventing the shortcut lead to qualitatively different (and IMO, suboptimal) solutions. You can see this with the PCA feature visualization (one of the visualizations shown in visdom if you set `--visualize`).
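As a standalone illustration of what a PCA feature visualization does (the repo renders this through visdom; the function below is a hypothetical sketch, not the repo's implementation):

```python
import numpy as np

# Hypothetical sketch: project a C-channel feature map onto its top-3
# principal components so it can be rendered as a pseudo-RGB image.
def pca_rgb(feats):
    """feats: (C, H, W) feature map -> (H, W, 3) image with values in [0, 1]."""
    C, H, W = feats.shape
    X = feats.reshape(C, -1).T                  # (H*W, C): pixels as samples
    X = X - X.mean(axis=0, keepdims=True)       # center before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:3].T                         # top-3 components: (H*W, 3)
    proj = (proj - proj.min(0)) / (np.ptp(proj, axis=0) + 1e-8)  # per-channel [0, 1]
    return proj.reshape(H, W, 3)

rgb = pca_rgb(np.random.rand(64, 8, 8))
print(rgb.shape)  # (8, 8, 3)
```

Shortcut solutions tend to show up in these images as smooth position-like gradients rather than object-aligned colors.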

  1. I tried different padding strategies (and also tried randomizing the padding strategy), but this did not fix the issue. I think the presence of boundary information leads to a shortcut that uses distance to the boundaries to encode position information.

  2. Indeed, I think another shortcut has to do with the size of the receptive field and shared computation. I found that deeper networks could minimize the objective more easily. One thing to keep in mind is that we sum the objective over all coordinates in the feature map (since we are tracking every point). So there are many hard tasks that involve tracking the sky, background, etc., which leads to a data imbalance problem. A larger receptive field allows the network to learn features that use global cues in the image, like vanishing lines or dominant edges in the scene, in a way that satisfies many of the constraints while ignoring objects in the scene, especially smaller objects.
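To make the receptive-field point concrete, here is a back-of-the-envelope receptive-field calculation using the standard recurrence (my own sketch with illustrative layer configurations, not the paper's exact architecture):

```python
# Receptive-field size of a conv stack via the standard recurrence:
# r += (kernel - 1) * jump, then jump *= stride, layer by layer.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs. Returns RF in input pixels."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

shallow = [(3, 1)] * 6                         # a few stride-1 3x3 convs
deep = [(7, 2)] + [(3, 2)] * 4 + [(3, 1)] * 8  # strided stem + strided stages
print(receptive_field(shallow))  # 13  -- sees only a small local patch
print(receptive_field(deep))     # 579 -- wider than a whole 512x512 image
```

This is one way to see why depth and stride matter: the deep strided stack can condition every output coordinate on global scene cues, while the shallow stride-1 stack cannot.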

Using a modified architecture (less deep, wider, stride 1, no padding) can indeed mitigate some of these issues, as can training on higher-resolution images (e.g. 512x512) so that the receptive field does not cover the whole image. Another idea I considered is using the entropy of the transition distribution to weight the loss (intuitively, a high-entropy transition distribution means there are many similar nodes in the graph, so the node is less likely to lie on a small object). But I haven't investigated these thoroughly.
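The entropy-weighting idea could be sketched as follows (my reading of the suggestion; `entropy_weights` and the example numbers are illustrative, not the repo's code):

```python
import numpy as np

# Down-weight nodes whose transition distribution has high entropy: many
# near-identical matches suggest a texture-less region (sky, background)
# rather than a distinctive, trackable object.
def entropy_weights(A, eps=1e-8):
    """A: (N, M), rows are transition distributions summing to 1.
    Returns (N,) weights; low entropy -> weight near 1."""
    H = -(A * np.log(A + eps)).sum(axis=1)   # per-row Shannon entropy
    return 1.0 - H / np.log(A.shape[1])      # normalize by the uniform entropy

A = np.array([[0.97, 0.01, 0.01, 0.01],   # confident match (small object?)
              [0.25, 0.25, 0.25, 0.25]])  # ambiguous region (sky/background?)
w = entropy_weights(A)
print(w)  # first row gets a much larger weight than the second
# Per-node losses would then be combined as (w * per_node_loss).sum() / w.sum().
```

This directly targets the data imbalance mentioned above: the many easy-to-confuse background coordinates stop dominating the summed objective.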

I would be interested to hear if you have any ideas! It's a bit mysterious, and resolving this issue would open up a lot of potential follow-ups involving denser learning objectives.

commented

Thanks for your response!