ajabri / videowalk

Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Home page: http://ajabri.github.io/videowalk

Best feature

Zhongdao opened this issue

Hi Allan,
Great work!
I see that in the test code, layer4 of the ResNet is removed by default.
May I know whether this is also the case during training?
Or is it better to train with layer4 but test with layer3?

Hi, thanks for your interest!

First, I should note that it only makes sense to transfer the layer3 features for pretrained nets, like the ImageNet-pretrained and MoCo baselines. These networks are trained with layer3 at stride 2, so if we keep layer4 but change layer3 to stride 1, the features fed to layer4 will be out of distribution (OOD).
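
For concreteness, here is a minimal sketch of that surgery on torchvision's ResNet-18. It is purely illustrative (not the repo's actual code, which wraps the model differently): set the strided convs in layer3's first block to stride 1 and drop layer4, so the features come out at stride 8 rather than 16.

```python
# Illustrative sketch only; assumes torchvision's ResNet-18.
# The repo's actual model wrapper and flags may differ.
import torch
import torch.nn as nn
import torchvision.models as models

net = models.resnet18(weights=None)  # swap in a pretrained checkpoint as needed

# Run layer3 at stride 1: patch the strided convs in its first block.
net.layer3[0].conv1.stride = (1, 1)
net.layer3[0].downsample[0].stride = (1, 1)

# Remove layer4 and the classification head entirely.
net.layer4, net.avgpool, net.fc = nn.Identity(), nn.Identity(), nn.Identity()

def extract(x):
    # Run the stem and the remaining stages by hand, stopping at layer3.
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    return net.layer3(net.layer2(net.layer1(x)))

feats = extract(torch.randn(1, 3, 256, 256))
print(feats.shape)  # torch.Size([1, 256, 32, 32]): stride 8, not 16
```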

Also, note that there is a --remove-layers flag for training as well; by default, we train with layer4, so this flag defaults to an empty list. As I mentioned in the appendix, we found that transferring layer3 features also seemed to work better on the reported dense correspondence tasks; if you test with layer4 instead, you will see a drop of a few points. For fairness, I also report the results for UVC re-evaluated with my code (using the layer3 output), which leads to a slight relative improvement as well (see Table 1 of https://arxiv.org/abs/2006.14613v2).
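
For readers unfamiliar with this evaluation: it is a label-propagation setup, where labels from a source frame are carried to a target frame through feature affinities. Below is a hedged, self-contained sketch of that idea; the function name, temperature, and top-k values are illustrative choices, not the repo's defaults.

```python
import torch
import torch.nn.functional as F

def propagate(feat_src, feat_tgt, labels_src, temperature=0.07, topk=10):
    """Carry (K, H, W) soft labels from a source frame to a target frame
    via affinities between (C, H, W) feature maps (e.g. layer3 output)."""
    C, H, W = feat_src.shape
    f_src = F.normalize(feat_src.reshape(C, -1), dim=0)  # (C, HW)
    f_tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)  # (C, HW)
    aff = (f_tgt.t() @ f_src) / temperature              # (HW_tgt, HW_src)
    val, idx = aff.topk(topk, dim=1)                     # top-k source pixels
    weights = F.softmax(val, dim=1)                      # (HW_tgt, topk)
    lbls = labels_src.reshape(labels_src.shape[0], -1)   # (K, HW_src)
    out = (lbls[:, idx] * weights.unsqueeze(0)).sum(-1)  # (K, HW_tgt)
    return out.reshape(-1, H, W)

# Example: propagate a 2-class mask between two random stride-8 feature maps.
f1, f2 = torch.randn(256, 32, 32), torch.randn(256, 32, 32)
mask = F.one_hot(torch.randint(2, (32, 32))).permute(2, 0, 1).float()
print(propagate(f1, f2, mask).shape)  # torch.Size([2, 32, 32])
```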

I have not tried training without layer4.

Many thanks!

It's intuitive that layer3 features outperform layer4 features, since mid-level features are better suited for tracking.
The inconsistency between training and testing is interesting (or a little weird). Maybe it's another open problem.

By the way, have you tried testing on the single-object tracking dataset OTB, as done in UVC? I see they did not release that part of the code, and I tried and failed to reproduce their OTB results. Do you have any plan for this?

RE inconsistency: True, though I suspect that specializing the architecture further for the task would lead to even more improvement. But the aim here was to change the ResNet architecture minimally, so that this objective can be combined with others in a modular manner.

RE OTB: I have not, and I was not planning to. Do you think it would be a valuable experiment? We were also considering transferring the features to a detection task (another frequently requested experiment).

@ajabri
I think the current experiments are sufficient to support the idea well. I am just curious about the performance of such self-supervised models on SOT benchmarks, since I come from the tracking community (and the results seem quite good).

As for transferring the features to a detection task, in my opinion it makes less sense than testing on tracking tasks, because the correspondence learning framework is not designed for discovering objects.
I also do not think it is necessary to compare with self-supervised methods that learn a generic representation (MoCo, etc.) by transferring features: the biggest advantage of the correspondence learning framework is that its features can be used for tracking directly.
That said, it would be interesting if results on detection tasks were better than those of existing methods ...

Thanks for the feedback. While I agree that the learned representation is well-suited for tracking tasks, in that we are modeling factors of variation of the same instance, I suppose our philosophy is that there is a spectrum between instances and (semantic) categories; perhaps being good at robustly and densely tracking parts of the scene requires generically modeling object boundaries and 'objectness'. In this sense, tracking is more a means to an end for learning useful representations of the environment, rather than the other way around.

That seems to make sense. Looking forward to your future work!