HobbitLong / CMC

[ECCV 2020] "Contrastive Multiview Coding", also contains implementations for MoCo and InstDis

Questions about NCEAverage.py

vinsis opened this issue

Hi @HobbitLong, thank you for releasing the code. I wanted to ask a few questions about the implementation in `NCEAverage.py`. I understand some of them might be pretty basic, but hopefully the answers will also help others understand the code and implementation better.

- What is the purpose of `T = 0.07`, and why do `out_l` and `out_ab` need to be divided by `T`?

- Is there any advantage to initializing with unit vectors (on average) via `stdv = 1. / math.sqrt(inputSize / 3)` here? I ask because `out_l` and `out_ab` need to be normalized anyway, as is done here.

- Is it correct that you use a moving average (MA) to update `weight_l` and `weight_ab` (instead of just copying the values directly) because the model itself is still learning and the values `l` and `ab` can be noisy? Using an MA reduces variance. (See the sketch after this list.)

- As a follow-up, how would this implementation be possible if you were not using memory banks? Is this an incidental advantage of using a memory bank?

- [Resolved] Why did you not use a gradient-descent-based method to implement NCE? Was it done to reduce the overhead of everything that needs to be learned?

- [Resolved] Lastly, since `NCEAverage` has no parameters or `nn` layers, I believe you don't need `with torch.no_grad()` here.
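
For context, here is a minimal sketch of the memory-bank pattern these questions refer to. The names (`memory_l`, `update_memory`) and defaults (`n_data`, `momentum = 0.5`) are illustrative assumptions, not the repo's exact API:

```python
import math
import torch

# Minimal sketch of the memory-bank pattern discussed above; all
# names, shapes, and defaults are assumptions, not CMC's exact code.
n_data, feat_dim = 50000, 128

# Per-dimension uniform init in [-stdv, stdv]: with stdv = 1/sqrt(d/3),
# E[||x||^2] = 1, so entries start out as unit vectors on average.
stdv = 1.0 / math.sqrt(feat_dim / 3)
memory_l = torch.rand(n_data, feat_dim).mul_(2 * stdv).add_(-stdv)

def update_memory(memory, features, idx, momentum=0.5):
    """Moving-average update of bank rows `idx` with new `features`."""
    # no_grad keeps the autograd graph of `features` from being
    # tracked into (and retained by) the bank.
    with torch.no_grad():
        old = memory.index_select(0, idx)               # (batch, dim)
        new = momentum * old + (1 - momentum) * features
        new = new / new.norm(dim=1, keepdim=True)       # back to unit norm
        memory.index_copy_(0, idx, new)
```

With this pattern, each bank entry drifts toward its newest feature rather than being overwritten outright, which is the variance-reduction effect asked about above.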

Thank you again.

Update: I realized `torch.no_grad()` is there to prevent the issue of buffers being cleared.

In reference to the last point, after some thought I think I understand why `torch.no_grad()` is used: it makes sure the autograd graphs of `l` and `ab` are not tracked into the memory bank (consistent with its use in the sketch above).

In reference to the third point, I think one need not define a context module as is done in CPC. A simple dot product followed by likelihood maximization should be enough; there is no extra parameter to be optimized.
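
To illustrate that idea, here is a hedged sketch in which features are scored against the bank by plain inner products, with no learned context module as in CPC (the function and argument names are hypothetical, not the repo's):

```python
import torch

def nce_scores(features, memory, pos_idx, neg_idx, T=0.07):
    # features: (batch, dim) normalized features
    # pos_idx:  (batch,)   index of each sample's own bank entry
    # neg_idx:  (batch, K) indices of sampled negatives
    pos = memory[pos_idx]                                    # (batch, dim)
    neg = memory[neg_idx]                                    # (batch, K, dim)
    pos_score = (features * pos).sum(dim=1, keepdim=True)    # (batch, 1)
    neg_score = torch.bmm(neg, features.unsqueeze(2)).squeeze(2)  # (batch, K)
    # Inner products alone, scaled by T; nothing here is learned.
    return torch.cat([pos_score, neg_score], dim=1) / T
```

The resulting `(batch, 1 + K)` scores can then go into NCE, or into a softmax cross-entropy with index 0 as the positive.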

I think I figured out answers to most of the questions I asked. However, I still don't know the purpose of `T`. I would really appreciate it if you could answer. Thanks.

Hi, sorry for missing the message.

So `T` is used to adjust the dynamic range of the scores before the softmax cross-entropy. Since the features are normalized, the inner product only gives you something between -1 and 1, which is too narrow a range to discriminate positives from negatives sharply. `T` plays the role of adjusting that range.
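
A quick numeric illustration of that point (not from the repo): with unit-norm features the scores live in [-1, 1], so a softmax over them is nearly flat; dividing by `T = 0.07` stretches the range to roughly [-14.3, 14.3] and lets the distribution peak sharply:

```python
import torch

scores = torch.tensor([0.9, 0.1, -0.2])     # cosine similarities, range [-1, 1]
print(torch.softmax(scores, dim=0))         # ~[0.56, 0.25, 0.19] -> nearly flat
print(torch.softmax(scores / 0.07, dim=0))  # ~[1.0, 1e-5, 1e-7] -> sharply peaked
```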

Ah, I see. So `T` is a hyperparameter whose value was chosen empirically, I guess. Thank you!
Closing the issue now.