Questions about NCEAverage.py
vinsis opened this issue
Hi @HobbitLong, thank you for releasing the code. I wanted to ask a few questions regarding the implementation of `NCEAverage.py`. I understand some of them might be pretty basic questions, but hopefully the answers will also help others understand the code and implementation better.
- What is the purpose of `T=0.07`, and why do `out_l` and `out_ab` need to be divided by `T`?
- Is there any advantage of starting out with unit vectors (on average) by implementing `stdv = 1. / math.sqrt(inputSize / 3)` here? I ask because `out_l` and `out_ab` need to be normalized anyway, as is done here.
- Is it correct that you use a moving average (MA) to update `weight_l` and `weight_ab` (instead of just copying the values directly) because the model itself is still learning and the values of `l` and `ab` can be noisy? Using an MA reduces variance. (See the memory-bank sketch after this list.)
- As a follow-up, how would this implementation be possible if you were not using memory banks? Is this an incidental advantage of using a memory bank?
- [Resolved] Why did you not use a gradient-descent-based method to implement NCE? Was it done to reduce the overhead of all the things that needed to be learnt?
- [Resolved] Lastly, since `NCEAverage` has no parameters or nn layers, I believe you don't need `with torch.no_grad()` here.
Thank you again.
Update: I realized `torch.no_grad()` is there to prevent the issue of buffers being cleared.
In reference to the last point, after some thought I think I realize why `torch.no_grad()` is used: it is to make sure the gradients associated with `l` and `ab` are not stored in the memory bank.
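As a toy illustration of that point (shapes and names are made up): operations executed under `torch.no_grad()` are detached from the autograd graph, so the values written into the bank carry no gradient history.

```python
import torch

features = torch.randn(4, 16, requires_grad=True)  # stand-in for the l / ab features
bank = torch.randn(100, 16)                        # stand-in memory bank

with torch.no_grad():
    # Everything computed here is detached, so the bank stores plain values,
    # not nodes of the autograd graph.
    bank[:4] = 0.5 * bank[:4] + 0.5 * features

print(bank.requires_grad)  # False: no gradient history was recorded
```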
In reference to the third point, I think one need not define a context as was done in CPC. A simple dot product followed by likelihood maximization should be enough. There is no parameter to be optimized.
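A hedged sketch of what that would look like (hypothetical names, not the repo's code): each feature is scored against the bank entries with a plain dot product, and the likelihood of the positive entry is maximized via cross-entropy, with no learned context network in between.

```python
import torch
import torch.nn.functional as F

bank = F.normalize(torch.randn(100, 16), dim=1)  # hypothetical memory bank (unit norm)
feat = F.normalize(torch.randn(4, 16), dim=1)    # hypothetical query features (unit norm)
pos_idx = torch.tensor([0, 1, 2, 3])             # index of each feature's positive entry

logits = feat @ bank.t()                 # plain dot products, no extra parameters
loss = F.cross_entropy(logits, pos_idx)  # maximize likelihood of the positive
```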
I think I figured out answers to most of the questions I asked. However, I still don't know the purpose of `T`. I would really appreciate it if you could answer. Thanks.
Hi, sorry for missing the message.
So `T` is used to adjust the dynamic range of the score before the softmax cross-entropy loss. As the features are normalized, the inner product only gives you something between -1 and 1, which is insufficient for discriminating positives from negatives. So `T` plays the role of adjusting that range.
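A quick numeric illustration of that answer (the similarity values here are made up): with unit-norm features the logits lie in [-1, 1], so the softmax is nearly flat; dividing by T = 0.07 stretches them to roughly [-14.3, 14.3] and lets the positive dominate.

```python
import torch

logits = torch.tensor([0.9, 0.3, 0.1, -0.2])  # cosine similarities, all in [-1, 1]

print(torch.softmax(logits, dim=0))
# ~tensor([0.43, 0.24, 0.19, 0.14])  -- nearly uniform, weak discrimination

print(torch.softmax(logits / 0.07, dim=0))
# ~tensor([1.00, 0.00, 0.00, 0.00])  -- the positive clearly dominates
```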
Ah, I see. So `T` is a hyperparameter whose value was decided empirically, I guess. Thank you!
Closing the issue now.