Questions about NCEAverage.py
vinsis opened this issue
Hi @HobbitLong, thank you for releasing the code. I wanted to ask a few questions regarding the implementation of `NCEAverage.py`. I understand some of them might be pretty basic questions, but hopefully the answers will also help others understand the code and implementation better.
- What is the purpose of `T=0.07`, and why do `out_l` and `out_ab` need to be divided by `T`?
- Is there any advantage of starting out with unit vectors (on average) by implementing `stdv = 1. / math.sqrt(inputSize / 3)` here? I ask because `out_l` and `out_ab` need to be normalized anyway, as is done here.
- Is it correct that you use a moving average (MA) to update `weight_l` and `weight_ab` (instead of just copying the values directly) because the model itself is still learning and the values of `l` and `ab` can be noisy? Using an MA reduces variance. (See the memory-bank sketch after this list.)
- As a follow-up, how would this implementation be possible if you were not using memory banks? Is this an incidental advantage of using a memory bank?
- [Resolved] Why did you not use a gradient-descent-based method to implement NCE? Was it done to reduce the overhead of all the things that needed to be learnt?
- [Resolved] Lastly, since `NCEAverage` has no parameters or nn layers, I believe you don't need `with torch.no_grad()` here.
Thank you again.
Update: I realized `torch.no_grad()` is there to prevent the issue of buffers being cleared.
In reference to the last point, after some thought I think I realize why `torch.no_grad()` is used: it is to make sure the gradients associated with `l` and `ab` are not stored in the memory bank.
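As a toy illustration of that point (shapes and names are made up): operations executed under `torch.no_grad()` are detached from the autograd graph, so the values written into the bank carry no gradient history.

```python
import torch

features = torch.randn(4, 16, requires_grad=True)  # stand-in for the l / ab features
bank = torch.randn(100, 16)                        # stand-in memory bank

with torch.no_grad():
    # Everything computed here is detached, so the bank stores plain values,
    # not nodes of the autograd graph.
    bank[:4] = 0.5 * bank[:4] + 0.5 * features

print(bank.requires_grad)  # False: no gradient history was recorded
```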
In reference to the third point, I think one need not define a context as was done in CPC. A simple dot product followed by likelihood maximization should be enough. There is no parameter to be optimized.
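A hedged sketch of what that would look like (hypothetical names, not the repo's code): each feature is scored against the bank entries with a plain dot product, and the likelihood of the positive entry is maximized via cross-entropy, with no learned context network in between.

```python
import torch
import torch.nn.functional as F

bank = F.normalize(torch.randn(100, 16), dim=1)  # hypothetical memory bank (unit norm)
feat = F.normalize(torch.randn(4, 16), dim=1)    # hypothetical query features (unit norm)
pos_idx = torch.tensor([0, 1, 2, 3])             # index of each feature's positive entry

logits = feat @ bank.t()                 # plain dot products, no extra parameters
loss = F.cross_entropy(logits, pos_idx)  # maximize likelihood of the positive
```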
I think I figured out answers to most of the questions I asked. However, I still don't know the purpose of `T`. I would really appreciate it if you could answer. Thanks.
Hi, sorry for missing the message.
So `T` is used to adjust the dynamic range of the score before the softmax cross-entropy loss. As the features are normalized, the inner product only gives you something between -1 and 1, which is insufficient for discriminating positives from negatives. So `T` plays the role of adjusting that range.
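A quick numeric illustration of that answer (the similarity values here are made up): with unit-norm features the logits lie in [-1, 1], so the softmax is nearly flat; dividing by T = 0.07 stretches them to roughly [-14.3, 14.3] and lets the positive dominate.

```python
import torch

logits = torch.tensor([0.9, 0.3, 0.1, -0.2])  # cosine similarities, all in [-1, 1]

print(torch.softmax(logits, dim=0))
# ~tensor([0.43, 0.24, 0.19, 0.14])  -- nearly uniform, weak discrimination

print(torch.softmax(logits / 0.07, dim=0))
# ~tensor([1.00, 0.00, 0.00, 0.00])  -- the positive clearly dominates
```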
Ah, I see. So `T` is a hyperparameter whose value was decided empirically, I guess. Thank you!
Closing the issue now.