Why not using calibrated_grads directly?
ShuaiZ1037 opened this issue · comments
Hello, I am very interested in your paper, and thank you for the implementation, but I have some questions about your code.
In this line:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads \
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```

why not use `self.calibrated_grads` directly? Instead, you used the refined gradients `self.weight.grad`.
Furthermore, the weights have already been updated in the main function using the refined gradients, so I am very confused about why the refined gradients are used again!
Hi @ShuaiZ1037 ,
Thanks for your interest.
For your first question: basically, what we want here is to update the meta weights with the previous weights and the "refined gradients" (after calibration for pre-processing and refinement in the optimizer). However, this comes with the following issues:
- If `self.calibrated_grads` is used directly, the refinement (e.g. for Adam) will not be incorporated.
- If the refined gradients (`self.weight.grad`) are used directly, the meta network cannot be incorporated into the computational graph (through `calibrated_grads`), since the refined gradients block its connection.
Therefore, I use a somewhat convoluted way to achieve both:
- The refined gradients are what actually update `meta_weight`.
- `calibrated_grads` is kept in the computation so that the meta network can be updated.
That's the reason for line 86: `calibrated_grads` is added to the computational graph, while its value is cancelled out by `- self.calibrated_grads.data`. What actually contributes to the update of `meta_weight` is the refined gradient (`self.weight.grad.data`).
Think of the optimizer as SGD, where the value of `self.calibrated_grads` is the same as `self.weight.grad.data`; with Adam, the two values differ.
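The value-cancellation trick can be sketched in isolation. Below is a minimal toy example (all variable names are illustrative stand-ins, not the repo's actual code): the forward value of `meta_weight` equals a plain update with the refined gradient, while the backward pass sends the gradient through `calibrated` to the meta network's input.

```python
import torch

# Toy sketch of the trick (illustrative names, not the repo's code):
# the forward value uses the "refined" gradient, while backward flows
# through "calibrated" into the (stand-in) meta network.
weight = torch.ones(3)
meta_in = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)  # stands in for the meta network
calibrated = meta_in * 2.0                  # carries a graph back to meta_in
refined = torch.tensor([0.1, 0.2, 0.3])    # plain tensor, like weight.grad.data
lr = 0.1

meta_weight = weight - lr * (calibrated + (refined - calibrated.data).detach())

# Value: numerically identical to a plain update with the refined gradient.
assert torch.allclose(meta_weight, weight - lr * refined)

# Gradient: still reaches meta_in through `calibrated`,
# because only the correction term was detached.
meta_weight.sum().backward()
assert torch.allclose(meta_in.grad, torch.full((3,), -lr * 2.0))
```

With SGD, `refined` and the value of `calibrated` coincide, so the detached correction term vanishes; with Adam they differ, which is when the trick matters.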
For your second question: in line 86 I just add `self.calibrated_grads` into the computation for the meta network's update, but I still need to "actually" update the values in the base network. That is to say, line 86 does not update the real weights of the base network, so I have to update them in the main function using the refined gradients.
Indeed, it is a little bit tricky here. I hope this resolves your question; let me know if it is still confusing.
Best regards,
Shangyu
Hi @csyhhu, thank you for your enthusiastic and quick response.
I get the motivation for using the two gradients:
- `calibrated_grads` for updating the meta-network.
- The refined gradients for incorporating the optimizer's refinement.
But is there a mismatch problem? `self.weight.grad.data` (the gradient from the last step) vs. `self.calibrated_grads` (the predicted gradient of this step). This can be explained as gradient accumulation in some sense. However, is there any mismatch in the backward pass, as sketched below?
- the gradient of `self.weight.grad`, computed from the loss ---> `self.calibrated_grads` ---> meta-network.
Empirically, the results in your paper show its effectiveness.
Thanks,
Shuai
Hi @ShuaiZ1037 ,
I don't think it should be regarded as a mismatch, since that is how standard optimization methods proceed:
Step 1: The weights receive the "natural gradient", obtained by chain-rule back-propagation.
Step 2: The "natural gradient" is refined by the optimization algorithm (such as Adam) to get `self.weight.grad`.
Step 3: The actual weight update uses the refined gradient `self.weight.grad`.
My method simply feeds the "natural gradient" into the meta network to generate the meta gradient and the corresponding `self.calibrated_grads`, which can be regarded as a modification of Step 1. Steps 2 and 3 then proceed as usual.
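Step 2 is where SGD and Adam diverge. The sketch below shows Adam-style gradient refinement in isolation; `adam_refine` is a hypothetical helper written for illustration, not a function from the repo.

```python
import torch

def adam_refine(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical Step-2 helper: turn a natural gradient into Adam's
    refined gradient, returning the updated moment estimates as well."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    refined = m_hat / (v_hat.sqrt() + eps)    # what self.weight.grad would hold
    return refined, m, v

natural = torch.tensor([0.3, -0.3])
refined, m, v = adam_refine(natural, torch.zeros(2), torch.zeros(2), t=1)
# Under SGD, by contrast, the "refined" gradient is just the natural gradient.
```

This is why `self.calibrated_grads` cannot be used directly under Adam: the value that actually updates the weights is this refined gradient, not the raw calibrated one.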
If you are referring to the fact that `self.weight.grad` comes from the loss through `self.calibrated_grads` instead of the true loss of the base model, that is correct and is indeed a gradient mismatch. However, the loss through `self.calibrated_grads` also originates from the loss of the base model.
Best regards,
Shangyu
@ShuaiZ1037 For the question "Why not using calibrated_grads directly?". If SGD is used, calibrated_grads can be used directly. For other optimization methods requiring refinement of the gradient, calibrated_grads needs to be further processed to follow the procedure of the corresponding optimization algorithm.
@csyhhu Hi, thanks a lot for your kind reply. I understand your point; perhaps my earlier description was the problem.
Sorry, let me explain again:
1. My first point is: at the end of iteration t, Adam or SGD is used to obtain the refined gradient, and then every parameter is updated. At iteration t+1, that same refined gradient is still used to compute `self.meta_weight`, which is then used in the convolution. This point is minor; the key issue is below.
2. In

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads \
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```

I understand that the purpose is to pass the gradient of the base model's loss to the meta-net so that the meta-net can be updated. If `self.calibrated_grads` were used directly, the gradient with respect to the loss would of course flow back to the meta-net naturally (though, as you said, the gradient would then not be refined by Adam).
But with the code as you wrote it, the computation actually uses `self.weight.grad.data` to obtain the final loss of the base model, and during back-propagation the gradient of `self.weight.grad.data` with respect to the loss is effectively assigned to `self.calibrated_grads` and then passed to the meta-net.
However, `self.weight.grad` is the gradient generated by the meta-net and refined in the previous iteration, while `self.calibrated_grads` is the gradient generated by the meta-net in the current iteration.
So my confusion is: can the gradient of `self.weight.grad.data` with respect to the loss really be assigned to `self.calibrated_grads`?
I ran your code, and the results are as good as those reported in your paper.
I hope I have expressed my point clearly. If my confusion comes from some basic mistake or a misunderstanding of your paper, please point it out!
Thanks!
Shuai
@ShuaiZ1037 Thanks again for your interest and patience.
`self.weight.grad` is indeed the gradient generated by the meta-net and refined in the previous iteration, but `self.calibrated_grads` also corresponds to the previous iteration's gradient, because `self.calibrated_grads` is produced from the `pre_quantized_weight` generated in the previous iteration.
In this piece of code:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads \
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```

both key gradients are produced from the previous batch of data. This also matches the meaning of gradient descent: the weights used to compute this round's loss are the sum of last round's weights and the gradient produced in last round.
You may notice that, at line 252, the very first iteration of training does not update the network parameters; updates only start afterwards, so there is a sense of "delayed update".
But this should not matter much for your confusion. Still, I don't quite understand your question: "Can the gradient of `self.weight.grad.data` with respect to the loss be assigned to `self.calibrated_grads`?" If you are asking whether `self.weight.grad.data` is ever assigned to `self.calibrated_grads`, I don't think I do that anywhere. `self.calibrated_grads` comes from `pre_quantized_weight`; see this function:
MetaQuant/meta_utils/helpers.py
Line 10 in 3169e0b
If I still haven't resolved your question, please keep asking~
Best regards,
Shangyu
@csyhhu Thanks for your kind and prompt answer.
As you said, `self.calibrated_grads` is synthesized by the meta-net in the current iteration (t) and corresponds to the `pre_quantized_weight` of the previous round (t-1); likewise, `self.weight.grad` is the gradient generated and refined by the meta-net in the previous iteration (t-1), corresponding to the `pre_quantized_weight` of round t-2. So the two gradients are not the same thing. (I am not sure whether my understanding here is correct.)
What I meant by "can the gradient of `self.weight.grad.data` with respect to the loss be assigned to `self.calibrated_grads`?" is the following:
The code

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads \
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```

is meant to pass the base model's gradient to the meta-net. In the forward pass, `self.weight.grad.data` takes part in the computation, but in the backward pass the gradient with respect to the loss, loosely `(self.weight.grad.data).grad` (which does not actually exist, because of the `detach`), is passed (assigned) to `(self.calibrated_grads).grad` and then propagated to the meta-net.
Essentially, my confusion is that the two gradients are not the same thing; they are the gradients of two different iteration steps. With the code used as written, the forward pass uses one gradient while the backward pass uses the other, so wouldn't the back-propagated gradient be mismatched? You have explained the purpose of doing this, but after carefully reading the paper and the code I still don't understand why it is correct, or why this treatment works.
If my understanding is wrong, please point it out! Thanks again for your patient answers!
Thanks!
Shuai
@ShuaiZ1037 Thanks for your explanation.
Now I understand what you mean. After thinking it over, there does indeed seem to be a mismatch between `self.weight.grad.data` and `self.calibrated_grads`: in the current round's update, `self.weight.grad.data` should come from the current `self.calibrated_grads`, not from the previous round's `self.calibrated_grads`.
The correct fix should be to move lines 239-249 as a whole above line 226, which would resolve the mismatch.
The difference is probably small; I will run the related experiments.
Thank you very much for pointing this out!
Best regards,
Shangyu