csyhhu / MetaQuant

Code for the NeurIPS 2019 paper: "MetaQuant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization"

Why not using calibrated_grads directly?

ShuaiZ1037 opened this issue · comments

commented

Hello, I am very interested in your paper; thank you for the implementation. But I have some questions about your code.
In this line:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads +
          (self.weight.grad.data - self.calibrated_grads.data).detach())
```

Why not use self.calibrated_grads directly? Instead, you used the refined gradients: self.weight.grad.

Furthermore, the weights have already been updated in the main function using the refined gradients,
so I am confused about why the refined gradients are used again!

Hi @ShuaiZ1037 ,
Thanks for your interest.

For your first question: basically, what we want here is to update the meta weights using the previous weights and the "refined gradients" (after calibration for pre-processing and refinement in the optimizer). But this raises the following issues:

  1. If self.calibrated_grads is used directly, the refinement (for Adam) will not be incorporated.
  2. If the refined gradients (self.weight.grad) are used directly, the meta network cannot be incorporated (through calibrated_grads) into the computational graph, since the refined gradients will block its connection.

Therefore, I use a rather roundabout way to achieve both:

  1. The refined gradients are what actually update meta_weight.
  2. calibrated_grads is incorporated into the computation so that the meta network can be updated.
    That is the reason for line 86: calibrated_grads is added into the computational graph, while its value is cancelled by "- self.calibrated_grads.data". What actually contributes to the update of meta_weight is the refined gradient (self.weight.grad.data). See the sketch after this list.
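
To make the trick concrete, here is a minimal self-contained sketch (with hypothetical tensors standing in for the repo's variables, not the actual code) showing that the calibrated gradient stays in the autograd graph while its numerical contribution is cancelled:

```python
import torch

# Hypothetical stand-ins: calibrated_grads plays the meta-net output,
# refined_grad plays the optimizer-refined gradient (e.g. from Adam).
weight = torch.randn(3, requires_grad=True)
calibrated_grads = 0.1 * weight          # stays in the autograd graph
refined_grad = torch.randn(3)
lr = 0.01

# The cancellation trick: numerically meta_weight == weight - lr * refined_grad,
# but d(meta_weight)/d(calibrated_grads) == -lr, so the base-model loss can
# still backpropagate into the meta network through calibrated_grads.
meta_weight = weight - lr * (calibrated_grads +
                             (refined_grad - calibrated_grads.detach()))

loss = meta_weight.sum()                 # stand-in for the base-model loss
loss.backward()                          # gradient flows back via calibrated_grads
print(torch.allclose(meta_weight, weight - lr * refined_grad))  # True
```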

Think of the optimizer as SGD, where the value of self.calibrated_grads is the same as self.weight.grad.data. In Adam, things are different.

For your second question: in line 86 I only add self.calibrated_grads into the computation for the meta network's update, but I still need to "actually" update the values in the base network. That is to say, line 86 does not update the real weights of the base network, so I have to update them in the main function using the refined gradients.

Indeed, it is a little bit tricky here. I hope this resolves your question; let me know if it is still confusing.

Best regards,
Shangyu

commented

Hi @csyhhu, thank you for your quick and enthusiastic response.
I get the motivation for using the two gradients:

  1. calibrated_grads for updating the meta network.
  2. The refined gradients for incorporating the optimizer's refinement.

But is there any mismatch problem?

  1. self.weight.grad.data (the gradient from the last step) vs. self.calibrated_grads (the predicted gradient for this step). This can be explained as a kind of gradient accumulation. However, is there a mismatch problem in the backward pass, as explained below?
  2. The gradient of self.weight.grad is computed from loss ---> self.calibrated_grads ---> meta-network.
    Empirically, the results in your paper show its effectiveness.

thanks,
Shuai

Hi @ShuaiZ1037 ,

I don't think it can be regarded as a mismatch, since that is how standard optimization methods work:
Step 1: Weights receive the "natural gradient", which is obtained by chain-rule backward propagation.
Step 2: The "natural gradient" is refined by the optimization algorithm (such as Adam) to get self.weight.grad.
Step 3: The actual weight update uses the refined gradient self.weight.grad.

My method simply uses the "natural gradient" to generate the meta gradient and the corresponding self.calibrated_grads, which can be regarded as a modification of Step 1. I then follow Steps 2 and 3 for the rest; a minimal sketch of the three steps is shown below.
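
For reference, a minimal sketch of those three steps with a stock PyTorch optimizer (hypothetical model and data; note that in plain PyTorch the refined gradient stays inside optimizer.step(), whereas the discussion above assumes the repo writes it back into self.weight.grad):

```python
import torch

# A hypothetical base network and data, just to walk through the three steps.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 2)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()     # Step 1: chain rule puts the "natural gradient" in param.grad
optimizer.step()    # Steps 2 & 3: Adam refines that gradient and updates the weights
```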

If you are referring to the fact that self.weight.grad comes from the loss of self.calibrated_grads instead of the true loss of the base model, that is correct and is indeed a gradient mismatch. However, the loss of self.calibrated_grads also comes from the loss of the base model.

Best regards,
Shangyu

@ShuaiZ1037 Regarding the question "Why not use calibrated_grads directly?": if SGD is used, calibrated_grads can be used directly. For other optimization methods that require refinement of the gradient, calibrated_grads needs to be further processed to follow the procedure of the corresponding optimization algorithm, as the sketch below illustrates.
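
A small numerical illustration of this point (a sketch with made-up values): under SGD the applied gradient equals the raw gradient, while Adam rescales it with its moment estimates, so calibrated_grads alone would miss the refinement.

```python
import torch

g = torch.tensor([0.5, -1.0])                 # a made-up raw (calibrated) gradient

# SGD: the refined gradient is just the raw gradient.
sgd_step = g

# Adam, first step with bias correction: the update is roughly sign(g).
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_hat = ((1 - beta1) * g) / (1 - beta1)       # bias-corrected first moment
v_hat = ((1 - beta2) * g ** 2) / (1 - beta2)  # bias-corrected second moment
adam_step = m_hat / (v_hat.sqrt() + eps)

print(sgd_step)   # tensor([ 0.5000, -1.0000])
print(adam_step)  # ~tensor([ 1., -1.]) -- clearly different from the raw gradient
```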

commented

@csyhhu Hi, thanks for your kind reply. I understand what you mean; perhaps my earlier wording was the problem.
Apologies, let me explain again:

1. My first point: at the end of iteration t, Adam or SGD produces the refined gradient, which is then used to update every parameter. At iteration t+1, that same refined gradient is used to compute self.meta_weight, which is then used in the convolution. This part is not a big issue; the key point comes next.

2. In `self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach())`, I understand that the purpose is to pass the gradient of the base model's loss to the meta-net so that the meta-net can be updated. If self.calibrated_grads were used directly, the gradient with respect to the loss would naturally flow back to the meta-net (though, as you said, the gradient would then not have gone through the refinement, e.g. Adam).

But with the code as written:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads +
          (self.weight.grad.data - self.calibrated_grads.data).detach())
```

the forward computation actually uses self.weight.grad.data to obtain the final base-model loss, while in backpropagation the gradient of the loss with respect to self.weight.grad.data is, in effect, assigned to self.calibrated_grads and then propagated to the meta-net.

However, self.weight.grad is the gradient generated by the meta-net and refined in the previous iteration, while self.calibrated_grads is the gradient generated by the meta-net in the current iteration.
So my confusion is: can the gradient of the loss with respect to self.weight.grad.data really be assigned to self.calibrated_grads?

I ran your code, and the results are as good as those reported in your paper.
I hope I have expressed my point clearly. If my confusion stems from some basic misunderstanding or from misreading your paper, please point it out!

Thanks!

@ShuaiZ1037 Thanks again for your interest and patience.

self.weight.grad is indeed the gradient generated by the meta-net and refined in the previous iteration, but self.calibrated_grads also corresponds to the previous iteration, because self.calibrated_grads is produced from the pre_quantized_weight generated in the previous iteration.

In this code: `self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach())`, both key gradients are actually produced from the previous round's data. This also matches the meaning of gradient descent: the weights used to compute the loss in this round are formed from the previous round's weights and the gradient produced in the previous round.
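
A tiny numerical sketch of that timing (with made-up values): the weights used at round t are the previous round's weights minus the learning rate times the previous round's gradient.

```python
import torch

w_prev = torch.tensor([1.0, 2.0])    # weights from round t-1
g_prev = torch.tensor([0.1, -0.2])   # gradient produced at round t-1
lr = 0.5

w_t = w_prev - lr * g_prev           # weights used to compute this round's loss
print(w_t)                           # tensor([0.9500, 2.1000])
```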

You may notice that, at line 252, the very first iteration of training does not update the network parameters; updates only begin afterwards, so there is a sort of "delayed update" flavor.

But this should not matter much for your confusion. Still, I do not quite understand your question: "Can the gradient of the loss with respect to self.weight.grad.data be assigned to self.calibrated_grads?" If you are asking whether self.weight.grad.data is assigned to self.calibrated_grads, I do not perform such an operation. self.calibrated_grads comes from pre_quantized_weight; see this function:

```python
def meta_gradient_generation(meta_net, net, meta_method, meta_hidden_state_dict=None, fix_meta=False):
```

If this still does not resolve your question, please feel free to keep asking!

Best regards,
Shangyu

commented

@csyhhu Thanks for your kind and prompt reply.
As you described: self.calibrated_grads is synthesized by the meta-net in the current iteration (t) and corresponds to the pre_quantized_weight of the previous round (t-1); likewise, self.weight.grad is the gradient generated and refined by the meta-net in the previous iteration (t-1) and corresponds to the pre_quantized_weight of the round before that (t-2). So the two gradients are not the same thing. (I am not sure whether my understanding is correct.)

What I meant by "can the gradient of the loss with respect to self.weight.grad.data be assigned to self.calibrated_grads?" is this:
the purpose of the code `self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach())`
is to pass the base model's gradient to the meta-net. In the forward pass, self.weight.grad.data takes part in the computation, but in the backward pass the gradient with respect to the loss (loosely speaking, (self.weight.grad.data).grad, which does not actually exist because of the detach) is passed (assigned) to (self.calibrated_grads).grad and then propagated to the meta-net.

Essentially, my confusion is that the two gradients are not the same thing; they are gradients from two different iteration steps. Used as in the code:
`self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach())`
the forward pass uses one gradient while the backward pass uses another. Won't the gradients in the backward pass then be mismatched? You explained the purpose of doing this, but after carefully reading the paper and the code I still do not understand why it is correct.

If my understanding is wrong, please point it out! Thanks again for your patient answers!

Thanks!

@ShuaiZ1037 Thanks for your explanation.
I understand what you mean now. After thinking it over, there does indeed seem to be a mismatch between self.weight.grad.data and self.calibrated_grads: in the current round's update, self.weight.grad.data should come from the current self.calibrated_grads, not from the previous round's self.calibrated_grads.

The correct fix should be to move lines 239-249 as a whole to above line 226, which would resolve the mismatch.

I expect the difference to be small, but I will run some experiments on this.

Thank you very much for pointing this out!

Best regards,
Shangyu

commented

@csyhhu Thank you very much for your answers.
With your explanation, my earlier questions are resolved. Thanks again for your quick and patient replies today!

Best regards!