Why not use calibrated_grads directly?

ShuaiZ1037 opened this issue


Hello, I am very interested in your paper, and thank you for the implementation. I have some questions about your code.
In this line:

self.meta_weight = self.weight - \
    lr * (self.calibrated_grads \
    + (self.weight.grad.data - self.calibrated_grads.data).detach())
Why not use self.calibrated_grads directly? Instead, you use the refined gradients: self.weight.grad.

Furthermore, the weights have already been updated in the main function using the refined gradients,
so I am confused about why the refined gradients are used again.

Hi @ShuaiZ1037 ,
Thanks for your interest.

For your first question: what we want here is to update the meta weights from the previous weights and the "refined gradients" (gradients after calibration as pre-processing and after refinement in the optimizer). This raises two problems:

  1. If self.calibrated_grads is used directly, the optimizer's refinement (for Adam) is not incorporated.
  2. If the refined gradients (self.weight.grad) are used directly, the meta network cannot be incorporated into the computational graph (through calibrated_grads), since the refined gradients cut that connection.

Therefore, I use a rather roundabout way to get the behavior I need:

  1. The refined gradients are what is actually used in the update of meta_weight.
  2. calibrated_grads is incorporated into the computation so that the meta network can be updated.
    That is the reason for line 86: calibrated_grads is added to the computational graph, while its value is cancelled by "- self.calibrated_grads.data". What actually contributes to the value of meta_weight is the refined gradient (self.weight.grad.data).

Think of using SGD as the optimizer, where the value of self.calibrated_grads is the same as self.weight.grad.data. With Adam, the two differ.
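The value-substitution trick described above can be sketched in a few lines of PyTorch. This is only an illustration under stated assumptions, not the repository's code: `meta_scale` is a hypothetical one-parameter stand-in for the meta network, `raw_grad` is a made-up gradient, and `refined_grad` is a made-up stand-in for the value stored in self.weight.grad.data after the optimizer's refinement.

```python
import torch

lr = 0.1
weight = torch.tensor([1.0, -2.0])

# Hypothetical stand-in for the meta network: a single learnable scale.
meta_scale = torch.tensor(0.5, requires_grad=True)
raw_grad = torch.tensor([0.4, -0.2])        # made-up "natural" gradient
calibrated_grads = meta_scale * raw_grad    # stays in the autograd graph

# Made-up stand-in for self.weight.grad.data after refinement (e.g. by Adam).
refined_grad = torch.tensor([1.0, -1.0])

# Same structure as the line under discussion: the value of calibrated_grads
# is cancelled, but its connection to the graph is kept.
meta_weight = weight - lr * (
    calibrated_grads + (refined_grad - calibrated_grads.data).detach()
)

# Using the refined gradient alone gives the same value but no graph.
direct = weight - lr * refined_grad

loss = meta_weight.sum()
loss.backward()

print(torch.allclose(meta_weight, direct))  # values agree
print(direct.requires_grad)                 # no path back to meta_scale
print(meta_scale.grad)                      # gradient reached the stand-in meta network
```

In this sketch the forward value depends only on `refined_grad`, while `loss.backward()` still produces a gradient for `meta_scale`, which is the behavior the two list items above ask for.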

For your second question: in line 86 I only add self.calibrated_grads into the computation so that the meta network can be updated. I still need to "actually" update the values in the base network. That is to say, line 86 does not update the real weights in the base network, so I have to update them in the main function using the refined gradients.

Indeed, it is a little bit tricky here. I hope this answers your question. Let me know if it is still confusing.

Best regards,


Hi @csyhhu, thank you for your enthusiastic and quick response.
I get the motivation for using the two gradients:

  1. calibrated_grads for meta-network updating.
  2. Refined gradients for incorporating refinement.

But is there any mismatch problem?

  1. self.weight.grad.data (the gradient of the last step) vs. self.calibrated_grads (the predicted gradient of this step). This can be explained as a kind of gradient accumulation. However, is there a mismatch problem in the backward pass, as explained below?
  2. The gradient with respect to self.weight.grad, computed from the loss ---> self.calibrated_grads ---> meta-network.
    Empirically, the results in your paper show its effectiveness.


Hi @ShuaiZ1037 ,

I don't think it can be regarded as a mismatch, since that is how normal optimization methods work:
Step 1: Weights receive the "natural gradient", which is obtained by chain-rule backpropagation.
Step 2: The "natural gradient" is refined by the optimization algorithm (such as Adam) to get self.weight.grad.
Step 3: The actual weight update uses the refined gradient self.weight.grad.

My method simply feeds the "natural gradient" into the meta network to generate the meta gradient and the corresponding self.calibrated_grads, which can be regarded as a modification of Step 1. I then follow Steps 2 and 3 to finish the rest.

If you are referring to the fact that self.weight.grad comes from the loss of self.calibrated_grads instead of the true loss of the base model, that is correct and is indeed a gradient mismatch. However, the loss of self.calibrated_grads also comes from the loss of the base model.

Best regards,

@ShuaiZ1037 Regarding the question "Why not use calibrated_grads directly?": if SGD is used, calibrated_grads can be used directly. For other optimization methods that require refinement of the gradient, calibrated_grads needs to be further processed following the procedure of the corresponding optimization algorithm.
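To make the SGD vs. Adam distinction concrete, here is a minimal pure-Python sketch using the textbook Adam update for a single scalar. The function names and the value `0.01` are hypothetical, chosen for illustration; how the repository writes the refined result back into self.weight.grad is not shown in this thread.

```python
import math


def sgd_direction(grad):
    """Plain SGD applies the gradient unchanged."""
    return grad


def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step for a scalar; returns (direction, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


calibrated = 0.01  # hypothetical calibrated gradient value

sgd_dir = sgd_direction(calibrated)
adam_dir, m, v = adam_direction(calibrated, m=0.0, v=0.0, t=1)

# With SGD the update direction equals the calibrated gradient itself.
# On its first step Adam rescales it to roughly grad / |grad|, so the
# applied direction is close to 1.0 here rather than 0.01.
print(sgd_dir, adam_dir)
```

Under SGD the calibrated gradient and the applied gradient coincide, so nothing is lost by using it directly; under Adam the applied value differs, which is why the refined gradient has to enter the update.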


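@csyhhu Hello, thank you for your kind reply. I understand what you mean; my earlier description may have been unclear.

2. In self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach()), I understand that the purpose is to pass the gradient of the base model's loss to the meta-net so that the meta-net gets updated. If self.calibrated_grads were used directly, the gradient of the loss would of course flow back to the meta-net naturally (although, as you said, the gradient would then not be refined (Adam)).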





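@ShuaiZ1037 Thanks again for your interest and patience.

self.weight.grad is indeed the gradient produced by the meta-net and refined in the previous iteration. self.calibrated_grads in fact also corresponds to the previous iteration's gradient, because it is produced from the pre_quantized_weight generated in the previous iteration.

In this code: self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach()), both key gradients are in fact produced by the previous round's data. This is also consistent with gradient descent: the weights used to compute the loss in this round are formed from the previous round's weights plus the gradient produced in the previous round.

You may notice that at line 252 the first iteration of training does not update the network parameters; updates only start afterwards, so it feels somewhat like a "delayed update".

But this should have little bearing on your confusion. I still do not quite understand your question: "Can the gradient of the loss with respect to self.weight.grad.data be assigned to self.calibrated_grads?" If you are asking whether self.weight.grad.data can be assigned to self.calibrated_grads, I do not seem to have such an operation. self.calibrated_grads comes from pre_quantized_weight; see: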

def meta_gradient_generation(meta_net, net, meta_method, meta_hidden_state_dict=None, fix_meta=False):





@csyhhu Thank you for your kind and quick answer.

Code: self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach())

Used this way, the forward pass uses one gradient while the backward pass uses another. Won't the gradients in backpropagation be mismatched? You explained the purpose of doing this, but after carefully reading the paper and the code I still do not understand why this treatment is correct.
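One way to read the expression, offered as a sketch rather than a statement from the paper: write sg(·) for the stop-gradient applied by .detach(), w for self.weight, η for lr, g_c for self.calibrated_grads and g_r for self.weight.grad.data. These symbols are introduced here only for illustration.

```latex
\begin{aligned}
\tilde{w} &= w - \eta\,\bigl(g_c + \operatorname{sg}(g_r - g_c)\bigr) \\
\text{forward value:}\quad \tilde{w} &= w - \eta\, g_r \\
\text{backward:}\quad \frac{\partial \tilde{w}}{\partial g_c} &= -\eta\, I
\end{aligned}
```

So the loss is evaluated at the weights produced by g_r, while backpropagation differentiates as if the step had been taken with g_c. When g_r equals g_c (plain SGD) the two views coincide. When g_r is a refined version of g_c, the exact Jacobian would be -η ∂g_r/∂g_c, and the expression replaces it with -η I, which is the forward/backward difference being asked about.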



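@ShuaiZ1037 Thank you for the explanation.
I understand what you mean now. After thinking it over, there does indeed seem to be a mismatch between self.weight.grad.data and self.calibrated_grads. In the current round's update, self.weight.grad.data should come from this round's self.calibrated_grads, not from the previous round's self.calibrated_grads.

The correct approach should be to move lines 239-249 as a whole to above line 226, which resolves the mismatch.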





@csyhhu Thank you very much for your answer.
