Some questions about implementation details
AndreevP opened this issue
Pavel Andreev commented
Hello, thank you for an interesting paper and nice code.
I have two questions concerning implementation details.
- Does the "one-by-one" block reconstruction mentioned in the paper mean that the input to each block comes from the already quantized preceding blocks, i.e. each block can correct quantization errors introduced by the previous blocks? Or is the input to each block collected from the full-precision model?
- Am I correct in my understanding that in the block-wise reconstruction objective you use the gradients for each object in the calibration set independently (i.e. no gradient averaging or the like, such as in the Adam optimizer mentioned in the paper)? Also, what is happening here in data_utils.py — why do you add 1.0 to the gradients?
cached_grads = cached_grads.abs() + 1.0
# scaling to make sure its mean is 1
# cached_grads = cached_grads * torch.sqrt(cached_grads.numel() / cached_grads.pow(2).sum())
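For context, here is a minimal sketch of how a per-element gradient weighting of this form could enter a block-wise reconstruction loss. This is only my reading of the snippet, not the repository's actual code: the function name and shapes are illustrative, and the Fisher-style squared-gradient weighting is an assumption. The `+ 1.0` offset would keep every element's weight at least 1, so elements with near-zero gradient still contribute an ordinary MSE term instead of being ignored.

```python
import torch

def weighted_block_loss(quant_out, fp_out, cached_grads):
    """Hypothetical gradient-weighted block reconstruction loss.

    quant_out:    output of the quantized block
    fp_out:       output of the full-precision block (reconstruction target)
    cached_grads: gradients of the task loss w.r.t. the block output,
                  cached per calibration sample
    """
    # Offset by 1.0 so the weight of every element is at least 1;
    # zero-gradient elements then reduce to a plain MSE contribution.
    w = cached_grads.abs() + 1.0
    # Fisher-style weighting: scale the squared error by squared gradients.
    return ((quant_out - fp_out).pow(2) * w.pow(2)).mean()

# Toy usage with random tensors.
q = torch.randn(4, 8)
f = torch.randn(4, 8)
g = torch.zeros(4, 8)  # zero gradients -> all weights are 1 -> plain MSE
loss = weighted_block_loss(q, f, g)
mse = (q - f).pow(2).mean()
```

With all-zero cached gradients the weights collapse to 1 and `loss` equals the unweighted MSE, which is one plausible rationale for the offset.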
Thank you for your time and consideration!
LIU, Shih-Yang commented
Hi, I also found point 2 confusing. Have you figured out the rationale behind it?