Question about accumulated gradients metric
Hambaobao opened this issue
Dear author,
Hello, I have read your paper and code. UPop uses the cumulative gradient of the mask as a metric to evaluate weight importance. However, I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important? Is there any related research supporting this, or is it based on intuition? Could you please provide some clarification?
Thank you.
By the way, could you please explain the 'compression_weight' in the code? Why is 'compression_weight' for attention set to 36? Does this number have any special significance?
Hi, @Hambaobao
I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important?
UPop prunes parts with large cumulative gradients of corresponding learnable masks
, which makes masks
Is there any related research supporting this, or is it based on intuition?
For using gradients as a metric of importance, you may refer to this paper, but their motivations and specific uses of gradients are quite different.
could you please explain the compression_weight in the code? Why is compression_weight for attention set to 36
compression_weight is used for unified ranking on different structures. And the scale factor 36 is determined by the shape of the masks for the different structures, i.e. their granularity. More specifically, the reason why compression_weight for attention is set to 36 is that each position in learnable mask
, while each position in learnable mask
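As a rough illustration of what "unified ranking on different structures" with a per-structure weight could look like, here is a minimal sketch in plain Python. It is not UPop's actual code: the function name `select_pruned`, the toy scores, and the weight of 3 used in the example call are all assumptions made for illustration. The sketch assumes that every mask position has a score (for example, an accumulated gradient), that positions from both structures are sorted together, and that each position counts toward the pruning budget in proportion to its weight.

```python
def select_pruned(attn_scores, mlp_scores, attn_weight=36, target_ratio=0.5):
    """Hypothetical helper: rank attention and MLP mask positions together.

    Each attention position counts `attn_weight` units toward the budget,
    each MLP position counts 1 unit. Positions with the largest scores are
    selected first until the weighted budget would be exceeded.
    """
    # Tag every position with its structure, index, score, and weight.
    entries = [("attn", i, s, attn_weight) for i, s in enumerate(attn_scores)]
    entries += [("mlp", i, s, 1) for i, s in enumerate(mlp_scores)]

    # One ranking across both structures, largest score first.
    entries.sort(key=lambda e: e[2], reverse=True)

    # Budget is a fraction of the total weighted size.
    budget = target_ratio * sum(e[3] for e in entries)

    pruned, removed = [], 0
    for kind, idx, _score, weight in entries:
        if removed + weight > budget:
            break  # stop once the next position would exceed the budget
        pruned.append((kind, idx))
        removed += weight
    return pruned, removed


# Toy example with made-up scores and a small weight for readability.
pruned, removed = select_pruned(
    attn_scores=[0.9, 0.1],
    mlp_scores=[0.8, 0.7, 0.2, 0.05],
    attn_weight=3,
    target_ratio=0.5,
)
print(pruned, removed)  # [('attn', 0), ('mlp', 0), ('mlp', 1)] 5
```

In this toy run the total weighted size is 2 * 3 + 4 = 10, so the budget is 5. Selecting one attention position uses 3 units, which is why only two MLP positions fit afterwards. Without the weight, a coarse-grained position would count the same as a fine-grained one in the shared ranking.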
Thank you very much for your detailed reply, it has truly been a great help to me.