sdc17 / UPop

[ICML 2023] UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers.

Home Page: https://dachuanshi.com/UPop-Project/

Question about accumulated gradients metric

Hambaobao opened this issue

Dear author,

Hello, I have read your paper and code. UPop uses the cumulative gradient of the mask as the metric for evaluating weight importance. However, I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important? Is there any related research supporting this, or is it based on intuition? Could you please provide some clarification?

Thank you.

By the way, could you please explain the 'compression_weight' in the code? Why is 'compression_weight' for attention set to 36? Does this number have any special significance?

Hi, @Hambaobao

> I don't understand why UPop prunes the parts with large cumulative gradients. Does it mean that the parts with larger cumulative gradients are less important?

UPop prunes the parts whose learnable masks $\zeta$ have large cumulative gradients. These masks are initialized to ones, and their $l_{1}$-norms are added to the loss as extra terms that drive them toward smaller values:

$$ \mathcal{L} = \mathcal{L}_{O} + w_a\sum\nolimits_{\zeta_{i} \in \zeta_a} \lVert \zeta_{i} \rVert_{1} + w_m\sum\nolimits_{\zeta_{i} \in \zeta_m} \lVert \zeta_{i} \rVert_{1} $$

With a regular optimizer, this loss makes the masks $\zeta$ corresponding to the unimportant parts smaller. However, that does not meet our requirement, which is to freely control the mask values at each iteration $t$. To this end, we update the masks with a custom rule, so the mask values can no longer serve as a metric of importance: they are determined by the custom rule itself. Their gradients, however, are still computed normally by PyTorch's autograd engine, so it is natural to use the gradients as the metric of importance instead.
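To make the link between "large accumulated gradient" and "unimportant" concrete, here is a toy sketch in plain Python. It is not UPop's code: the linear model, the data, and the names `accumulate_mask_gradients`, `features`, `targets`, and `l1_weight` are all assumptions made for illustration, and the gradients are written out by hand rather than taken from autograd. The point it shows is that a mask with a large positive gradient is the one a gradient-descent step would shrink fastest, even when the mask values themselves are never touched by an optimizer.

```python
def sign(value):
    """Subgradient of |value|: +1, -1, or 0."""
    return (value > 0) - (value < 0)


def accumulate_mask_gradients(features, targets, weights, l1_weight, steps):
    """Accumulate dL/dzeta for the toy model pred = sum_i zeta_i * w_i * x_i.

    L = sum over samples of (pred - y)^2 + l1_weight * sum_i |zeta_i|
    """
    n = len(weights)
    masks = [1.0] * n          # masks start at one, as described in the reply
    accumulated = [0.0] * n    # running sum of the mask gradients
    for _ in range(steps):
        # Gradient of the l1 penalty with respect to each mask.
        grads = [l1_weight * sign(m) for m in masks]
        for x, y in zip(features, targets):
            pred = sum(m * w * xi for m, w, xi in zip(masks, weights, x))
            err = pred - y
            for i in range(n):
                # Gradient of the squared error with respect to mask i.
                grads[i] += 2.0 * err * weights[i] * x[i]
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        # No optimizer step on the masks: a custom schedule would set their
        # values here. This sketch simply leaves them at one.
    return accumulated


# Hypothetical data: the target depends on unit 0 only; unit 1 adds noise.
features = [(1.0, 0.5), (2.0, -0.5), (-1.0, 0.5), (-2.0, -0.5)]
targets = [1.0, 2.0, -1.0, -2.0]

scores = accumulate_mask_gradients(
    features, targets, weights=[1.0, 1.0], l1_weight=0.1, steps=3
)
# The unit with the largest accumulated gradient is the first pruning candidate.
prune_first = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setting, unit 1 only hurts the prediction, so lowering its mask lowers the loss and its accumulated gradient ends up much larger than that of unit 0.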

> Is there any related research supporting this, or is it based on intuition?

For the use of gradients as a metric of importance, you may refer to this paper, although its motivation and specific use of gradients are quite different from ours.

> could you please explain the compression_weight in the code? Why is compression_weight for attention set to 36?

compression_weight is used for unified ranking across different structures. The scale factor 36 is determined by the shapes of the masks for the different structures, i.e., their granularity. More specifically, compression_weight for attention is set to 36 because each position in the learnable mask $\zeta_a$ corresponds to 36 rows/columns in the attention weights:

$$ 36 = 12 \ \text{(number of heads)} \times [1 \ \text{(weights of query)} + 1 \ \text{(weights of key)} + 1 \ \text{(weights of value)}] $$

In contrast, each position in the learnable mask $\zeta_m$ corresponds to 1 row/column in the FFN weights.

Thanks for your questions. We will add some comments about this to the code.
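As an illustration only, the sketch below shows one way a per-position weight such as 36 versus 1 could enter a unified ranking. It is an assumption about the general idea, not the repository's implementation: the function `select_positions_to_prune`, its parameters, and the example scores are hypothetical, and the scores of the two structures are assumed to be directly comparable already. Positions are ranked together by score, and each pruned position is charged the number of weight rows/columns it covers until a target compression ratio is reached.

```python
def select_positions_to_prune(attn_scores, ffn_scores, target_ratio,
                              attn_cost=36, ffn_cost=1):
    """Rank attention and FFN mask positions together and pick what to prune.

    Each attention position is charged attn_cost rows/columns and each FFN
    position ffn_cost, so the budget is counted in weight rows/columns
    rather than in mask positions.
    """
    entries = [(s, attn_cost, ("attn", i)) for i, s in enumerate(attn_scores)]
    entries += [(s, ffn_cost, ("ffn", i)) for i, s in enumerate(ffn_scores)]
    # Largest score first: those positions are pruned first.
    entries.sort(key=lambda entry: entry[0], reverse=True)

    total_cost = sum(cost for _, cost, _ in entries)
    budget = target_ratio * total_cost

    pruned, spent = [], 0
    for _, cost, position in entries:
        if spent + cost > budget:
            break  # stop at the first position that would exceed the budget
        pruned.append(position)
        spent += cost
    return pruned, spent / total_cost


# Hypothetical scores: 2 attention positions and 4 FFN positions.
pruned, achieved_ratio = select_positions_to_prune(
    attn_scores=[0.9, 0.1],
    ffn_scores=[0.8, 0.7, 0.2, 0.05],
    target_ratio=0.5,
)
```

With these made-up numbers, a single attention position uses up 36 of the 38 rows/columns in the budget, which is why counting mask positions without such a weight would misjudge how much is actually being compressed.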

Thank you very much for your detailed reply, it has truly been a great help to me.