horseee / LLM-Pruner

[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support LLaMA, Llama-2, BLOOM, Vicuna, Baichuan, etc.

Home Page: https://arxiv.org/abs/2305.11627

Calculating Importance of 'param_mix'

kiucho opened this issue

Hello. First of all, thank you for sharing great research.

I have a question about calculating the importance of parameters.

In the TaylorImportance class in hf_llama_pruner.py, line 274:

  1. Could you please tell me why the importance for the mixed-order case is calculated as follows:

salience = salience - 0.5 * layer.weight * layer.weight.acc_grad * layer.weight

(and not as the sum of the 1st- and 2nd-order terms)?

  2. Are higher-order terms neglected?

Hi kiucho,

  • For Question 1:

The derivation of this is (e.g., for Eq. 5 in the paper):

[image: Taylor-expansion derivation of the importance]

where

[image: definition of the terms in the expansion]

And thus, here, the second (Hessian) term is subtracted from the first-order term.
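
For readers without the images, the expansion being referenced looks roughly like the following (a sketch; the notation may differ from the paper, and reading acc_grad as the accumulated squared gradient / Fisher-style estimate of the Hessian diagonal is an assumption of this sketch):

```latex
% Sketch of the second-order Taylor expansion behind Eq. 5
% (notation may differ slightly from the paper).
% H_{kk} is the Hessian diagonal; in the code it is assumed to be
% approximated by the accumulated squared gradient (acc_grad).
\[
\Delta\mathcal{L}
  = \mathcal{L}_{W_k} - \mathcal{L}_{W_k = 0}
  \approx \underbrace{\frac{\partial \mathcal{L}}{\partial W_k}\, W_k}_{\text{first-order term}}
  \;-\; \underbrace{\frac{1}{2}\, W_k\, H_{kk}\, W_k}_{\text{second-order (Hessian) term}}
  \;+\; \mathcal{O}\!\left(\lVert W_k \rVert^{3}\right),
\qquad
H_{kk} \approx \sum_{i}\left(\frac{\partial \mathcal{L}_i}{\partial W_k}\right)^{2}
\]
```

Evaluating the loss at W_k = 0 and taking the difference leaves the first-order term minus half the Hessian term, which is exactly what the param_mix line computes.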

There is a mistake in the first version of our paper, so please refer to our code. We have uploaded a new version of the paper to arXiv (I'm not sure exactly when it will be released, but I expect it to be available within the next 24 hours).

  • For Question 2:

Yes, we can neglect the higher-order terms because their impact is negligible: they are small in scale compared to the preceding terms. This is primarily because the first-order term always dominates, given that the model is consistently not fully converged on our calibration samples (as evidenced by the large loss observed during the pruning process).
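
To make the computation concrete, here is a minimal sketch of the per-parameter salience under that truncated expansion. The names weight, grad, and acc_grad mirror the line quoted in the question; treating acc_grad as the accumulated squared gradient (a stand-in for the Hessian diagonal) is an assumption of this sketch, and the function is an illustration rather than the exact code in hf_llama_pruner.py.

```python
import torch


def param_mix_salience(weight: torch.Tensor,
                       grad: torch.Tensor,
                       acc_grad: torch.Tensor) -> torch.Tensor:
    """Per-parameter importance from the truncated Taylor expansion.

    weight   : the parameter W_k
    grad     : dL/dW_k accumulated over the calibration samples
    acc_grad : accumulated squared gradient, used here as a Fisher-style
               approximation of the Hessian diagonal (assumed semantics)
    """
    # First-order term: (dL/dW) * W
    salience = weight * grad
    # Subtract the second-order (Hessian) term, 0.5 * W * H_diag * W,
    # matching the line quoted in the question; higher-order terms of the
    # expansion are dropped.
    salience = salience - 0.5 * weight * acc_grad * weight
    return salience
```

In the pruner itself, these per-parameter scores are then aggregated along the pruned dimension into group-level importances, but that step is separate from the expansion discussed here.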

Thank you for your kind explanation. I checked the new version of your paper. Thanks once again.