Error occurs when pruning LLaMa2-7b
moonlightian opened this issue · comments
With a cmd like:

```shell
CUDA_VISIBLE_DEVICES=0 python hf_prune.py --base_model path_to_cached_hf_llama2-7b --pruning_ratio 0.25 --device cpu --eval_device cuda --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 --block_attention_layer_start 4 --block_attention_layer_end 30 --pruner_type taylor --test_after_train --taylor param_first --save_model
```
it throws an error: `"addmm_impl_cpu_" not implemented for 'Half'`.

Environment:
- torch==2.0.0
- transformers==4.31.0
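For context, the error comes from PyTorch itself rather than the pruning code: in torch 2.0 the CPU backend has no half-precision `addmm` kernel, so any fp16 `nn.Linear` evaluated on `cpu` fails with this message. A minimal sketch of the usual workaround (not the repo's code) is to cast to float32 for the CPU-side computation:

```python
import torch

# fp16 matmul kernels may be missing on CPU (e.g. torch 2.0), raising:
#   RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
# A common workaround is to cast to float32 before running on CPU.
x = torch.randn(2, 4, dtype=torch.float16)
w = torch.randn(3, 4, dtype=torch.float16)

# Cast inputs (or the whole model, via model.float()) to fp32 on CPU:
y = torch.nn.functional.linear(x.float(), w.float())
print(tuple(y.shape))  # (2, 3)
```

The same idea applies at the model level: `model.float()` before CPU evaluation, or keep `--device cuda` so the fp16 kernels are available.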
Hi. Did you modify the code for loading Llama2?
Ahh.. The model code for Llama2 needs to be modified to satisfy the updated attributes. Some of the dimension calculations are hard-coded in the official code, which makes it unsuitable for inference with the pruned model.
Two ways to solve this bug:
- Modify the fixed attributes in modeling_llama.py. The problematic attributes are `self.num_heads` and `self.num_key_value_heads`; you can recompute them manually from the pruned projection weights (below is an example):

```python
for layer in model.model.layers:
    layer.self_attn.num_heads = layer.self_attn.q_proj.weight.data.shape[0] // layer.self_attn.head_dim
    layer.self_attn.num_key_value_heads = layer.self_attn.k_proj.weight.data.shape[0] // layer.self_attn.head_dim
```
- Use the code in this repo to load the model. I'm not sure why loading the model was unsuccessful. If possible, could you post the error message here?
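To illustrate why the first fix is needed, here is a small self-contained sketch (toy dimensions, not the repo's code): attention reshapes the `q_proj` output into `(num_heads, head_dim)`, so if rows of `q_proj` are pruned but `num_heads` still holds the original config value, the reshape fails; recomputing the head count from the pruned weight restores consistency.

```python
import torch

# Toy setup: 4 heads of dim 8, so q_proj projects to 32 features.
head_dim, num_heads = 8, 4
q_proj = torch.nn.Linear(32, num_heads * head_dim, bias=False)

# Prune one whole head (8 output rows) from q_proj:
q_proj.weight.data = q_proj.weight.data[: 3 * head_dim]
q_proj.out_features = 3 * head_dim

x = torch.randn(1, 5, 32)
q = q_proj(x)  # shape (1, 5, 24) after pruning
try:
    q.view(1, 5, num_heads, head_dim)  # stale num_heads -> shape mismatch
except RuntimeError:
    print("stale head count fails")

# Recompute the head count from the pruned weight, as in the fix above:
num_heads = q_proj.weight.data.shape[0] // head_dim
print(q.view(1, 5, num_heads, head_dim).shape)  # torch.Size([1, 5, 3, 8])
```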
Thank you for your kind advice! It finally worked.