spcl / QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Home Page: https://arxiv.org/abs/2404.00456

How to get models with only offline rotation (or models for weight-only quantization)

Tracin opened this issue

The previous chat is here: #22

Let me describe this in more detail. The only function I call is rotate_model; I skip the layernorm fusion and activation quantization. I just save the model once the rotation has finished, and I want to make sure the two models are the 'same'.
I think I have already dealt with the two reasons you mentioned:
For reason 1, I removed the rotate_ov_proj call from rotate_model here.
For reason 2, I removed the online Hadamard for down_proj here.
Is there anything else I should do?

Thanks! @sashkboos

Thanks @Tracin for your issue

You cannot skip layernorm_fusion, because the whole offline rotation relies on having an RMSNorm without weights (see Section 3.4 in the paper).

@sashkboos Thanks! I just realized that. In X Q Diag(a) Q^T W the Diag(a) sitting between Q and Q^T prevents Q Q^T from cancelling, but after fusing Diag(a) into W we get X Q Q^T Diag(a) W, where it does cancel. That's cool.
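To see the cancellation concretely, here is a small numerical check (my own sketch, not QuaRot code), with a random orthogonal Q standing in for the Hadamard rotation: fusing Diag(a) into W before rotating leaves the output unchanged, while keeping the RMSNorm weight between Q and Q^T does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Random orthogonal rotation Q (QuaRot uses Hadamard-based rotations; any orthogonal matrix works here).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

X = rng.standard_normal((4, d))   # activations after a weightless RMSNorm
a = rng.standard_normal(d)        # RMSNorm scale that must be fused away
W = rng.standard_normal((d, d))   # next linear layer's weight, applied as X @ W

reference = X @ np.diag(a) @ W    # unrotated computation

# Case 1: RMSNorm weight left in place -> Diag(a) sits between Q and Q^T, no cancellation.
not_fused = (X @ Q) @ np.diag(a) @ (Q.T @ W)

# Case 2: Diag(a) fused into W first -> Q Q^T cancels and the result is unchanged.
fused = (X @ Q) @ (Q.T @ np.diag(a) @ W)

print(np.allclose(not_fused, reference))  # False
print(np.allclose(fused, reference))      # True
```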

@Tracin
Can you please explain in a little more detail how you altered the script to work with Llama 3 for offline weight-only quantization? If you have a script, that would be appreciated.

@telemorne Sure. First, remove all the online Hadamard operations described in this issue.
In layernorm_fusion, you need to set the LN weights to 1.0 (and the bias to 0.0, if there is one) after they are fused.
Then call save_pretrained.
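A minimal sketch of that recipe for a Hugging Face Llama checkpoint could look like the following. The module names (input_layernorm, q_proj, ...) are the standard transformers Llama attributes; fuse_ln_into_linears is a hypothetical helper, the model id and output path are placeholders, and the commented-out rotate_model call stands in for QuaRot's rotation utility (with rotate_ov_proj and the online Hadamards removed, as discussed above), whose exact signature may differ from your local copy.

```python
import torch
from transformers import LlamaForCausalLM

# Do the fusion in high precision to avoid rounding error, then cast back before saving.
model = LlamaForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float64)

def fuse_ln_into_linears(ln, linears):
    # Fold the RMSNorm scale into the input dimension of the following linear layers,
    # then reset the norm weight so the rotated model uses a weightless RMSNorm.
    for linear in linears:
        linear.weight.data *= ln.weight.data
    ln.weight.data.fill_(1.0)

for layer in model.model.layers:
    attn, mlp = layer.self_attn, layer.mlp
    fuse_ln_into_linears(layer.input_layernorm, [attn.q_proj, attn.k_proj, attn.v_proj])
    fuse_ln_into_linears(layer.post_attention_layernorm, [mlp.gate_proj, mlp.up_proj])
fuse_ln_into_linears(model.model.norm, [model.lm_head])

# rotate_model(model, args)  # QuaRot's offline rotation, with the online Hadamards removed

model.to(torch.float16)
model.save_pretrained("llama3-rotated-weight-only")
```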

@Tracin Sorry for the numerous questions. One last thing: which class did you use to load the model with the from_pretrained function (is it LlamaForCausalLM?)? I am facing some dimensionality issues.

Yes, please save the model using LlamaForCausalLM from transformers.
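For example, loading the rotated checkpoint back is the usual transformers call (the directory name is the placeholder used in the sketch above); after fusion, the RMSNorm weights should all be 1.0:

```python
from transformers import LlamaForCausalLM

# Load the rotated, weight-only checkpoint saved earlier; the path is a placeholder.
model = LlamaForCausalLM.from_pretrained("llama3-rotated-weight-only", torch_dtype="auto")

# Sanity check: the fused RMSNorm weights should now be all ones.
print(model.model.layers[0].input_layernorm.weight)
```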