spcl / QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Home Page: https://arxiv.org/abs/2404.00456

Questions related to compiling QuaRot on CPU and model saving

HuangOwen opened this issue

Thanks for your awesome work! I have a few questions:

  1. Is it possible to compile QuaRot without CUDA? I know the fast Hadamard kernel requires a GPU and helps preserve efficiency, but I can remove all the online Hadamard operations and apply only the weight modifications. Is it possible to compile a CPU-only version?
  2. After rotating a Hugging Face model, for example llama2-7b-hf (by calling rotation_utils.fuse_layer_norms(model) and rotation_utils.rotate_model(model, args)), I want to save the rotated model without quantization using model.save_pretrained(save_path). However, when I load it back, input_layernorm.weight has a different shape and I can no longer load or use the model. I understand this is because the RMSNorm in the original Llama and in the rotated Llama are different. Is there a way to save and load the rotated model with Hugging Face?

Looking forward to your replies!

@HuangOwen

Thank you so much for your issue.

Is it possible to compile QuaRot without CUDA? I know the fast Hadamard kernel requires a GPU and helps preserve efficiency, but I can remove all the online Hadamard operations and apply only the weight modifications. Is it possible to compile a CPU-only version?

Yes. If you remove all the online Hadamards, you will no longer need the fast-hadamard-transform repo. In addition, you can use this function to apply the Hadamard transform (although it will be slow), which should be fine on CPU.
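For reference, here is a minimal CPU-only sketch of folding a normalized Hadamard rotation into a weight matrix offline. It is not QuaRot's own helper: the `rotate_weight_cpu` name, the power-of-two assumption, and the use of `scipy.linalg.hadamard` are illustrative choices.

```python
# Minimal sketch (not QuaRot's implementation): fold a normalized Hadamard
# rotation into a linear layer's weight offline, so no GPU kernel is needed.
import torch
from scipy.linalg import hadamard

def rotate_weight_cpu(weight: torch.Tensor) -> torch.Tensor:
    """Right-multiply a (out_features, in_features) weight by H / sqrt(n)."""
    n = weight.shape[-1]
    # The plain Sylvester construction only exists for power-of-two sizes.
    assert (n & (n - 1)) == 0, "dimension must be a power of two"
    H = torch.from_numpy(hadamard(n)).to(weight.dtype) / (n ** 0.5)
    return weight @ H

# Example: rotate a single linear layer's weight in place.
layer = torch.nn.Linear(4096, 4096, bias=False)
layer.weight.data = rotate_weight_cpu(layer.weight.data)
```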

After rotating a Hugging Face model, for example llama2-7b-hf (by calling rotation_utils.fuse_layer_norms(model) and rotation_utils.rotate_model(model, args)), I want to save the rotated model without quantization using model.save_pretrained(save_path). However, when I load it back, input_layernorm.weight has a different shape and I can no longer load or use the model. I understand this is because the RMSNorm in the original Llama and in the rotated Llama are different. Is there a way to save and load the rotated model with Hugging Face?

I think you can solve this by first rotating the model (so its modules have the rotated shapes) and then loading the saved checkpoint into it.
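A rough sketch of that workflow follows. It assumes `rotation_utils` is importable from the repo's fake_quant code, and the model name, checkpoint path, and `args` namespace are placeholders for whatever you used when rotating originally.

```python
# Hedged sketch of the suggested workaround: rebuild the rotated architecture
# from the original weights, then overwrite it with the saved rotated weights.
import torch
from transformers import AutoModelForCausalLM
import rotation_utils  # from the QuaRot fake_quant code; adjust the import to your checkout

def load_rotated(checkpoint_path: str, args):
    """`args` should be the same argparse namespace used when rotating originally;
    `checkpoint_path` is a placeholder for wherever save_pretrained wrote the weights."""
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                 torch_dtype=torch.float16)
    rotation_utils.fuse_layer_norms(model)    # same preprocessing as before saving
    rotation_utils.rotate_model(model, args)  # gives the modules their rotated shapes
    # Shapes now match the rotated checkpoint, so the weights load cleanly.
    # (Use the safetensors loader instead if save_pretrained wrote .safetensors shards.)
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)
    return model
```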