spcl / QuaRot

Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.

Home Page: https://arxiv.org/abs/2404.00456

Questions related to compiling QuaRot on CPU and model saving

HuangOwen opened this issue

Thanks for your awesome work! I have a few questions:

  1. Is it possible to compile QuaRot without CUDA? I know the fast Hadamard kernel requires a GPU and helps preserve efficiency, but I can remove all the online Hadamard operations and apply only the weight modifications. Is it possible to compile a CPU-only version?
  2. After rotating a Hugging Face model, for example llama2-7b-hf (by calling rotation_utils.fuse_layer_norms(model) and rotation_utils.rotate_model(model, args)), I want to save the rotated model without quantization using model.save_pretrained(save_path). However, when I load it back, input_layernorm.weight has a different shape and I can no longer load or use the model. I understand this is because the RMSNorm in the original Llama and in the rotated Llama are different. Is there a way to save and load the rotated model with Hugging Face?

Looking forward to your replies!

@HuangOwen

Thank you so much for your issue.

Is it possible to compile QuaRot without CUDA? I know the fast Hadamard kernel requires a GPU and helps preserve efficiency, but I can remove all the online Hadamard operations and apply only the weight modifications. Is it possible to compile a CPU-only version?

Yes. If you remove all the online Hadamards, you will no longer need the fast-hadamard-transform repo. In addition, you can use this function to apply the Hadamard transform (although it will be slow), which should be fine on CPU.
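For reference, here is a minimal CPU-only sketch of folding a normalized Hadamard rotation into a weight matrix offline. It is not QuaRot's own helper: the `rotate_weight_cpu` name, the power-of-two assumption, and the use of `scipy.linalg.hadamard` are illustrative choices.

```python
# Minimal sketch (not QuaRot's implementation): fold a normalized Hadamard
# rotation into a linear layer's weight offline, so no GPU kernel is needed.
import torch
from scipy.linalg import hadamard

def rotate_weight_cpu(weight: torch.Tensor) -> torch.Tensor:
    """Right-multiply a (out_features, in_features) weight by H / sqrt(n)."""
    n = weight.shape[-1]
    # The plain Sylvester construction only exists for power-of-two sizes.
    assert (n & (n - 1)) == 0, "dimension must be a power of two"
    H = torch.from_numpy(hadamard(n)).to(weight.dtype) / (n ** 0.5)
    return weight @ H

# Example: rotate a single linear layer's weight in place.
layer = torch.nn.Linear(4096, 4096, bias=False)
layer.weight.data = rotate_weight_cpu(layer.weight.data)
```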

After rotating a Hugging Face model, for example llama2-7b-hf (by calling rotation_utils.fuse_layer_norms(model) and rotation_utils.rotate_model(model, args)), I want to save the rotated model without quantization using model.save_pretrained(save_path). However, when I load it back, input_layernorm.weight has a different shape and I can no longer load or use the model. I understand this is because the RMSNorm in the original Llama and in the rotated Llama are different. Is there a way to save and load the rotated model with Hugging Face?

I think you can solve this by first rotating the model (so its modules have the rotated shapes) and then loading the saved checkpoint into it.
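A rough sketch of that workflow follows. It assumes `rotation_utils` is importable from the repo's fake_quant code, and the model name, checkpoint path, and `args` namespace are placeholders for whatever you used when rotating originally.

```python
# Hedged sketch of the suggested workaround: rebuild the rotated architecture
# from the original weights, then overwrite it with the saved rotated weights.
import torch
from transformers import AutoModelForCausalLM
import rotation_utils  # from the QuaRot fake_quant code; adjust the import to your checkout

def load_rotated(checkpoint_path: str, args):
    """`args` should be the same argparse namespace used when rotating originally;
    `checkpoint_path` is a placeholder for wherever save_pretrained wrote the weights."""
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                 torch_dtype=torch.float16)
    rotation_utils.fuse_layer_norms(model)    # same preprocessing as before saving
    rotation_utils.rotate_model(model, args)  # gives the modules their rotated shapes
    # Shapes now match the rotated checkpoint, so the weights load cleanly.
    # (Use the safetensors loader instead if save_pretrained wrote .safetensors shards.)
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)
    return model
```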