TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Home Page: https://huggingface.co/docs/bitsandbytes/main/en/index

Dequantizing int8 models to fp16

raunaks13 opened this issue · comments

I have loaded an LLM in Hugging Face with load_in_8bit=True.
I noticed that the entries in the state_dict are named something like

  1. model.layers.18.self_attn.k_proj.weight
  2. model.layers.18.self_attn.k_proj.SCB
  3. model.layers.18.self_attn.k_proj.weight_format

The SCB and weight_format entries are present only in the quantized model. I think SCB refers to the scale and bias that can help us recreate the original tensor? weight_format is a string that says "row". The Hugging Face integration guide mentions a .CB field in addition to the .SCB field, but I could not find it in the state_dict. Perhaps the codebase has changed since that guide was written?
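
For reference, here is a minimal sketch of how I inspected these entries (the checkpoint name is just an illustrative placeholder; any model loadable with load_in_8bit=True should behave similarly):

```python
import torch
from transformers import AutoModelForCausalLM

# Load a checkpoint in 8-bit (checkpoint name is a placeholder, not a recommendation).
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    load_in_8bit=True,
    device_map="auto",
)

# Print the quantization-related entries for one projection layer.
sd = model.state_dict()
for key, value in sd.items():
    if "layers.18.self_attn.k_proj" in key:
        if torch.is_tensor(value):
            print(key, value.dtype, tuple(value.shape))
        else:
            print(key, value)  # e.g. weight_format shows up as the string "row"
```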

Anyway, I am not sure about the exact method for dequantizing the tensor to get back the original, but I tried the following:
(weight_SCB.unsqueeze(1) * weight) / 127
This gives a tensor that is close to the original model's weights (what I get without load_in_8bit=True), but not identical.
I am not sure whether this is the correct approach for dequantization. It would be great if someone could point me to code or documentation on how to recreate the exact original tensor from these weights.
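
For concreteness, here is a minimal sketch of that formula applied to the state_dict entries listed above. It assumes SCB holds the per-row absmax scale, which is how row-wise absmax quantization is usually described; the key names follow the earlier list, and `model` is the 8-bit model loaded as in the previous snippet:

```python
import torch

def dequantize_int8(weight_int8: torch.Tensor, scb: torch.Tensor) -> torch.Tensor:
    """Row-wise dequantization: if each row was quantized as
    round(127 * row / absmax(row)) and SCB stores absmax per row,
    the approximate original is weight * SCB / 127."""
    # Compute in float32 to avoid fp16 overflow in the intermediate product.
    return (weight_int8.float() * scb.float().unsqueeze(1) / 127).to(torch.float16)

sd = model.state_dict()  # `model` loaded with load_in_8bit=True as above
w_int8 = sd["model.layers.18.self_attn.k_proj.weight"]  # torch.int8
scb = sd["model.layers.18.self_attn.k_proj.SCB"]        # per-row scale
w_fp16 = dequantize_int8(w_int8, scb)
```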

As a follow-up question: I know that for some models there are outlier values that are left unquantized even though the other values in the tensor are quantized. However, I could not find this information in the state_dict. How can we find and handle these values during the dequantization process?

Hello, I came across the issue you posted about tensor dequantization and was wondering if there has been any progress since it was initially raised. Specifically, are you still using the method (weight_SCB.unsqueeze(1) * weight)/127 for dequantizing the tensor? If so, has it proved to be effective, or have you encountered any problems? I am facing a similar challenge and any insights you could share would be greatly appreciated.

I think dequantizing the tensor to recover the exact original is not possible, since the quantization process involves rounding. Based on the results I have seen in papers, however, I don't think the error matters much when the dequantized weights are used.
I haven't tested the impact of using the above formula on training/inference myself.
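
To illustrate the rounding point, here is a minimal round-trip sketch under the assumption of row-wise absmax quantization (random weights, standalone, not tied to any particular checkpoint):

```python
import torch

torch.manual_seed(0)
w = torch.randn(4096, 4096, dtype=torch.float16)

# Quantize: scale each row so its absmax maps to 127, then round to int8.
absmax = w.float().abs().max(dim=1).values               # per-row scale (what SCB would hold)
w_int8 = (w.float() * 127 / absmax.unsqueeze(1)).round().clamp(-127, 127).to(torch.int8)

# Dequantize with the formula from the thread and measure the residual error.
w_deq = (w_int8.float() * absmax.unsqueeze(1) / 127).to(torch.float16)
err = (w.float() - w_deq.float()).abs()
print(f"max abs error:  {err.max().item():.6f}")
print(f"mean abs error: {err.mean().item():.6f}")
```

The residual per element is bounded by half a quantization step (roughly absmax/254 for each row), which matches the observation above that the dequantized tensor is close to, but not exactly, the original.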

Thanks for the quick reply! I totally agree that perfect dequantization isn’t possible due to rounding during quantization. I’m planning to test the method (weight_SCB.unsqueeze(1) * weight)/127 you mentioned to see how it goes in practice. Appreciate your insights!