Quantizing models with bitsandbytes
raunaks13 opened this issue
Raunak Shah commented
I had a couple of questions related to exactly how the quantization is done:
- Does bitsandbytes do the fp16 -> int8 quantization after the tensors have been transferred to the GPU? And if you want to dequantize, are those operations also done on the GPU? (See the first sketch after this list for the kind of scheme I have in mind.)
- I traced the workflow of `Linear8bitLt()`, which leads me to believe that the quantization happens in this line: https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2419. Could someone please confirm this? If not, where is the quantization occurring? (The conversion pattern I have been tracing is sketched after this list.)
- Is the quantization method absmax or zero-point, and is it done row-wise? There is some mention of column-wise features, but when I load quantized models with huggingface the scale factors seem to be different for each row, not each column.
- When you quantize a model, do you treat outliers separately as described in the LLM.int8() paper? If so, where does this happen in the source code? (My understanding of the decomposition is sketched at the end.)
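To make the quantization questions concrete, here is a minimal sketch of what I mean by row-wise absmax quantization and dequantization. This is plain PyTorch written from my reading of the paper, not the library's kernel, and the function names are my own:

```python
import torch

def absmax_quantize_rowwise(x_fp16: torch.Tensor):
    # One scale factor per row: 127 / absmax of that row.
    absmax = x_fp16.float().abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = 127.0 / absmax
    x_int8 = torch.clamp(torch.round(x_fp16.float() * scale), -127, 127).to(torch.int8)
    return x_int8, scale

def absmax_dequantize_rowwise(x_int8: torch.Tensor, scale: torch.Tensor):
    # Approximate reconstruction of the original fp16 values.
    return (x_int8.float() / scale).to(torch.float16)

# The scale factors have shape (rows, 1), i.e. one per row. If the input
# tensor lives on the GPU, every op above runs on the GPU as well.
x = torch.randn(4, 8, dtype=torch.float16)
q, s = absmax_quantize_rowwise(x)
x_hat = absmax_dequantize_rowwise(q, s)
```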
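For the `Linear8bitLt()` question, this is roughly the conversion pattern I have been tracing (I believe it is close to what huggingface does when loading with `load_in_8bit=True`, but I may be off on argument names across versions):

```python
import torch
import bitsandbytes as bnb

linear_fp16 = torch.nn.Linear(4096, 4096).half()

int8_linear = bnb.nn.Linear8bitLt(
    linear_fp16.in_features,
    linear_fp16.out_features,
    bias=linear_fp16.bias is not None,
    has_fp16_weights=False,  # keep the weights in int8 rather than fp16
    threshold=6.0,           # outlier threshold from the LLM.int8() paper
)
int8_linear.weight = bnb.nn.Int8Params(
    linear_fp16.weight.data, requires_grad=False, has_fp16_weights=False
)
int8_linear.bias = linear_fp16.bias

# Moving the module to the GPU appears to be what triggers the actual
# fp16 -> int8 conversion of the weights.
int8_linear = int8_linear.cuda()
```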
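And for the outlier question, this is my mental model of the mixed-precision decomposition from the LLM.int8() paper, reusing the absmax helper from the first sketch. Again, this is just an illustrative plain-PyTorch sketch with names I made up, not the actual CUDA implementation:

```python
import torch

def mixed_precision_matmul(x_fp16, w_fp16, threshold=6.0):
    # x_fp16: (batch, in_features), w_fp16: (in_features, out_features).
    # Assumes at least one non-outlier feature dimension.

    # Feature dimensions of x that contain at least one outlier value.
    outlier_cols = (x_fp16.abs() > threshold).any(dim=0)

    # Outlier dimensions stay in fp16.
    out_fp16 = x_fp16[:, outlier_cols] @ w_fp16[outlier_cols, :]

    # Everything else goes through row-wise absmax int8 (emulated here
    # in float; the real kernels would accumulate in int32).
    x_int8, sx = absmax_quantize_rowwise(x_fp16[:, ~outlier_cols])
    w_int8, sw = absmax_quantize_rowwise(w_fp16[~outlier_cols, :].t())
    acc = x_int8.float() @ w_int8.float().t()
    out_int8 = acc / (sx * sw.t())  # undo both scale factors

    return out_fp16 + out_int8.to(torch.float16)
```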