intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Home Page: https://intel.github.io/neural-compressor/


[RFC] HuggingFace compatible yet flexible WeightOnlyQuantization format for IPEX and INC

ftian1 opened this issue · comments

This RFC is to propose a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, and then the model quantized by INC can be loaded by IPEX for further inference optimization.

Feature, Motivation and Pitch

As we know, WOQ is getting more and more attention from the industry. There are already many quantized WOQ models, like Llama-2-7B-Chat-GPTQ, whose format is becoming the de facto standard WOQ storage format. Therefore, we propose a Hugging Face-compatible, yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and also easily extend to new WOQ algorithms in the future, which may keep improving the accuracy of LLMs.

Design

The WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the one below:
(image: checkpoint file layout on the Hugging Face model hub)

The user needs a quantization_config to know which group_size, desc_act, and sym were used when generating such a WOQ model; however, this information can also be derived from the content of the WOQ checkpoint.
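For reference, a GPTQ-style quantization_config typically carries fields like the following; the concrete values here are made up purely for illustration:

# Hypothetical GPTQ-style quantization_config, as found in a model's config.json
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,            # weight precision
    "group_size": 128,    # number of input channels sharing one scale/zero point
    "desc_act": True,     # act-order: channels quantized in order of decreasing activation importance
    "sym": False,         # asymmetric quantization, so zero points are stored
}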

So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts:

  1. checkpoint attributes like packed weight, scale, zero_points, group_idx (de facto standard in HuggingFace WOQ models)
  2. how the packed weight gets compressed, like the compression dimension and zero point dimension (hardcoded in HuggingFace WOQ models, but INC can be more flexible when generating such packed models)

NOTE: The fields marked in bold are the ones missing in the current IPEX code.
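For concreteness, the per-layer entries in such a checkpoint look roughly like the sketch below (key names follow the ones used later in this RFC; the module path is just a hypothetical example):

# Sketch of the per-layer tensors in a HF GPTQ-style WOQ checkpoint
# (the module path "model.layers.0.self_attn.q_proj" is illustrative only):
#
#   ...q_proj.qweight : int32, the 4-bit weights packed eight per int32
#   ...q_proj.qzeros  : int32, the packed per-group zero points
#   ...q_proj.scales  : fp16,  one scale per (group, output channel)
#   ...q_proj.g_idx   : int32, the group index of each input channel
#   ...q_proj.bias    : fp16  (if present)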

In the industry, the common practice is to save the first part into the model checkpoint. For the second part, output channel as the compression dimension and input channel as the zero point dimension is the default behavior. INC extends the second part to also support input channel as the compression dimension and output channel as the zero point dimension. This extension can be converted to follow the default dimensions.
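To make the second part concrete, here is a small sketch of how the choice of compression dim and zero point dim changes the packed shapes, assuming 4-bit values packed eight per int32 (shapes are illustrative only):

in_features, out_features, bits, group_size = 4096, 4096, 4, 128
pack = 32 // bits  # eight 4-bit values per int32

# packed weight shape depends on the compression dim:
#   along the input channel : [in_features // pack, out_features]
print((in_features // pack, out_features))                 # (512, 4096)
#   along the output channel: [in_features, out_features // pack]
print((in_features, out_features // pack))                 # (4096, 512)

# packed zero points (one per group and output channel before packing) depend on the zero point dim:
#   along the output channel: [in_features // group_size, out_features // pack]
print((in_features // group_size, out_features // pack))   # (32, 512)
#   along the input channel : [in_features // group_size // pack, out_features]
print((in_features // group_size // pack, out_features))   # (4, 4096)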

Solutions

Solution 1 (Recommended)

Enhance INC to export the converted model format which can be identified by current IPEX implementation.

### INC export interface
def export_compressed_model(woq_model, ipex_format=True):
    ### convert the WOQ model into an IPEX-compatible one
    #
    # the converted checkpoint attributes include:
    #
    # 1. 'qweight' -> 'packed_weight'
    # 2. 'scales'  -> 'scale'
    # 3. 'qzeros'  -> 'packed_zp'
    # 4. 'g_idx' carried over to support HF GPTQ models
    ...

### Usage from User View ###
from neural_compressor import export_compressed_model
compressed_model = export_compressed_model('TheBloke/Llama-2-7B-Chat-GPTQ', ipex_format=True)
torch.save(compressed_model.state_dict(), "/path/to/model.pt")

import intel_extension_for_pytorch as ipex
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
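# ipex_woq_model: presumably the original (unquantized) HF model that the saved WOQ checkpoint corresponds to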
ipex.optimize_transformers(ipex_woq_model.eval(), quantization_config=qconfig, low_precision_checkpoint='/path/to/model.pt')

This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks g_idx support when group_size != -1, as well as the corresponding kernel. This is an existing feature gap in IPEX.

Internally, INC will convert compression_dim and zp_dim to the default format supported by IPEX, i.e., compressing the weight along the input channel and storing the zero point along the output channel.
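A rough sketch of what that internal conversion could look like (hypothetical helper, not the actual INC code): unpack the int4 values along the stored dimension and repack them along the dimension IPEX expects.

import torch

def repack_int4(packed: torch.Tensor, src_dim: int, dst_dim: int) -> torch.Tensor:
    """Unpack eight 4-bit values per int32 along src_dim, then pack them again along dst_dim.
    Purely illustrative; real kernels keep data packed and avoid this round trip."""
    shifts = torch.arange(8, dtype=torch.int32) * 4
    # unpack along src_dim
    x = packed.transpose(src_dim, -1)
    x = ((x.unsqueeze(-1) >> shifts) & 0xF).flatten(-2)
    x = x.transpose(src_dim, -1)
    # repack along dst_dim
    x = x.transpose(dst_dim, -1).contiguous()
    x = x.view(*x.shape[:-1], -1, 8)
    out = torch.zeros(x.shape[:-1], dtype=torch.int32)
    for i in range(8):
        # the top nibble may wrap the int32 sign bit, which is fine for bit packing
        out |= x[..., i] << (4 * i)
    return out.transpose(dst_dim, -1).contiguous()

# e.g. a weight packed along the output channel ([in, out // 8]) repacked
# along the input channel ([in // 8, out]):
qweight = torch.randint(0, 2**31 - 1, (4096, 512), dtype=torch.int32)
qweight_repacked = repack_int4(qweight, src_dim=1, dst_dim=0)  # shape [512, 4096]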

Solution 2

Enhance IPEX to be directly compatible with latest & popular WOQ model format.

class IpexWoqLinear(nn.Module):
    @classmethod
    def from_float_and_int4_weight(
        cls, mod, qweight, scales, zero_points, bias=None, group_size=-1,
        group_idx=-1, compression_dim=0, zero_point_dim=1  ### new args to be supported by IPEX
    ):
        ...

DEFAULT_LOWP_CHECKPOINT_CONFIG = {
    "name": "default",
    "weight_key": "packed_weight",  ### need to be updated as 'qweight'
    "scale_key": "scale",           ### need to be updated as 'scales'
    "zero_point_key": "packed_zp",  ### need to be updated as 'qzeros'
    "bias_key": "bias",
    "g_idx_key": "g_idx"            ### new attribute to be supported
} 

### Usage from User View ###
model.load_state_dict(torch.load(PATH))
model.eval()
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
optimized_model = ipex.optimize_transformers(model, quantization_config=qconfig, low_precision_checkpoint='/path/to/woq/checkpoint')

In this solution, IPEX needs to be updated to work with the latest and most popular WOQ formats in the industry.

@ftian1 Thanks for the RFC.
I'd like to learn more about the "standard" woq format in huggingface before making a judgement call. Do you have document pointers to its format? What are the exact semantics for each field, in particular, the "group_idx", "compression dim" and "zero point dim"? Can you provide some examples for the standard format? You also mentioned that INC has some extension to the format. Appreciate if you can explain the semantics with examples too.

@jgong5

"compression dim" means the 4bits weight compression direction, either along output channel or along input channel.

"zero point dim" means the 4bits zero point compression direction, same with "compression dim" it can be output channel or input channel.

as for "g_idx", it's used by GPTQ algorithm to record/restore shuffled group in input channel during quantization.
image
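To make the g_idx semantics a bit more concrete, here is a small sketch, assuming the Optimum/AutoGPTQ convention where g_idx holds, for each input channel, the index of the group it belongs to:

import torch

in_features, group_size = 512, 128

# without act-order (desc_act=False), consecutive input channels share a group:
g_idx = torch.arange(in_features, dtype=torch.int32) // group_size
# -> tensor([0, 0, ..., 0, 1, 1, ..., 3])

# with act-order (desc_act=True), GPTQ processes channels in a shuffled order,
# so neighbouring input channels may belong to different groups:
perm = torch.randperm(in_features)  # stand-in for the activation-based ordering
g_idx_act_order = (torch.argsort(perm) // group_size).to(torch.int32)

# during dequantization, input channel k uses scales[g_idx[k]] and zero_points[g_idx[k]]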

"compression dim" means the 4bits weight compression direction, either along output channel or along input channel.

"zero point dim" means the 4bits zero point compression direction, same with "compression dim" it can be output channel or input channel.

Do these packed dims imply the layout of the weights? For example, if the compression dim is along the output channel, does it require the weights to be contiguous along the output dim, with the weight laid out as KxN? Also, what is the benefit of allowing different compression directions?

as for "g_idx", it's used by GPTQ algorithm to record/restore shuffled group in input channel during quantization.

Not sure if I understand correctly: can we shuffle the scales/zps so that g_idx is always sorted, and thus drop the additional g_idx field?

Per offline discussion, we decided to take Solution 2, i.e., IPEX will be directly compatible with the HF format, and INC will also generate the HF-compatible format in export_compressed_model().

Hi @ftian1. We have a question about g_idx. We know that g_idx is used to index the scales and zero points of groups. For example, we have weight shape = [64, 512] and group size = 128. So, there are 4 groups along the input channel. The scales/zero points shapes are [64, 4] and the g_idx shape is also [64, 4].

Given g_idx[0] = [1, 0, 3, 2], it means that group i of output channel 0 should use the scales and zero points given by scales[0][g_idx[0][i]] and zero_points[0][g_idx[0][i]], as listed below:

  • group 0: scales[0][1] and zero_points[0][1]
  • group 1: scales[0][0] and zero_points[0][0]
  • group 2: scales[0][3] and zero_points[0][3]
  • group 3: scales[0][2] and zero_points[0][2]

And the same goes for other output channels.
Our question is: can we shuffle the scales and zero points so that they are in the same order as groups? For example,

scales_shuffled = torch.empty_like(scales)
scales_shuffled[0][0] = scales[0][1]
scales_shuffled[0][1] = scales[0][0]
scales_shuffled[0][2] = scales[0][3]
scales_shuffled[0][3] = scales[0][2]

With the shuffled scales and zero points, g_idx is not needed anymore for computation. Is it doable? Are there any concerns about this? Thanks!

@Xia-Weiwen Thanks for raising that.
First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

@Xia-Weiwen Thanks for raising that. First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

Hi @xin3he Thanks a lot for the explanation. Did you mean that all input channels are shuffled within an output channel? How to get the correct scale and zero point? scales[g_idx/group_size]?
And still the question: what difference does it make? Can we shuffle it back or not?

I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing matmul. This design from GPTQ aims to improve accuracy.
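For illustration, the per-channel gather that g_idx implies during dequantization can be sketched like this (unpacked tensors are used to keep the example short; shapes follow the layout discussed above):

import torch

in_features, out_features, group_size = 512, 64, 128
n_groups = in_features // group_size

w_q    = torch.randint(0, 16, (in_features, out_features))   # unpacked 4-bit weight values
scales = torch.rand(n_groups, out_features)
zps    = torch.randint(0, 16, (n_groups, out_features))
g_idx  = torch.randint(0, n_groups, (in_features,))          # group of every input channel

# dequantize: each input channel k picks row g_idx[k] of the scales / zero points
w_deq = (w_q - zps[g_idx]) * scales[g_idx]                    # [in_features, out_features]

# equivalently, one can sort the input channels by g_idx, dequantize group-contiguously,
# and scatter them back, which is the reshuffle/shuffle-back flow described above.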

First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

@xin3he So g_idx has the same shape as the weight? Would the overhead of loading g_idx be too big here? Suppose we have a group size of 64; each value in g_idx then has to be at least 4-bit, making g_idx the same size as the weight? Or did I miss anything?

Oh, sorry, my mistake @jgong5. The shape of g_idx is [512], not [64, 512]. Only the input channel is shuffled. The dtype of g_idx is torch.int32 in Optimum.
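For scale, the storage overhead of a per-input-channel int32 g_idx is small; a quick back-of-the-envelope check with the shapes from this thread and a more typical LLM layer:

# toy example from this thread: weight [64, 512] at 4 bits
weight_bytes = 64 * 512 * 4 // 8        # 16384 bytes (~16 KiB)
g_idx_bytes  = 512 * 4                  # int32 per input channel -> 2048 bytes (~12.5% here)

# a more typical LLM layer, e.g. [4096, 4096] at 4 bits
weight_bytes_large = 4096 * 4096 * 4 // 8   # ~8 MiB
g_idx_bytes_large  = 4096 * 4               # 16 KiB, about 0.2% of the packed weight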

I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing matmul. This design from GPTQ aims to improve accuracy.

@xin3he Sorry. I don't understand. Why can't we shuffle input channels back to their original order after GPTQ and discard g_idx completely?

The quantization flow of weights using g_idx
(image: quantization flow of weights using g_idx)
@Xia-Weiwen Hope this figure can help you understand it.

@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?

@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?

Yes, exactly.