intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Home Page: https://intel.github.io/neural-compressor/


[RFC] HuggingFace compatible yet flexible WeightOnlyQuantization format for IPEX and INC

ftian1 opened this issue · comments

This RFC is to propose a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, and then the model quantized by INC can be loaded by IPEX for further inference optimization.

Feature, Motivation and Pitch

As we know, WOQ is getting more and more attention from the industry. There are already many quantized WOQ models, like Llama-2-7B-Chat-GPTQ, whose format is becoming the de facto standard WOQ storage format. Therefore, we propose a Hugging Face-compatible, yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and also easily extend to new WOQ algorithms in the future, which may keep improving the accuracy of LLMs.

Design

The WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the one below:
(image: checkpoint file layout on the Hugging Face model hub)

The user needs a quantization_config to know which group_size, desc_act, and sym were used when generating such a WOQ model; however, this information can also be derived from the content of the WOQ checkpoint.
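For reference, a GPTQ-style quantization_config typically carries fields like the following; the concrete values here are made up purely for illustration:

# Hypothetical GPTQ-style quantization_config, as found in a model's config.json
quantization_config = {
    "quant_method": "gptq",
    "bits": 4,            # weight precision
    "group_size": 128,    # number of input channels sharing one scale/zero point
    "desc_act": True,     # act-order: channels quantized in order of decreasing activation importance
    "sym": False,         # asymmetric quantization, so zero points are stored
}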

So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts:

  1. checkpoint attributes like packed weight, scale, zero_points, group_idx (de facto standard in HuggingFace WOQ models)
  2. how the packed weight gets compressed, like the compression dimension and zero point dimension (hardcoded in HuggingFace WOQ models, but INC can be more flexible when generating such packed models)

NOTE: The fields marked in bold are the ones missing in the current IPEX code.
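For concreteness, the per-layer entries in such a checkpoint look roughly like the sketch below (key names follow the ones used later in this RFC; the module path is just a hypothetical example):

# Sketch of the per-layer tensors in a HF GPTQ-style WOQ checkpoint
# (the module path "model.layers.0.self_attn.q_proj" is illustrative only):
#
#   ...q_proj.qweight : int32, the 4-bit weights packed eight per int32
#   ...q_proj.qzeros  : int32, the packed per-group zero points
#   ...q_proj.scales  : fp16,  one scale per (group, output channel)
#   ...q_proj.g_idx   : int32, the group index of each input channel
#   ...q_proj.bias    : fp16  (if present)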

In the industry, the common practice is to save the first part into the model checkpoint. For the second part, output channel as the compression dimension and input channel as the zero point dimension is the default behavior. INC extends the second part to also support input channel as the compression dimension and output channel as the zero point dimension. This extension can be converted to follow the default dimensions.
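To make the second part concrete, here is a small sketch of how the choice of compression dim and zero point dim changes the packed shapes, assuming 4-bit values packed eight per int32 (shapes are illustrative only):

in_features, out_features, bits, group_size = 4096, 4096, 4, 128
pack = 32 // bits  # eight 4-bit values per int32

# packed weight shape depends on the compression dim:
#   along the input channel : [in_features // pack, out_features]
print((in_features // pack, out_features))                 # (512, 4096)
#   along the output channel: [in_features, out_features // pack]
print((in_features, out_features // pack))                 # (4096, 512)

# packed zero points (one per group and output channel before packing) depend on the zero point dim:
#   along the output channel: [in_features // group_size, out_features // pack]
print((in_features // group_size, out_features // pack))   # (32, 512)
#   along the input channel : [in_features // group_size // pack, out_features]
print((in_features // group_size // pack, out_features))   # (4, 4096)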

Solutions

Solution 1 (Recommended)

Enhance INC to export the converted model format which can be identified by current IPEX implementation.

### INC export interface
def export_compressed_model(woq_model, ipex_format=True):
    ### convert the WOQ model into an IPEX-compatible one
    #
    # the converted checkpoint attributes include:
    #
    # 1. 'qweight' -> 'packed_weight'
    # 2. 'scales'  -> 'scale'
    # 3. 'qzeros'  -> 'packed_zp'
    # 4. 'g_idx' carried over to support HF GPTQ models
    ...

### Usage from User View ###
from neural_compressor import export_compressed_model
compressed_model = export_compressed_model('TheBloke/Llama-2-7B-Chat-GPTQ', ipex_format=True)
torch.save(compressed_model.state_dict(), "/path/to/model.pt")

import intel_extension_for_pytorch as ipex
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
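# ipex_woq_model: presumably the original (unquantized) HF model that the saved WOQ checkpoint corresponds to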
ipex.optimize_transformers(ipex_woq_model.eval(), quantization_config=qconfig, low_precision_checkpoint='/path/to/model.pt')

This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks g_idx support when group_size != -1, as well as the corresponding kernel. This is an existing feature gap in IPEX.

Internally, INC will convert compression_dim and zp_dim to the default format supported by IPEX, i.e., compressing the weight along the input channel and storing the zero point along the output channel.
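A rough sketch of what that internal conversion could look like (hypothetical helper, not the actual INC code): unpack the int4 values along the stored dimension and repack them along the dimension IPEX expects.

import torch

def repack_int4(packed: torch.Tensor, src_dim: int, dst_dim: int) -> torch.Tensor:
    """Unpack eight 4-bit values per int32 along src_dim, then pack them again along dst_dim.
    Purely illustrative; real kernels keep data packed and avoid this round trip."""
    shifts = torch.arange(8, dtype=torch.int32) * 4
    # unpack along src_dim
    x = packed.transpose(src_dim, -1)
    x = ((x.unsqueeze(-1) >> shifts) & 0xF).flatten(-2)
    x = x.transpose(src_dim, -1)
    # repack along dst_dim
    x = x.transpose(dst_dim, -1).contiguous()
    x = x.view(*x.shape[:-1], -1, 8)
    out = torch.zeros(x.shape[:-1], dtype=torch.int32)
    for i in range(8):
        # the top nibble may wrap the int32 sign bit, which is fine for bit packing
        out |= x[..., i] << (4 * i)
    return out.transpose(dst_dim, -1).contiguous()

# e.g. a weight packed along the output channel ([in, out // 8]) repacked
# along the input channel ([in // 8, out]):
qweight = torch.randint(0, 2**31 - 1, (4096, 512), dtype=torch.int32)
qweight_repacked = repack_int4(qweight, src_dim=1, dst_dim=0)  # shape [512, 4096]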

Solution 2

Enhance IPEX to be directly compatible with latest & popular WOQ model format.

class IpexWoqLinear(nn.Module):
    @classmethod
    def from_float_and_int4_weight(
        cls, mod, qweight, scales, zero_points, bias=None, group_size=-1,
        group_idx=-1, compression_dim=0, zero_point_dim=1  ### new args to be supported by IPEX
    ):
        ...

DEFAULT_LOWP_CHECKPOINT_CONFIG = {
    "name": "default",
    "weight_key": "packed_weight",  ### need to be updated as 'qweight'
    "scale_key": "scale",           ### need to be updated as 'scales'
    "zero_point_key": "packed_zp",  ### need to be updated as 'qzeros'
    "bias_key": "bias",
    "g_idx_key": "g_idx"            ### new attribute to be supported
} 

### Usage from User View ###
model.load_state_dict(torch.load(PATH))
model.eval()
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
optimized_model = ipex.optimize_transformers(model, quantization_config=qconfig, low_precision_checkpoint='/path/to/woq/checkpoint')

In this solution, IPEX needs to be updated to work with the latest and most popular WOQ formats in the industry.

@ftian1 Thanks for the RFC.
I'd like to learn more about the "standard" woq format in huggingface before making a judgement call. Do you have document pointers to its format? What are the exact semantics for each field, in particular, the "group_idx", "compression dim" and "zero point dim"? Can you provide some examples for the standard format? You also mentioned that INC has some extension to the format. Appreciate if you can explain the semantics with examples too.

@jgong5

"compression dim" means the 4bits weight compression direction, either along output channel or along input channel.

"zero point dim" means the 4bits zero point compression direction, same with "compression dim" it can be output channel or input channel.

as for "g_idx", it's used by GPTQ algorithm to record/restore shuffled group in input channel during quantization.
image
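To make the g_idx semantics a bit more concrete, here is a small sketch, assuming the Optimum/AutoGPTQ convention where g_idx holds, for each input channel, the index of the group it belongs to:

import torch

in_features, group_size = 512, 128

# without act-order (desc_act=False), consecutive input channels share a group:
g_idx = torch.arange(in_features, dtype=torch.int32) // group_size
# -> tensor([0, 0, ..., 0, 1, 1, ..., 3])

# with act-order (desc_act=True), GPTQ processes channels in a shuffled order,
# so neighbouring input channels may belong to different groups:
perm = torch.randperm(in_features)  # stand-in for the activation-based ordering
g_idx_act_order = (torch.argsort(perm) // group_size).to(torch.int32)

# during dequantization, input channel k uses scales[g_idx[k]] and zero_points[g_idx[k]]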

"compression dim" means the 4bits weight compression direction, either along output channel or along input channel.

"zero point dim" means the 4bits zero point compression direction, same with "compression dim" it can be output channel or input channel.

Do these packed dims imply the layout of the weights? For example, if the compression dim is along the output channel, does it require the weights to be contiguous along the output dim, with the weight laid out as KxN? Also, what is the benefit of allowing different compression directions?

as for "g_idx", it's used by GPTQ algorithm to record/restore shuffled group in input channel during quantization.

Not sure if I understand correctly: can we shuffle the scales/zps so that g_idx is always sorted, and thus drop the additional g_idx field?

Per offline discussion, we decided to take Solution 2, i.e., IPEX will be directly compatible with the HF format, and INC will also generate the HF-compatible format in export_compressed_model().

Hi @ftian1. We have a question about g_idx. We know that g_idx is used to index the scales and zero points of groups. For example, we have weight shape = [64, 512] and group size = 128. So, there are 4 groups along the input channel. The scales/zero points shapes are [64, 4] and the g_idx shape is also [64, 4].

Given g_idx[0] = [1, 0, 3, 2], it means that group i of output channel 0 should use the scales and zero points given by scales[0][g_idx[0][i]] and zero_points[0][g_idx[0][i]], as listed below:

  • group 0: scales[0][1] and zero_points[0][1]
  • group 1: scales[0][0] and zero_points[0][0]
  • group 2: scales[0][3] and zero_points[0][3]
  • group 3: scales[0][2] and zero_points[0][2]

And the same goes for other output channels.
Our question is: can we shuffle the scales and zero points so that they are in the same order as groups? For example,

scales_shuffled = torch.empty_like(scales)
scales_shuffled[0][0] = scales[0][1]
scales_shuffled[0][1] = scales[0][0]
scales_shuffled[0][2] = scales[0][3]
scales_shuffled[0][3] = scales[0][2]

With the shuffled scales and zero points, g_idx is not needed anymore for computation. Is it doable? Are there any concerns about this? Thanks!

@Xia-Weiwen Thanks for raising that.
First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

@Xia-Weiwen Thanks for raising that. First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

Hi @xin3he Thanks a lot for the explanation. Did you mean that all input channels are shuffled within an output channel? How to get the correct scale and zero point? scales[g_idx/group_size]?
And still the question: what difference does it make? Can we shuffle it back or not?

I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing matmul. This design from GPTQ aims to improve accuracy.
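For illustration, the per-channel gather that g_idx implies during dequantization can be sketched like this (unpacked tensors are used to keep the example short; shapes follow the layout discussed above):

import torch

in_features, out_features, group_size = 512, 64, 128
n_groups = in_features // group_size

w_q    = torch.randint(0, 16, (in_features, out_features))   # unpacked 4-bit weight values
scales = torch.rand(n_groups, out_features)
zps    = torch.randint(0, 16, (n_groups, out_features))
g_idx  = torch.randint(0, n_groups, (in_features,))          # group of every input channel

# dequantize: each input channel k picks row g_idx[k] of the scales / zero points
w_deq = (w_q - zps[g_idx]) * scales[g_idx]                    # [in_features, out_features]

# equivalently, one can sort the input channels by g_idx, dequantize group-contiguously,
# and scatter them back, which is the reshuffle/shuffle-back flow described above.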

First, I want to correct the shape of g_idx: it's [64, 512], while the values are from [0, 1, 2, 3]. This means that each channel has a g_idx entry indicating which group it belongs to. In your understanding, the channels are shuffled by group; in fact, they are shuffled by channel.

@xin3he So g_idx has the same shape as the weight? Would the overhead of loading g_idx be too big here? Suppose we have a group size of 64; each value in g_idx then has to be at least 4-bit, making g_idx the same size as the weight? Or did I miss anything?

Oh, sorry, my mistake @jgong5. The shape of g_idx is [512], not [64, 512]. Only the input channel is shuffled. The dtype of g_idx is torch.int32 in Optimum.
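For scale, the storage overhead of a per-input-channel int32 g_idx is small; a quick back-of-the-envelope check with the shapes from this thread and a more typical LLM layer:

# toy example from this thread: weight [64, 512] at 4 bits
weight_bytes = 64 * 512 * 4 // 8        # 16384 bytes (~16 KiB)
g_idx_bytes  = 512 * 4                  # int32 per input channel -> 2048 bytes (~12.5% here)

# a more typical LLM layer, e.g. [4096, 4096] at 4 bits
weight_bytes_large = 4096 * 4096 * 4 // 8   # ~8 MiB
g_idx_bytes_large  = 4096 * 4               # 16 KiB, about 0.2% of the packed weight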

I think we may need to reshuffle the channels based on g_idx before performing dequantization, then shuffle them back before performing matmul. This design from GPTQ aims to improve accuracy.

@xin3he Sorry. I don't understand. Why can't we shuffle input channels back to their original order after GPTQ and discard g_idx completely?

The quantization flow of weights using g_idx
(image: quantization flow of weights using g_idx)
@Xia-Weiwen Hope this figure can help you understand it.

@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?

@xin3he Thanks for the figure. It is much clearer now. So, we cannot get rid of g_idx. The final weights have input channels in the original order, but they belong to different groups now. Correct?

Yes, exactly.