intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Home Page: https://intel.github.io/neural-compressor/

How to perform int8 quantisation (not uint8) using ONNX?

paul-ang opened this issue

Hi team, I am having an issue quantizing a network consisting of Conv and Linear layers with int8 weights and activations in ONNX. I have tried setting this via op_type_dict, but it doesn't work: the activations still use uint8. I am using Neural Compressor version 2.3.1.
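For reference, a minimal sketch of the kind of op_type_dict attempt described above, assuming static post-training quantization of an ONNX model. The model path, the calibration dataloader, and the MatMul entry (Linear layers usually export to MatMul/Gemm nodes) are assumptions for illustration, not details from the issue:

```python
# Sketch only: "model.onnx" and calib_dataloader are placeholders.
from neural_compressor import PostTrainingQuantConfig, quantization

op_type_dict = {
    "Conv": {
        "weight": {"dtype": ["int8"]},
        "activation": {"dtype": ["int8"]},  # request signed int8 activations
    },
    "MatMul": {
        "weight": {"dtype": ["int8"]},
        "activation": {"dtype": ["int8"]},
    },
}

config = PostTrainingQuantConfig(approach="static", op_type_dict=op_type_dict)

# With the default 2.3.1 ONNX Runtime capabilities, the int8 activation request
# is not honored and activations fall back to uint8, which is the behavior
# reported in this issue.
q_model = quantization.fit("model.onnx", config, calib_dataloader=calib_dataloader)
q_model.save("model_int8.onnx")
```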

Hi @paul-ang, we only support U8S8 by default because, on x86-64 machines with the AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. Sorry, to use S8S8 you currently need to update the code yourself: add 'int8' to the activations' dtype list in https://github.com/intel/neural-compressor/blob/master/neural_compressor/adaptor/onnxrt.yaml.
We will enhance this in our 3.0 API.
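To illustrate the workaround, the edit is roughly of this shape. The fragment below is not the literal contents of onnxrt.yaml (its layout varies between releases); the point is simply to add 'int8' to the activation dtype list for the op types being quantized:

```yaml
# Illustrative fragment only, not the actual file layout: in
# neural_compressor/adaptor/onnxrt.yaml, extend the activation dtype list
# of the relevant ops (e.g. Conv, MatMul).
'Conv': {
  'weight':     {'dtype': ['int8']},
  'activation': {'dtype': ['uint8', 'int8']}   # was ['uint8']; adding 'int8' enables S8S8
}
```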