[Bug] int8 quantization not working except for TinyLlama
invent00 opened this issue · comments
Describe the bug
int8 inference for the following models does not work well on ver 1.1.0 (float16 inference works properly).
Test models:
google/gemma-1.1-2b-it
pankajmathur/orca_mini_3b
mistralai/Mistral-7B-Instruct-v0.2
Their int8 inference works properly on ver 1.0.0.
Are there plans to narrow down the supported models for quantization?
To Reproduce
Steps to reproduce the behavior:
- Install the v1.1.0 library via pip.
- Change model_id in the sample code.
- Run the sample code.
Expected behavior
int8 inference works properly, as on ver 1.0.0.
Desktop (please complete the following information):
- OS: Windows 11 23H2
- NPU driver version: 32.0.100.2267
- CPU: Core Ultra 5 125U
pip environment:
certifi==2024.2.2
charset-normalizer==3.3.2
colorama==0.4.6
filelock==3.14.0
fsspec==2024.5.0
huggingface-hub==0.23.2
idna==3.7
intel-npu-acceleration-library==1.1.0
intel-openmp==2021.4.0
Jinja2==3.1.4
MarkupSafe==2.1.5
mkl==2021.4.0
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.4
packaging==24.0
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.2
safetensors==0.4.3
sympy==1.12
tbb==2021.12.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.1
typing_extensions==4.12.0
urllib3==2.2.1
Let me try to reproduce it.
Try to use the latest driver version: https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html
Thank you for your quick response.
I updated the NPU driver to version 32.0.100.2408, but the same error occurs.
OK, I can reproduce the issue myself. Fixes will be part of #32, as we are going to switch from this library's very naive quantization scheme to neural-compressor.
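For context, a "naive" quantization scheme typically means symmetric per-tensor int8 scaling: pick one scale from the tensor's max absolute value, round, and clip. A single outlier weight then inflates the scale and destroys precision for all other values, which is consistent with larger models breaking while a tiny model survives. This is an illustrative sketch only, not the library's actual implementation (function names here are hypothetical):

```python
import numpy as np

def quantize_int8(w):
    # Naive symmetric per-tensor quantization: one scale for the
    # whole tensor, derived from the max absolute value.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2),
# so a large outlier (large scale) means coarse steps everywhere.
err = np.abs(w - w_hat).max()
```

Libraries like neural-compressor mitigate this with finer-grained (e.g. per-channel or per-group) scales and calibration, which is presumably why the switch in #32 addresses these models.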
Solved