[Bug] int8 quantization not working except for TinyLlama
invent00 opened this issue · comments
Describe the bug
int8 inference for the following models does not work well on ver 1.1.0 (float16 inference works properly).
Test models:
google/gemma-1.1-2b-it
pankajmathur/orca_mini_3b
mistralai/Mistral-7B-Instruct-v0.2
Their int8 inference works properly on ver 1.0.0.
Are there plans to narrow down the supported models for quantization?
To Reproduce
Steps to reproduce the behavior:
- Install the v1.1.0 library via pip.
- Change model_id in the sample code.
- Run the sample code.
Expected behavior
int8 inference works properly, as on ver 1.0.0.
Desktop (please complete the following information):
- OS: Windows 11 23H2
- NPU driver version: 32.0.100.2267
- CPU: Core Ultra 5 125U
pip environment:
certifi==2024.2.2
charset-normalizer==3.3.2
colorama==0.4.6
filelock==3.14.0
fsspec==2024.5.0
huggingface-hub==0.23.2
idna==3.7
intel-npu-acceleration-library==1.1.0
intel-openmp==2021.4.0
Jinja2==3.1.4
MarkupSafe==2.1.5
mkl==2021.4.0
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.4
packaging==24.0
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.2
safetensors==0.4.3
sympy==1.12
tbb==2021.12.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.1
typing_extensions==4.12.0
urllib3==2.2.1
Let me try to reproduce it.
Try to use the latest driver version: https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html
Thank you for your quick response.
I updated the NPU driver to version 32.0.100.2408, but the same error occurs.
OK, I can reproduce the issue myself. Fixes will be part of #32, as we are going to switch from this library's very naive quantization scheme to neural-compressor.
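For context, a "naive" quantization scheme typically means symmetric per-tensor int8 scaling: pick one scale from the tensor's max absolute value, round, and clip. A single outlier weight then inflates the scale and destroys precision for all other values, which is consistent with larger models breaking while a tiny model survives. This is an illustrative sketch only, not the library's actual implementation (function names here are hypothetical):

```python
import numpy as np

def quantize_int8(w):
    # Naive symmetric per-tensor quantization: one scale for the
    # whole tensor, derived from the max absolute value.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2),
# so a large outlier (large scale) means coarse steps everywhere.
err = np.abs(w - w_hat).max()
```

Libraries like neural-compressor mitigate this with finer-grained (e.g. per-channel or per-group) scales and calibration, which is presumably why the switch in #32 addresses these models.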
Solved