intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

[Bug] int8 quantization not working except for TinyLlama

invent00 opened this issue

Describe the bug
int8 inference for the following models does not work on v1.1.0
(float16 inference works properly)

test models:
google/gemma-1.1-2b-it
pankajmathur/orca_mini_3b
mistralai/Mistral-7B-Instruct-v0.2

Their int8 inference works properly on v1.0.0.
Are there plans to narrow down the models supported for quantization?

To Reproduce
Steps to reproduce the behavior:

  1. Install the v1.1 library via pip.
  2. Change model_id in the sample code.
  3. Run the sample code (see the sketch below).
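
For reference, a minimal sketch of the reproduction, modeled on the library's README sample (the exact sample code is not quoted in this issue, so treat the details below as assumptions); only model_id changes from the TinyLlama default:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import intel_npu_acceleration_library
import torch

# Any of the affected models can be substituted here, e.g.
# "pankajmathur/orca_mini_3b" or "mistralai/Mistral-7B-Instruct-v0.2".
model_id = "google/gemma-1.1-2b-it"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Compile the model for the NPU with int8 weights; per the report,
# dtype=torch.float16 works properly for the same models.
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

prompt = "What is the meaning of life?"
inputs = tokenizer(prompt, return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=64, streamer=streamer)
```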

Expected behavior
int8 inference works properly, as it does on v1.0.0.

Screenshots
[screenshot of the error attached]

Desktop:

  • OS: Windows 11 23H2
  • NPU driver version: 32.0.100.2267
  • CPU: Core Ultra 5 125U

pip environment:

certifi==2024.2.2
charset-normalizer==3.3.2
colorama==0.4.6
filelock==3.14.0
fsspec==2024.5.0
huggingface-hub==0.23.2
idna==3.7
intel-npu-acceleration-library==1.1.0
intel-openmp==2021.4.0
Jinja2==3.1.4
MarkupSafe==2.1.5
mkl==2021.4.0
mpmath==1.3.0
networkx==3.2.1
numpy==1.26.4
packaging==24.0
PyYAML==6.0.1
regex==2024.5.15
requests==2.32.2
safetensors==0.4.3
sympy==1.12
tbb==2021.12.0
tokenizers==0.19.1
torch==2.3.0
tqdm==4.66.4
transformers==4.41.1
typing_extensions==4.12.0
urllib3==2.2.1

Let me try to reproduce it.

Thank you for your quick response.

I updated the NPU driver to version 32.0.100.2408, but the same error occurs.

OK, I can reproduce the issue myself. Fixes will be part of #32, as we are going to switch from this library's very naive quantization scheme to neural-compressor.
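
To illustrate what a naive scheme can get wrong on larger models (my interpretation of the likely failure mode, not confirmed in this thread): with a single per-tensor scale, one outlier weight inflates the scale and crushes the int8 resolution of every other value, whereas per-channel scales of the kind calibration-based quantizers such as neural-compressor produce confine the damage to one row. A minimal, self-contained sketch:

```python
import torch

def per_tensor_int8(weight: torch.Tensor):
    # Naive scheme: one scale for the whole tensor. A single outlier
    # inflates the scale and wastes int8 resolution everywhere else.
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def per_channel_int8(weight: torch.Tensor):
    # One scale per output channel: an outlier only affects its own row.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(1024, 1024)
w[0, 0] = 50.0  # one outlier weight, as often seen in larger LLMs

for name, quantize in [("per-tensor", per_tensor_int8),
                       ("per-channel", per_channel_int8)]:
    q, scale = quantize(w)
    error = (q.float() * scale - w).abs().mean()
    print(f"{name}: mean abs reconstruction error = {error:.5f}")
```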