intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


Failed to run Phi-3-mini-4k-instruct (int4) on Windows

xyang2013 opened this issue · comments

Meteor Lake 155H
16GB
Windows 11

Code

from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
# Quantize to int4 and compile the model for the NPU
model = intel_npu_acceleration_library.compile(model, dtype=intel_npu_acceleration_library.int4)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

# Sampling-based generation, streamed token by token
generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

Error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[20], line 2
      1 print("Run inference")
----> 2 _ = model.generate(**generation_kwargs)

File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\transformers\generation\utils.py:1758, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1750     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1751         input_ids=input_ids,
   1752         expand_size=generation_config.num_return_sequences,
   1753         is_encoder_decoder=self.config.is_encoder_decoder,
   1754         **model_kwargs,
   1755     )
   1757     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 1758     result = self._sample(
   1759         input_ids,
   1760         logits_processor=prepared_logits_processor,
   1761         logits_warper=prepared_logits_warper,
   1762         stopping_criteria=prepared_stopping_criteria,
...
File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\intel_npu_acceleration_library\backend\factory.py:147
--> 147     backend_lib.compile(self._mm, output_node)
    148     self.output_shape = self.get_output_tensor_shape()
    149     if len(self.output_shape) != 2:

OSError: [WinError -529697949] Windows Error 0xe06d7363

I just ran your code with no problem. What driver version do you have?

I recently installed Windows 11 and always keep it up to date.

Intel(R) AI Boost
Driver version: 31.0.100.1688

Thank you.

That is very old; it is no surprise it doesn't work, as int4 support for this library was only enabled in the latest driver release (32.0.100.2408).

Please install the new version: https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html
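For anyone hitting this, here is a minimal sketch of how you could check which NPU driver is installed before running. It assumes Windows with PowerShell available; get_npu_driver_version is a hypothetical helper written for this comment, not part of this library's API:

# Hypothetical helper: query the installed "Intel(R) AI Boost" (NPU) driver
# version via PowerShell/CIM. Generic OS tooling, not a library function.
import subprocess

def get_npu_driver_version():
    """Return the NPU driver version string, or None if not found."""
    cmd = [
        "powershell", "-NoProfile", "-Command",
        "(Get-CimInstance Win32_PnPSignedDriver | "
        "Where-Object { $_.DeviceName -like '*AI Boost*' }).DriverVersion",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    return out or None

version = get_npu_driver_version()
print(f"Installed NPU driver: {version}")
# int4 needs 32.0.100.2408 or newer; compare the numeric fields
if version and tuple(map(int, version.split("."))) < (32, 0, 100, 2408):
    print("Driver too old for int4 - please update.")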

Thank you, I'll try it now. Is there any way to ensure Microsoft provides up-to-date drivers via Windows Update? If this happened to me, it will be the case for many people, since I already try to keep the system up to date.

It runs, but running on the NPU is very slow. I tried to run the same model (not sure about its quantization) via Ollama on the CPU, and it is very fast. Could you explain why that is the case? Based on the information about Lunar Lake, the CPU only has 5 TOPS, which is significantly less than the 10 TOPS NPU of Meteor Lake, so I am not sure why. Is there a developer guide for the NPU?

I experimented with it a bit. I found that the battery power profile affects the performance of the NPU. Is there a guideline on how the NPU should be used? Thank you.
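For anyone reproducing this, a minimal sketch (assuming Windows) that prints the active power scheme with the built-in powercfg tool before benchmarking, so results on battery vs. plugged-in profiles can be told apart; this is generic OS tooling, not part of intel_npu_acceleration_library:

# Check the active Windows power scheme; a battery-saver profile can
# throttle the NPU and skew any timing numbers.
import subprocess

result = subprocess.run(
    ["powercfg", "/getactivescheme"], capture_output=True, text=True
)
print(result.stdout.strip())
# Example output:
# Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced)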

> I experimented with it a bit. I found that the battery power profile affects the performance of the NPU. Is there a guideline on how the NPU should be used? Thank you.

Yes, and the CPU SKU also makes a lot of difference. Also, we have just started working on pushing performance for these models much higher. Compared to a C++ implementation like Ollama, which has heavily hand-optimized custom kernels, here we use a PyTorch eager-mode backend that is less efficient. We also have a bunch of work in flight, in the driver, in OpenVINO, and in this library, to improve performance, such as using OV remote tensors (which will slash the runtime overhead that is slowing this library down), as well as driver improvements.

Also, to understand LLM performance, please have a look at this document: https://intel.github.io/intel-npu-acceleration-library/llm_performance.html
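To put numbers on the CPU vs. NPU comparison, here is a rough sketch that measures decode throughput in tokens per second. It reuses model and prefix from the script above and assumes greedy decoding so repeated runs are comparable:

# Rough tokens/second measurement for the compiled model.
import time

n_new = 128
start = time.perf_counter()
out = model.generate(
    input_ids=prefix,
    do_sample=False,       # deterministic, so runs are comparable
    max_new_tokens=n_new,
)
elapsed = time.perf_counter() - start
generated = out.shape[1] - prefix.shape[1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")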

> Based on the information about Lunar Lake, the CPU only has 5 TOPS, which is significantly less than the 10 TOPS NPU of Meteor Lake.

Lunar Lake has not been released yet, though...

> Thank you, I'll try it now. Is there any way to ensure Microsoft provides up-to-date drivers via Windows Update? If this happened to me, it will be the case for many people, since I already try to keep the system up to date.

I agree. The Windows updates are pushed by OEMs (Asus, Dell, etc.), so there is no guarantee when they will be available. But we can check which driver is installed and notify the user if there is something odd there.

Thanks. The other question I have is whether "unified memory" can be supported in the future?

Sorry, I'm not allowed to discuss future products.