intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


Failed to run Phi-3-mini-4k-instruct (int4) on Windows

xyang2013 opened this issue · comments

Meteor Lake 155H
16GB
Windows 11

Code

from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
# Quantize to int4 and compile the model for the NPU
model = intel_npu_acceleration_library.compile(model, dtype=intel_npu_acceleration_library.int4)

query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

# Sampling-based generation, streamed token by token
generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

Error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[20], line 2
      1 print("Run inference")
----> 2 _ = model.generate(**generation_kwargs)

File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\torch\utils\_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\transformers\generation\utils.py:1758, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1750     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1751         input_ids=input_ids,
   1752         expand_size=generation_config.num_return_sequences,
   1753         is_encoder_decoder=self.config.is_encoder_decoder,
   1754         **model_kwargs,
   1755     )
   1757     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 1758     result = self._sample(
   1759         input_ids,
   1760         logits_processor=prepared_logits_processor,
   1761         logits_warper=prepared_logits_warper,
   1762         stopping_criteria=prepared_stopping_criteria,
...
File c:\Users\xiaoy\anaconda3\envs\nlp\Lib\site-packages\intel_npu_acceleration_library\backend\factory.py:147
--> 147     backend_lib.compile(self._mm, output_node)
    148     self.output_shape = self.get_output_tensor_shape()
    149     if len(self.output_shape) != 2:

OSError: [WinError -529697949] Windows Error 0xe06d7363

I just ran your code with no problem. What driver version do you have?

I recently installed Windows 11 and always keep it up to date.

Intel(R) AI Boost
Driver version: 31.0.100.1688

Thank you.

That is very old; it is no surprise it doesn't work, as int4 support for this library was only enabled in the latest driver release (32.0.100.2408).

Please install the new version: https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html
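For anyone hitting this, here is a minimal sketch of how you could check which NPU driver is installed before running. It assumes Windows with PowerShell available; get_npu_driver_version is a hypothetical helper written for this comment, not part of this library's API:

# Hypothetical helper: query the installed "Intel(R) AI Boost" (NPU) driver
# version via PowerShell/CIM. Generic OS tooling, not a library function.
import subprocess

def get_npu_driver_version():
    """Return the NPU driver version string, or None if not found."""
    cmd = [
        "powershell", "-NoProfile", "-Command",
        "(Get-CimInstance Win32_PnPSignedDriver | "
        "Where-Object { $_.DeviceName -like '*AI Boost*' }).DriverVersion",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    return out or None

version = get_npu_driver_version()
print(f"Installed NPU driver: {version}")
# int4 needs 32.0.100.2408 or newer; compare the numeric fields
if version and tuple(map(int, version.split("."))) < (32, 0, 100, 2408):
    print("Driver too old for int4 - please update.")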

Thank you, I'll try it now. Is there any way to ensure Microsoft provides up-to-date drivers via Windows Update? If this happened to me, it will be the case for many people, since I already try to keep the system up to date.

It runs, but running on the NPU is very slow. I tried to run the same model (not sure about its quantization) via Ollama on the CPU, and it is very fast. Could you explain why that is the case? Based on the information about Lunar Lake, the CPU only has 5 TOPS, which is significantly less than the 10 TOPS NPU of Meteor Lake, so I am not sure why. Is there a developer guide for the NPU?

I experimented with it a bit. I found that the battery power profile affects the performance of the NPU. Is there a guideline on how the NPU should be used? Thank you.
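For anyone reproducing this, a minimal sketch (assuming Windows) that prints the active power scheme with the built-in powercfg tool before benchmarking, so results on battery vs. plugged-in profiles can be told apart; this is generic OS tooling, not part of intel_npu_acceleration_library:

# Check the active Windows power scheme; a battery-saver profile can
# throttle the NPU and skew any timing numbers.
import subprocess

result = subprocess.run(
    ["powercfg", "/getactivescheme"], capture_output=True, text=True
)
print(result.stdout.strip())
# Example output:
# Power Scheme GUID: 381b4222-f694-41f0-9685-ff5bb260df2e  (Balanced)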

> I experimented with it a bit. I found that the battery power profile affects the performance of the NPU. Is there a guideline on how the NPU should be used? Thank you.

Yes, and the CPU SKU also makes a lot of difference. Also, we have just started working on pushing performance for these models much higher. Compared to a C++ implementation like Ollama, which has heavily hand-optimized custom kernels, here we use a PyTorch eager-mode backend that is less efficient. We also have a bunch of work in flight, in the driver, in OpenVINO, and in this library, to improve performance, such as using OV remote tensors (which will slash the runtime overhead that is slowing this library down), as well as driver improvements.

Also, to understand LLM performance, please have a look at this document: https://intel.github.io/intel-npu-acceleration-library/llm_performance.html
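To put numbers on the CPU vs. NPU comparison, here is a rough sketch that measures decode throughput in tokens per second. It reuses model and prefix from the script above and assumes greedy decoding so repeated runs are comparable:

# Rough tokens/second measurement for the compiled model.
import time

n_new = 128
start = time.perf_counter()
out = model.generate(
    input_ids=prefix,
    do_sample=False,       # deterministic, so runs are comparable
    max_new_tokens=n_new,
)
elapsed = time.perf_counter() - start
generated = out.shape[1] - prefix.shape[1]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")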

> Based on the information about Lunar Lake, the CPU only has 5 TOPS, which is significantly less than the 10 TOPS NPU of Meteor Lake.

Lunar Lake has not been released yet, though...

> Thank you, I'll try it now. Is there any way to ensure Microsoft provides up-to-date drivers via Windows Update? If this happened to me, it will be the case for many people, since I already try to keep the system up to date.

I agree. The Windows updates are pushed by OEMs (Asus, Dell, etc.), so there is no guarantee when they will be available. But we can check which driver is installed and notify the user if there is something odd there.

Thanks. The other question I have is whether "unified memory" can be supported in the future?

Sorry, I'm not allowed to discuss future products.