intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

"OSError: [WinError -529697949] Windows Error 0xe06d7363 “ - While running llama3 on NPU

sujikarNStarx opened this issue · comments

Description:
While trying to run inference with the Llama 3 model on the NPU, the following error shows up:
"OSError: [WinError -529697949] Windows Error 0xe06d7363".
We are able to run inference with TinyLlama and Phi-3. We are using the code provided in the examples.

Code Snapshot:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
import intel_npu_acceleration_library
import torch
import os
import transformers

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype = "float16"

PATH = os.path.join("models", model_id, dtype)
filename = os.path.join(PATH, "model.pth")
os.makedirs(PATH, exist_ok=True)

if not os.path.exists(filename):
    print("Compile model for the NPU")
    model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
    torch_dtype = torch.int8 if dtype == "int8" else torch.float16
    with torch.no_grad():
        model = intel_npu_acceleration_library.compile(model, dtype=torch_dtype)
    torch.save(model, filename)
    del model

print(f"Loading model from {filename}")

#print(tokenizer)

model = torch.load(filename).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token_id = pad_token_id

streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

print("Run inference with Llama3 on NPU\n")

query = input(">")

messages = [
    {
        "role": "system",
        "content": "You are an helpful chatbot that can provide information about the Intel NPU",
    },
    {"role": "user", "content": query},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

print(input_ids)
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.6,
    top_p=1,
    top_k=1,
    repetition_penalty=1,
    eos_token_id=terminators,
    do_sample=True,
    streamer=streamer,
)
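
For completeness: with TextIteratorStreamer, the generated text is usually read by iterating the streamer while generate() runs in a background thread. A minimal sketch of that pattern, assuming the variables from the snippet above:

# Sketch (assumption: model, input_ids, terminators, and streamer are the
# objects defined in the snippet above). generate() runs in a worker thread
# and the streamer is consumed in the main thread to print tokens as they arrive.
from threading import Thread

generation_kwargs = dict(
    input_ids=input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    streamer=streamer,
)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for text in streamer:
    print(text, end="", flush=True)
thread.join()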

The output up to the point where the error appears:

(eucapp) C:\Users\SHI-Labs-02\Documents\euc_hostess\NPU-backend>python npu_llama3.py
Loading model from models\meta-llama/Meta-Llama-3-8B-Instruct\float16\model.pth
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Run inference with Llama3 on NPU

tell me a joke
tensor([[128000, 128006, 9125, 128007, 271, 2675, 527, 459, 11190,
6369, 6465, 430, 649, 3493, 2038, 922, 279, 15984,
452, 6459, 128009, 128006, 882, 128007, 271, 73457, 757,
264, 22380, 128009, 128006, 78191, 128007, 271]])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128009 for open-end generation.
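
Side note: the attention-mask warning above can be avoided by passing an explicit attention mask and pad token id to generate(). A minimal sketch, assuming the tokenizer, model, and inputs from the snippet above:

# Sketch (assumption: same objects as in the snippet above). The prompt is a
# single unpadded sequence, so an all-ones mask matching input_ids is enough.
attention_mask = torch.ones_like(input_ids)
outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    streamer=streamer,
)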

The error:
Traceback (most recent call last):
File "C:\Users\SHI-Labs-02\Documents\euc_hostess\NPU-backend\npu_llama3.py", line 67, in
outputs = model.generate(
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\transformers\generation\utils.py", line 1758, in generate
result = self._sample(
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\transformers\generation\utils.py", line 2397, in _sample
outputs = self(
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\transformers\models\llama\modeling_llama.py", line 1183, in forward
logits = self.lm_head(hidden_states)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\intel_npu_acceleration_library\nn\linear.py", line 45, in forward
out = run_matmul(x, self.weight, None, self.op_id)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\intel_npu_acceleration_library\backend\runtime.py", line 97, in run_matmul
_model_cache[key] = deque([op_class(inC, outC, batch)])
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\intel_npu_acceleration_library\backend\linear.py", line 32, in init
self.compile(out)
File "C:\Users\SHI-Labs-02\miniconda3\envs\eucapp\lib\site-packages\intel_npu_acceleration_library\backend\factory.py", line 146, in compile
backend_lib.compile(self._mm, output_node)
OSError: [WinError -529697949] Windows Error 0xe06d7363

Compute snapshot:
[Screenshot: 2024-06-03, 10:17 AM]

Additional context
Is there a way to work around this by tweaking any hyperparameters?

Try to use the latest drivers from here: https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html

Also, try using int8: it is a big model that really stretches the system's capabilities in float16.
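
For reference, switching the example above to int8 only requires changing the dtype selection before compiling. A minimal sketch, assuming the same script as in the code snapshot:

# Sketch based on the compile block in the snippet above (assumption: model_id
# and the imports are the same). With dtype="int8" the library compiles the
# model with int8 weights, which should lower the memory pressure on the NPU.
dtype = "int8"
torch_dtype = torch.int8 if dtype == "int8" else torch.float16

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
with torch.no_grad():
    model = intel_npu_acceleration_library.compile(model, dtype=torch_dtype)
torch.save(model, os.path.join("models", model_id, dtype, "model.pth"))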

@alessandropalla Thank you for the suggestion. It worked: once the drivers were updated, we were able to run Llama 3 on the NPU.