NPU compiled Llama3 (int4) model not working if prompt size is large
sujikarNStarx opened this issue · comments
We are able to generate output for shorter prompts using the NPU-compiled Llama3 model, but if the prompt is large, the model doesn't generate any output.
Needed Support
- Is there a fixed prompt size?
- How can we increase the allowed prompt size?
- How can we use the maximum context length?
- If we are missing any hyperparameter, please suggest it.
Code used
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import intel_npu_acceleration_library
import torch
import os
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype = "int4"
PATH = os.path.join("models", model_id, dtype)
filename = os.path.join(PATH, "model.pth")
os.makedirs(PATH, exist_ok=True)
if not os.path.exists(filename):
    print("Compile model for the NPU")
    model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
    torch_dtype = torch.int8 if dtype == "int4" else torch.float16
    with torch.no_grad():
        model = intel_npu_acceleration_library.compile(model, dtype=torch_dtype)
    torch.save(model, filename)
    del model
print(f"Loading model from {filename}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = torch.load(filename).eval()
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
print("Run inference with Llama3 on NPU\n")
query = input(">")
DEFAULT_normal_PROMPT = """ You are a helpful, respectful, and honest Hostess for the EUC Conference 2024.
2024. Your main role is to provide information and support to attendees, ensuring clarity,
accuracy, and a welcoming tone.
Conference Agenda:
DAY 1 AGENDA:
Tuesday, June 11, 2024
Knox and Ridge facilities Tour: Transfer slot 1: 8:30 am;
Transfer slot 2: 9:30 am; Transfer slot 3:10:20 am; Transfer slot 4:11:20 am; Transfer slot 5:12:00 pm; Transfer slot 6:1:00 pm; Transfer slot 7:1:15 pm.
● Transfers from Delta Marriott to Knox and Ridge Facilities for those who opted to join the tour.
● Transportation from the Delta Marriott to the Knox and Ridge Facilities will be 30 minutes prior to tour start time
Location: Delta Marriott Lobb
Knox and Ridge tour registration timings : 7:15 am. to 3:30 pm.
● Registration
● Knox and Ridge tour Registration Location: Delta Marriott Lobby
Day 1 Breakfast Timings : 7:30 am. to 9:30 am.
● Continental Breakfast
Day 1 Breakfast Location: Delta Marriott
Knox & Ridge Tour Timings : 8:30 am to 3:40 pm.
Knox & Ridge Integration Center Tours to Optional
(You must be pre-registered. Please review your personalized appointment for your specific tour time)
1. Tour 1 8:30 am. to 10:55 am. 2. Tour 2 9:30 am. to 11:55 am. 3. Tour 3 10:20 am. to 12:35 pm. 4. Tour 4 11:20 am. to 1:35 pm. 5. Tour 5 12:00 pm. to 2:25 pm. 6. Tour 6 1:00 pm. to 3:25 pm. 7. Tour 7: 1:15 pm. to 3:40 pm.
Knox :
Discover why the world largest organizations work with SHI to accelerate time-to-value for end-user computing investments. Experience the thrill and see first-hand how we help organizations improve employee services, streamline the supply chain, and achieve financial, operational, and sustainability goals. Get your popcorn ready and enjoy the show as SHI shows off how we can create a world of services, just for you!
Ridge:
Home to the SHI Hardware Life cycle Management and Integrated Data Center Solutions offerings, SHI Ridge is a 400,000 sq ft facility for organizations that require End-to-end life cycle services.
Location: Knox & Ridge Integration Center Tours
"""
messages = [
    {
        "role": "system",
        # "content": DEFAULT_RAG_PROMPT,
        "content": DEFAULT_normal_PROMPT,
    },
    {"role": "user", "content": query},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, max_length=7200, return_tensors="pt"
).to(model.device)
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
outputs = model.generate(
    input_ids,
    max_length=7200,
    # max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=False,
    streamer=streamer,
)
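A note on the length settings in the call above: in Hugging Face `generate`, `max_length` counts the prompt plus the generated tokens, while `max_new_tokens` counts only the generated ones. The sketch below is a hypothetical helper (`new_token_budget` is not part of any library) that works out how many tokens remain for the reply. The 8192-token default is the published context length of Llama 3 8B and is only an assumption here; the NPU-compiled model may support less.

```python
def new_token_budget(prompt_tokens: int, max_length: int, context_window: int = 8192) -> int:
    """Return how many new tokens can still be generated for a prompt.

    max_length follows the Hugging Face convention: prompt + generated tokens.
    context_window is an assumed limit (8192 for Llama 3 8B), not a value
    reported by the NPU library.
    """
    if prompt_tokens < 0 or max_length < 0 or context_window < 0:
        raise ValueError("token counts must be non-negative")
    # The effective ceiling is whichever limit is reached first.
    ceiling = min(max_length, context_window)
    return max(0, ceiling - prompt_tokens)


if __name__ == "__main__":
    # In the script above, prompt_tokens would be input_ids.shape[-1].
    print(new_token_budget(prompt_tokens=700, max_length=7200))
```

Printing `input_ids.shape[-1]` before calling `generate` is a cheap way to confirm how long the prompt actually is once the chat template has been applied.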
Error
No error is shown, but no response is generated either.
I can reproduce your issue. The problem is that the Llama 3 language model head is so large that compiling it takes a long time, which completely degrades the user experience. We are working to improve the compilation time of such large kernels. I'll keep you posted.
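To put the size of the language model head in perspective, here is a rough back-of-the-envelope comparison. It assumes the published configuration values for the two checkpoints mentioned in this thread (vocabulary size and hidden size) and a bias-free output projection; `lm_head_params` is an illustrative helper, not a library function. It only shows weight counts and says nothing about compile time itself.

```python
def lm_head_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a bias-free LM head (hidden_size x vocab_size)."""
    return vocab_size * hidden_size


# Published config values; treat them as assumptions for other checkpoints.
LLAMA3_8B = {"vocab_size": 128_256, "hidden_size": 4096}
LLAMA2_7B = {"vocab_size": 32_000, "hidden_size": 4096}

llama3_head = lm_head_params(**LLAMA3_8B)
llama2_head = lm_head_params(**LLAMA2_7B)

print(f"Llama 3 8B LM head: {llama3_head:,} weights")
print(f"Llama 2 7B LM head: {llama2_head:,} weights")
print(f"ratio: {llama3_head / llama2_head:.2f}x")
```

Under these assumptions the Llama 3 head has roughly four times as many weights as the Llama 2 head, purely because of the larger vocabulary.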
Noted. Thanks for the swift response. I will wait for new updates.
I'm seeing a similar issue even with llama2-7b. I can run with sequence length=128, but seq_len=2048 hangs and never returns.
I was able to run this same example successfully about a month ago with a previous driver version.
Steps to reproduce
python profile_llm.py --model meta-llama/Llama-2-7b-hf --context-size 2048 --dtype int8
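To narrow down where the hang starts, one option is to sweep the context size with a timeout around each run. The sketch below is only an illustration: `build_command` and `run_with_timeout` are hypothetical helpers, the flags are copied from the command above, and the context sizes and the 30-minute timeout are arbitrary placeholders.

```python
import subprocess
import sys


def build_command(context_size: int, model: str = "meta-llama/Llama-2-7b-hf",
                  dtype: str = "int8") -> list[str]:
    """Build the profile_llm.py invocation quoted above for one context size."""
    return [sys.executable, "profile_llm.py", "--model", model,
            "--context-size", str(context_size), "--dtype", dtype]


def run_with_timeout(command: list[str], timeout_s: float) -> str:
    """Run a command and return 'ok', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Placeholder sizes and timeout; stop at the first size that does not finish.
    for size in (128, 256, 512, 1024, 2048):
        status = run_with_timeout(build_command(size), timeout_s=1800)
        print(size, status)
        if status != "ok":
            break
```

Reporting the largest size that completes, together with the driver version, would make it easier to compare against the earlier driver that worked.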