Intel® NPU Acceleration Library

AttributeError: property 'pad_token_id' of 'ChatGLMTokenizer' object has no setter

andyluo7 opened this issue · comments

Describe the bug
An error occurs when running profile_llm.py for chatglm3-6b. Other models such as llama2 7b and qwen 7b work on the NPU.

To Reproduce
Steps to reproduce the behavior:

  1. (tmp) C:\Users\andyl\intel-npu-acceleration-library\script>python profile_llm.py --dtype float16 --device NPU --model THUDM/chatglm3-6b --context-size 128

Expected behavior
Should be able to generate the benchmark result

Screenshots

Profiling THUDM/chatglm3-6b with context size 128
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Traceback (most recent call last):
File "C:\Users\andyl\intel-npu-acceleration-library\script\profile_llm_orig.py", line 153, in
main(
File "C:\Users\andyl\intel-npu-acceleration-library\script\profile_llm_orig.py", line 28, in main
tokenizer.pad_token_id = tokenizer.eos_token_id
^^^^^^^^^^^^^^^^^^^^^^
AttributeError: property 'pad_token_id' of 'ChatGLMTokenizer' object has no setter

Desktop (please complete the following information):

  • OS: Windows 11

Hi,
ChatGLM's tokenizer differs from the standard transformers one, so it needs to be handled with care:

  1. ChatGLMTokenizer already has a pad token, so line 28 of profile_llm.py can be commented out (link); see the sketch after this list.
  2. ChatGLM apparently does not support static shape inference, which is how we try to get consistent profiling.
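
For reference, here is a minimal sketch (hypothetical, not the current profile_llm.py code) of a guard that only assigns the pad token when a tokenizer allows it, so tokenizers like ChatGLMTokenizer that expose pad_token_id as a read-only property do not break the script:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

# Only set a pad token when the tokenizer does not already define one
if tokenizer.pad_token_id is None:
    try:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    except AttributeError:
        # e.g. ChatGLMTokenizer exposes pad_token_id as a read-only property
        pass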

If you want to run it, you can slightly modify the example script as below (I just verified it works):

from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import intel_npu_acceleration_library
import torch

model_id = "THUDM/chatglm3-6b"

# Load the model in eval mode; trust_remote_code is required for ChatGLM's custom code
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)


print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

# Read a prompt from stdin and tokenize it to input ids
query = input("Ask something: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]


generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
)

print("Run inference")
_ = model.generate(**generation_kwargs)

Please refer to this document to understand model performance, and consider that this library is still a WIP, so performance is expected to improve significantly with upcoming library and driver releases :)

@alessandropalla, thanks for answering. I can run the example with chatglm3-6b using the script modification, but profile_llm.py still does not work. Is it possible to update profile_llm.py to support chatglm3-6b?