intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

neuralchat int4 quantization failing during inference

kta-intel opened this issue

Running the following code:

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from intel_extension_for_transformers.neural_chat.config import LoadingModelConfig
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1",
                        optimization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4_fullrange"), 
                        loading_config=LoadingModelConfig(use_llm_runtime=False))
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)

Results in the following error:

2024-02-07 12:07:33,722 - root - ERROR - model.generate exception: If `eos_token_id` is defined, make sure that `pad_token_id` is defined.
2024-02-07 12:07:33 [ERROR] neuralchat error: Model inference failed

It seems the model quantizes and the chatbot builds successfully, but inference fails during chatbot.predict().
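
For context, the error is raised by Hugging Face generate(): whenever eos_token_id is set, generate() also expects a pad_token_id, and this model's tokenizer apparently does not define one. A minimal sketch of the usual workaround in plain transformers (outside neural_chat, so whether the same setting can be passed through PipelineConfig is an open question) would be:

# Sketch only: reuse the EOS token as the padding token so generate()
# no longer complains about a missing pad_token_id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v3-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
model.generation_config.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer("Tell me about Intel Xeon Scalable Processors.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If the chatbot's underlying model and tokenizer are reachable after build_chatbot, the same pad_token_id assignment should apply, but the exact attribute names on the chatbot object would need checking.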