neuralchat int4 quantization failing during inference
kta-intel opened this issue
Kevin Ta commented
Running the following code:
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from intel_extension_for_transformers.neural_chat.config import LoadingModelConfig
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1",
                        optimization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4_fullrange"),
                        loading_config=LoadingModelConfig(use_llm_runtime=False))
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)
Results in error:
2024-02-07 12:07:33,722 - root - ERROR - model.generate exception: If `eos_token_id` is defined, make sure that `pad_token_id` is defined.
2024-02-07 12:07:33 [ERROR] neuralchat error: Model inference failed
It seems the model quantizes and the chatbot builds successfully, but inference fails once chatbot.predict() is called.
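For context, the message comes from transformers' generation utilities: when a model's config defines eos_token_id but no pad_token_id, generate() raises exactly this error. As a minimal sketch of the usual workaround (this loads the model directly with transformers rather than through build_chatbot, so it sidesteps neural_chat entirely; it is an illustration of the underlying issue, not the library's documented fix), passing pad_token_id explicitly silences it, since decoder-only models commonly reuse the EOS token for padding:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the generic transformers-level workaround: supply pad_token_id
# to generate() so it no longer complains that only eos_token_id is set.
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-1")
model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-1")
inputs = tokenizer("Tell me about Intel Xeon Scalable Processors.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Whether neural_chat exposes a hook to set pad_token_id on the tokenizer or generation config it builds internally is unclear from this repro, which is presumably the actual bug here.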