PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.

Exllama kernel does not support query

bp020108 opened this issue

Can anyone please help with the error below?

2024-02-08 02:47:00,339 - INFO - run_localGPT.py:132 - Loaded embeddings from hkunlp/instructor-large
2024-02-08 02:47:00,408 - INFO - run_localGPT.py:60 - Loading Model: TheBloke/guanaco-65B-GPTQ, on: cuda
2024-02-08 02:47:00,408 - INFO - run_localGPT.py:61 - This action can take a few minutes!
2024-02-08 02:47:00,409 - INFO - load_models.py:94 - Using AutoGPTQForCausalLM for quantized models
2024-02-08 02:47:00,637 - INFO - load_models.py:101 - Tokenizer loaded
2024-02-08 02:47:00,992 - INFO - _base.py:827 - lm_head not been quantized, will be ignored when make_quant.
2024-02-08 02:47:03,672 - INFO - modeling.py:879 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
Traceback (most recent call last):
  File "/home/attcloud/miniconda3/LLAMA/localchat/run_localGPT.py", line 285, in <module>
    main()
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/attcloud/miniconda3/LLAMA/localchat/run_localGPT.py", line 252, in main
    qa = retrieval_qa_pipline(device_type, use_history, promptTemplate_type=model_type)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/attcloud/miniconda3/LLAMA/localchat/run_localGPT.py", line 142, in retrieval_qa_pipline
    llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME, LOGGING=logging)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/attcloud/miniconda3/LLAMA/localchat/run_localGPT.py", line 72, in load_model
    model, tokenizer = load_quantized_model_qptq(model_id, model_basename, device_type, LOGGING)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/attcloud/miniconda3/LLAMA/localchat/load_models.py", line 103, in load_quantized_model_qptq
    model = AutoGPTQForCausalLM.from_quantized(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/auto_gptq/modeling/_base.py", line 902, in from_quantized
    cls.fused_attn_module_type.inject_to_model(
  File "/home/miniconda3/envs/GPT/lib/python3.11/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 163, in inject_to_model
    raise ValueError("Exllama kernel does not support query/key/value fusion with act-order. Please either use inject_fused_attention=False or disable_exllama=True.")
ValueError: Exllama kernel does not support query/key/value fusion with act-order. Please either use inject_fused_attention=False or disable_exllama=True.
(GPT_1) -vm:~/miniconda3/LLAMA/localchat$
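The last line of the traceback names two possible workarounds itself: pass `inject_fused_attention=False` or `disable_exllama=True` to the `AutoGPTQForCausalLM.from_quantized(...)` call in `load_models.py` (line 103 in the traceback). Below is a minimal sketch of that adjustment, not the authoritative fix: only the two flags come from the error message; the surrounding keyword arguments are illustrative assumptions and may differ from what localGPT actually passes, and whether each flag is accepted depends on the installed auto-gptq version.

```python
# Sketch only: stop the exllama kernel from being asked to fuse q/k/v on an
# act-order GPTQ model. Only the two flags noted below come from the ValueError;
# every other argument here is an illustrative assumption.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_id,                        # e.g. "TheBloke/guanaco-65B-GPTQ"
    model_basename=model_basename,
    use_safetensors=True,
    device_map="auto",
    use_triton=False,
    # max_memory={0: "22GiB", "cpu": "64GiB"},  # optional, relates to the 90%/10% log line above
    inject_fused_attention=False,    # option 1 from the error: skip fused-attention injection
    # disable_exllama=True,          # option 2 from the error: fall back to the non-exllama kernel
)
```

If the installed auto-gptq version rejects either keyword, check the `from_quantized` signature of that version for the equivalent option before changing anything else.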