Mixtral-8x7B-Instruct-v0.1-GPTQ AssertionError
paolovic opened this issue · comments
System Info
Name: optimum
Version: 1.18.0.dev0
Name: transformers
Version: 4.36.0
Name: auto-gptq
Version: 0.6.0.dev0+cu118
CUDA Version: 11.8
Python 3.8.17
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Hi,
I am trying to deploy Mixtral-8x7B-Instruct-v0.1-GPTQ in 4-bit precision with Ray.
Unfortunately, it keeps failing with the following error message:
The deployment failed to start 3 times in a row. This may be due to a problem with its constructor or initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:Mixtral_8x7B:ModelAPI.initialize_and_get_metadata() (pid=x, ip=x, actor_id=x, repr=<ray.serve._private.replica.ServeReplica:Mixtral_8x7B:ModelAPI object at x>)
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 442, in initialize_and_get_metadata
raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 430, in initialize_and_get_metadata
await self._initialize_replica()
File "/ray_env/lib64/python3.8/site-packages/ray/serve/_private/replica.py", line 190, in initialize_replica
await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
File "/ray_env/lib64/python3.8/site-packages/ray/serve/api.py", line 243, in __init__
cls.__init__(self, *args, **kwargs)
File "/ray/serve_mixtral.py", line 32, in __init__
self._pipe = pipeline("text-generation", model=self._path,
File "/ray_env/lib64/python3.8/site-packages/transformers/pipelines/__init__.py", line 870, in pipeline
framework, model = infer_framework_load_model(
File "/ray_env/lib64/python3.8/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
model = model_class.from_pretrained(model, **kwargs)
File "/ray_env/lib64/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/ray_env/lib64/python3.8/site-packages/transformers/modeling_utils.py", line 3523, in from_pretrained
model = quantizer.convert_model(model)
File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 229, in convert_model
self._replace_by_quant_layers(model, layers_to_be_replaced)
File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 298, in _replace_by_quant_layers
self._replace_by_quant_layers(child, names, name + "." + name1 if name != "" else name1)
[Previous line repeated 1 more time]
File "/ray_env/lib64/python3.8/site-packages/optimum/gptq/quantizer.py", line 282, in _replace_by_quant_layers
new_layer = QuantLinear(
File "/ray_env/lib64/python3.8/site-packages/auto_gptq/nn_modules/qlinear/qlinear_exllama.py", line 68, in __init__
assert outfeatures % 32 == 0
AssertionError
The AutoGPTQ maintainers say it's an issue with optimum.
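For context, the assertion comes from a shape constraint in the exllama QuantLinear: it requires every quantized linear layer to have out_features divisible by 32. Mixtral's MoE router (block_sparse_moe.gate) projects hidden states onto num_local_experts = 8 logits, so if that layer gets swapped for a QuantLinear during conversion, the check fails. An illustrative sketch of the constraint (not the library's actual code):

```python
def exllama_supports(out_features: int) -> bool:
    # Mirrors the failing `assert outfeatures % 32 == 0` in
    # auto_gptq/nn_modules/qlinear/qlinear_exllama.py.
    return out_features % 32 == 0

# A typical Mixtral attention projection (hidden_size = 4096) passes,
# but the MoE router's 8 expert logits do not.
print(exllama_supports(4096))  # True
print(exllama_supports(8))     # False -> AssertionError during convert_model
```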
Thank you in advance
Expected behavior
The model deploys without errors.
I am encountering this issue with AutoGPTQ and Mixtral as well, and I am seeing a similar error with AutoAWQ and Mixtral:
ValueError: OC is not multiple of cta_N = 64
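The AutoAWQ error points at an analogous limit: the GEMM kernel processes output channels (OC) in tiles of cta_N = 64 columns, so a layer whose output-channel count is not a multiple of 64 (again, the 8-way MoE router is the obvious suspect) is rejected. A sketch of the check, illustrative only and assuming the cta_N = 64 value reported in the error message:

```python
CTA_N = 64  # tile width reported in the AutoAWQ error message

def awq_gemm_supports(out_channels: int) -> bool:
    # Mirrors the "OC is not multiple of cta_N" check: the kernel
    # tiles the output-channel dimension in blocks of CTA_N columns.
    return out_channels % CTA_N == 0

print(awq_gemm_supports(4096))  # True for a 4096-wide projection
print(awq_gemm_supports(8))     # False -> ValueError in AutoAWQ
```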
I am also facing the same issue. Any progress?
It seems that if you use AutoGPTQ/AutoAWQ directly, you can get something working:
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0")
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(model_path)
Hi, can you provide a minimal code snippet to reproduce this issue, and a link to the original issue in AutoGPTQ?
Hi @IlyasMoutawwakil ,
AutoGPTQ/AutoGPTQ#486
There is also a code snippet provided.
I am almost certain that using AutoGPTQForCausalLM will solve my problem. As soon as I have some time, I will provide a snippet myself.