ModelCloud / GPTQModel

GPTQ-based LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.


[BUG] tip/main has regression on optional vllm depend

Qubitium opened this issue · comments

vLLM support was merged into main, but runtime dependency issues remain. vLLM is a large package, so we will not force a dependency on it. vLLM should be imported at runtime and, if it is not installed, raise an error that prompts users to install it (see the sketch after the traceback below).

Reported by @FrederikHandberg

```
Traceback (most recent call last):
  File "/workspace/GPTQModel/quant.py", line 2, in <module>
    from gptqmodel import GPTQModel, QuantizeConfig
  File "/workspace/GPTQModel/gptqmodel/__init__.py", line 1, in <module>
    from .models import GPTQModel
  File "/workspace/GPTQModel/gptqmodel/models/__init__.py", line 1, in <module>
    from .auto import MODEL_MAP, GPTQModel
  File "/workspace/GPTQModel/gptqmodel/models/auto.py", line 5, in <module>
    from .baichuan import BaiChuanGPTQ
  File "/workspace/GPTQModel/gptqmodel/models/baichuan.py", line 1, in <module>
    from .base import BaseGPTQModel
  File "/workspace/GPTQModel/gptqmodel/models/base.py", line 36, in <module>
    from ..utils.vllm import load_model_by_vllm, vllm_generate
  File "/workspace/GPTQModel/gptqmodel/utils/vllm.py", line 12, in <module>
    def convert_hf_params_to_vllm(hf_params: Dict[str, Any]) -> SamplingParams:
NameError: name 'SamplingParams' is not defined
```
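
The intended fix is to defer the vLLM import to call time. Below is a minimal sketch of such a guard; the `VLLM_AVAILABLE` flag and the exact parameter mapping are illustrative assumptions, not the actual GPTQModel implementation:

```python
# Minimal sketch of a runtime guard for the optional vLLM dependency.
# VLLM_AVAILABLE and the parameter mapping are illustrative assumptions,
# not the actual GPTQModel code.
from __future__ import annotations  # defer annotation evaluation so the
# SamplingParams hint below cannot raise NameError when vLLM is absent

from typing import Any, Dict

try:
    from vllm import SamplingParams
    VLLM_AVAILABLE = True
except ImportError:
    VLLM_AVAILABLE = False


def convert_hf_params_to_vllm(hf_params: Dict[str, Any]) -> SamplingParams:
    if not VLLM_AVAILABLE:
        # Fail at call time with an actionable message, not at import time.
        raise ImportError(
            "vLLM is not installed. Run `pip install vllm` to enable "
            "vLLM-backed inference."
        )
    # Map the subset of HF generation kwargs that vLLM understands.
    return SamplingParams(
        temperature=hf_params.get("temperature", 1.0),
        top_p=hf_params.get("top_p", 1.0),
        max_tokens=hf_params.get("max_new_tokens", 16),
    )
```

Note that the traceback fails on the `def` line itself: without deferred annotations, Python evaluates the `SamplingParams` return hint at import time, which is exactly the `NameError` reported. The availability check then turns a hard import failure into an actionable error for users who have not installed vLLM.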

@FrederikHandberg The vLLM regression on main should now be fixed. We will work on the 8-bit issue next; please use 4-bit for now. Our tests have shown that 4-bit is as accurate as 8-bit and much more performant.
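
For reference, a hedged sketch of selecting 4-bit quantization as suggested, using the `GPTQModel`/`QuantizeConfig` API from the import above; the model path is a placeholder and `group_size=128` is a common-default assumption, so check the GPTQModel docs for your version:

```python
# Illustrative only: switch to 4-bit quantization per the advice above.
# The model path is a placeholder; group_size=128 is a common GPTQ default.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("/path/to/model", quant_config)  # hypothetical path
```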