bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.

Home Page: https://bentoml.com

bug: Docker images with GPTQ quantized models do not have auto-gptq or optimum installed

jeremyadamsfisher opened this issue

Describe the bug

Deploying an image built with openllm build --quantize gptq and bentoml containerize fails because auto-gptq and optimum are not installed in the image.

To reproduce

  1. openllm build TheBloke/Llama-2-70B-Chat-GPTQ --quantize gptq --backend pt
  2. bentoml containerize <BENTO ID>
  3. docker run --rm -ti <IMAGE NAME>
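
The missing packages can be confirmed directly from the built image (a quick check, assuming python is on the image's PATH and that the packages use their usual import names auto_gptq and optimum):

    docker run --rm -ti <IMAGE NAME> python -c "import auto_gptq, optimum"

In an affected image this exits with a ModuleNotFoundError instead of returning cleanly.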

Logs

2024-01-30T01:33:08+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 738, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
    on_startup()
  File "/usr/local/lib/python3.11/dist-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
    raise e
  File "/usr/local/lib/python3.11/dist-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/usr/local/lib/python3.11/dist-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/openllm/_runners.py", line 163, in __init__
    self.llm, self.config, self.model, self.tokenizer = llm, llm.config, llm.model, llm.tokenizer
                                                                         ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/openllm/_llm.py", line 457, in model
    model = openllm.serialisation.load_model(self, *self._model_decls, **self._model_attrs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/openllm/serialisation/__init__.py", line 63, in caller
    return getattr(importlib.import_module(f'.{serde}', 'openllm.serialisation'), fn)(llm, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/openllm/serialisation/transformers/__init__.py", line 111, in load_model
    raise OpenLLMException(
openllm_core.exceptions.OpenLLMException: GPTQ quantisation requires 'auto-gptq' and 'optimum' (Not found in local environment). Install it with 'pip install "openllm[gptq]" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/'
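
Until openllm build bakes these packages into the Bento, one possible stopgap (an untested sketch; Dockerfile.gptq is a hypothetical file name and <IMAGE NAME> is the tag produced by bentoml containerize) is to layer the install command from the error message on top of the generated image:

    # Dockerfile.gptq - extend the generated Bento image with the GPTQ extras
    FROM <IMAGE NAME>
    # The Bento image may default to a non-root user; pip needs write access
    USER root
    # Same command the error message suggests (CUDA 11.8 wheel index)
    RUN pip install "openllm[gptq]" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
    # Restore the image's non-root user afterwards (user name assumed to be bentoml)
    USER bentoml

Build and run the wrapper image in place of the original:

    docker build -f Dockerfile.gptq -t <IMAGE NAME>-gptq .
    docker run --rm -ti <IMAGE NAME>-gptq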

Environment

Environment variable

BENTOML_DEBUG=False
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=tracing.sample_rate=1.0
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.1.11
python: 3.11.7
platform: Linux-6.1.58+-x86_64-with-glibc2.31
uid_gid: 1034:1034

pip_packages
accelerate==0.26.1
aiohttp==3.9.3
aioprometheus==23.12.0
aiosignal==1.3.1
anyio==4.2.0
appdirs==1.4.4
asgiref==3.7.2
attrs==23.2.0
backoff==2.2.1
bentoml==1.1.11
bitsandbytes==0.41.3.post2
build==0.10.0
cattrs==23.1.2
certifi==2019.11.28
chardet==3.0.4
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
coloredlogs==15.0.1
contextlib2==21.6.0
cuda-python==12.3.0
datasets==2.16.1
dbus-python==1.2.16
deepmerge==1.1.1
Deprecated==1.2.14
dill==0.3.7
distlib==0.3.8
distro==1.9.0
einops==0.7.0
fastapi==0.109.0
fastcore==1.5.29
filelock==3.13.1
filetype==1.2.0
frozenlist==1.4.1
fs==2.4.16
fsspec==2023.10.0
ghapi==1.0.4
googleapis-common-protos==1.56.2
grpcio==1.60.0
h11==0.14.0
httpcore==1.0.2
httptools==0.6.1
httpx==0.26.0
huggingface-hub==0.20.3
humanfriendly==10.0
idna==2.8
importlib-metadata==6.11.0
inflection==0.5.1
Jinja2==3.1.3
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
markdown-it-py==3.0.0
MarkupSafe==2.1.4
mdurl==0.1.2
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.3
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==11.525.150
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
openllm==0.4.41
openllm-client==0.4.41
openllm-core==0.4.41
opentelemetry-api==1.20.0
opentelemetry-exporter-jaeger==1.20.0
opentelemetry-exporter-jaeger-proto-grpc==1.20.0
opentelemetry-exporter-jaeger-thrift==1.20.0
opentelemetry-exporter-otlp==1.20.0
opentelemetry-exporter-otlp-proto-common==1.20.0
opentelemetry-exporter-otlp-proto-grpc==1.20.0
opentelemetry-exporter-otlp-proto-http==1.20.0
opentelemetry-exporter-zipkin==1.20.0
opentelemetry-exporter-zipkin-json==1.20.0
opentelemetry-exporter-zipkin-proto-http==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-proto==1.20.0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
optimum==1.16.2
orjson==3.9.12
packaging==23.2
pandas==2.2.0
pathspec==0.12.1
pillow==10.2.0
pip-requirements-parser==32.0.1
pip-tools==7.3.0
platformdirs==4.1.0
prometheus-client==0.19.0
protobuf==3.20.3
psutil==5.9.8
pyarrow==15.0.0
pyarrow-hotfix==0.6
pydantic==1.10.13
Pygments==2.17.2
PyGObject==3.36.0
pyparsing==3.1.1
pyproject_hooks==1.0.0
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.8.2
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.4
PyYAML==6.0.1
pyzmq==25.1.2
quantile-python==1.1
ray==2.6.0
referencing==0.33.0
regex==2023.12.25
requests==2.22.0
requests-unixsocket==0.2.0
rich==13.7.0
rpds-py==0.17.1
safetensors==0.4.2
schema==0.7.5
scipy==1.12.0
sentencepiece==0.1.99
simple-di==0.1.5
six==1.14.0
sniffio==1.3.0
starlette==0.35.1
sympy==1.12
thrift==0.16.0
tokenizers==0.15.1
torch==2.1.2
tornado==6.4
tqdm==4.66.1
transformers==4.37.2
triton==2.1.0
typing_extensions==4.9.0
tzdata==2023.4
urllib3==1.25.8
uvicorn==0.27.0.post1
uvloop==0.19.0
virtualenv==20.25.0
vllm==0.2.6
watchfiles==0.21.0
websockets==12.0
wrapt==1.16.0
xformers==0.0.23.post1
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0

System information (Optional)

No response

+1 on that (seeing the same failure with multiple models)

Closing in favor of OpenLLM 0.6.