wangcx18 / llm-vscode-inference-server

An endpoint server for efficiently serving quantized open-source LLMs for code.

[Bug] TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'

DanFitzgibbon opened this issue

When using the inference server for the first time with the model TheBloke/CodeLlama-7B-Instruct-AWQ, the request fails with the following traceback:

  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PYTHON_ENV%}/lib/python3.11/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{%PATH%}/llm-vscode-inference-server/api_server.py", line 34, in generate
    sampling_params = SamplingParams(max_tokens=max_new_tokens,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'return_full_text'
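
The error comes from vLLM itself: in vllm==0.2.0, SamplingParams only accepts its own sampling arguments (max_tokens, temperature, top_p, ...), while return_full_text is a text-generation-inference-style parameter forwarded from the request. A minimal sketch that reproduces the same TypeError (assuming vllm==0.2.0 as in the pip freeze below):

    # Minimal reproduction sketch: passing the TGI-style keyword straight
    # through to vLLM's SamplingParams raises the TypeError shown above.
    from vllm import SamplingParams

    ok = SamplingParams(max_tokens=500, temperature=0.2)      # fine
    bad = SamplingParams(max_tokens=500, temperature=0.2,
                         return_full_text=False)              # TypeError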

Extension settings in VS Code:

    "llm.attributionWindowSize": 256,
    "llm.configTemplate": "Custom",
    "llm.contextWindow": 2048,
    "llm.fillInTheMiddle.enabled": false,
    "llm.fillInTheMiddle.middle": " <MID>",
    "llm.fillInTheMiddle.prefix": "<PRE> ",
    "llm.fillInTheMiddle.suffix": " <SUF>",
    "llm.lsp.logLevel": "debug",
    "llm.maxNewTokens": 500,
    "llm.modelIdOrEndpoint": "http://localhost:8000/generate",
    "llm.temperature": 0.2,
    "llm.tokenizer": {"repository": "TheBloke/CodeLlama-7B-Instruct-AWQ"},
    "llm.tokensToClear": ["<EOT>"],

pip freeze output:

aiosignal==1.3.1
anyio==3.7.1
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.0
click==8.1.7
cmake==3.27.6
fastapi==0.103.2
filelock==3.12.4
frozenlist==1.4.0
fsspec==2023.9.2
h11==0.14.0
httptools==0.6.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
jsonschema==4.19.1
jsonschema-specifications==2023.7.1
lit==17.0.1
MarkupSafe==2.1.3
mpmath==1.3.0
msgpack==1.0.7
networkx==3.1
ninja==1.11.1
numpy==1.26.0
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.2
pandas==2.1.1
protobuf==4.24.3
psutil==5.9.5
pyarrow==13.0.0
pydantic==1.10.13
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
ray==2.7.0
referencing==0.30.2
regex==2023.8.8
requests==2.31.0
rpds-py==0.10.3
safetensors==0.3.3
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
sympy==1.12
tokenizers==0.13.3
torch==2.0.1
tqdm==4.66.1
transformers==4.33.3
triton==2.0.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.6
uvicorn==0.23.2
uvloop==0.17.0
vllm==0.2.0
watchfiles==0.20.0
websockets==11.0.3
xformers==0.0.22

Adding:

return_full_text = parameters.pop("return_full_text", False)

to the generate() handler pops the keyword so it is not passed on to SamplingParams, which fixes the issue.
However, since I'm not completely familiar with the whole codebase, I'm cautious about opening a PR for this myself.
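
For anyone hitting the same thing, this is roughly where the workaround sits in the generate() handler of api_server.py (variable names are guessed from the traceback and the TGI-style payload; treat it as a sketch, not the actual code):

    # Sketch of the workaround inside generate() in api_server.py.
    # The handler unpacks a TGI-style "parameters" dict; popping the
    # unsupported key keeps it out of SamplingParams.
    request_dict = await request.json()
    prompt = request_dict.pop("inputs")
    parameters = request_dict.pop("parameters", {})
    max_new_tokens = parameters.pop("max_new_tokens", 64)
    return_full_text = parameters.pop("return_full_text", False)  # drop unsupported key
    sampling_params = SamplingParams(max_tokens=max_new_tokens, **parameters)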

I made a PR to fix this: #4