mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

Home Page: https://llm.mlc.ai/

[Bug] Speculative decoding with 2 additional models

bethalianovike opened this issue · comments

πŸ› Bug

❓ General Questions

Based on https://llm.mlc.ai/docs/deploy/rest.html#id5, we can use more than one additional model when running in speculative decoding mode.
But when I request a response via a REST API POST, I get the following error message.

[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so
INFO:     Started server process [1353236]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO:     127.0.0.1:39658 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
    background_engine_->Step();
              ^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/cpp/serve/engine.cc", line 629, in mlc::llm::serve::EngineImpl::Step()
    CHECK(request_stream_callback_ != nullptr)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
tvm.error.InternalError: Traceback (most recent call last):
  1: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /home/mlc-llm/cpp/serve/threaded_engine.cc:182
  0: mlc::llm::serve::EngineImpl::Step()
        at /home/mlc-llm/cpp/serve/engine.cc:629
  File "/home/mlc-llm/cpp/serve/engine.cc", line 640
InternalError: Check failed: (estate_->running_queue.empty()) is false: Internal assumption violated: It is expected that an engine step takes at least one action (e.g. prefill, decode, etc.) but it does not.

To Reproduce

Steps to reproduce the behavior:

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so" \
  --mode "server" \
  --speculative-mode "small_draft" \
  --port 8001
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model":  "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16",
        "additional-models": ["/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1", "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16"],
        "messages": [
            {"role": "user", "content": "What is Alaska famous of? Please elaborate in detail."}
        ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions
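
Side note (an assumption on my part, not confirmed anywhere in this thread): the 422 Unprocessable Entity in the log suggests the request body itself was rejected before reaching the engine. Since the draft models are already configured at server launch via --additional-models, a sketch of a request that sticks to the documented chat completions fields would be:

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model":  "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16",
        "messages": [
            {"role": "user", "content": "What is Alaska famous of? Please elaborate in detail."}
        ]
  }' \
  http://127.0.0.1:8001/v1/chat/completions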

Expected behavior

The request succeeds and the server generates a response.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.5
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 4090
  • How you installed MLC-LLM (conda, source): source
  • How you installed TVM-Unity (pip, source): source
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

Thank you @bethalianovike for reporting. Though the interface supports passing in multiple additional models, we only support one additional model for spec decoding right now. We will update the documentation to avoid this confusion.

Updated docs in #2841. Support for multiple additional models is planned as a future feature.
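
For reference, a sketch of a launch command that stays within the currently supported single-draft-model setup (derived directly from the reproduction command above, keeping only the first TinyLlama pair; paths are the reporter's):

python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
  --mode "server" \
  --speculative-mode "small_draft" \
  --port 8001

The "model" and "messages" fields in the request body stay the same; as far as I can tell, the draft model is picked up from the server-side engine configuration rather than from the request.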