[Bug] Speculative decoding with 2 additional models
bethalianovike opened this issue · comments
bethalianovike commented
🐛 Bug
❓ General Questions
Based on https://llm.mlc.ai/docs/deploy/rest.html#id5, we can use more than one additional model in speculative decoding mode.
But when I request a response via a REST API POST, I get the following error message.
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so
[2024-08-13 16:25:17] INFO engine_base.py:143: Using library model: /home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so
INFO: Started server process [1353236]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO: 127.0.0.1:39658 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/miniconda3/envs/mlc-chat/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/home/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
background_engine_->Step();
^^^^^^^^^^^^^^^^^^
File "/home/mlc-llm/cpp/serve/engine.cc", line 629, in mlc::llm::serve::EngineImpl::Step()
CHECK(request_stream_callback_ != nullptr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
tvm.error.InternalError: Traceback (most recent call last):
1: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
at /home/mlc-llm/cpp/serve/threaded_engine.cc:182
0: mlc::llm::serve::EngineImpl::Step()
at /home/mlc-llm/cpp/serve/engine.cc:629
File "/home/mlc-llm/cpp/serve/engine.cc", line 640
InternalError: Check failed: (estate_->running_queue.empty()) is false: Internal assumption violated: It is expected that an engine step takes at least one action (e.g. prefill, decode, etc.) but it does not.
To Reproduce
Steps to reproduce the behavior:
python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q0f16.so" --mode "server" --speculative-mode "small_draft" --port 8001
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"model": "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16",
"additional-models": ["/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1", "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q0f16"],
"messages": [
{"role": "user", "content": "What is Alaska famous of? Please elaborate in detail."}
]
}' \
http://127.0.0.1:8001/v1/chat/completions
Expected behavior
Generate the response.
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.5
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): RTX 4090
- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information:
Additional context
Ruihang Lai commented
Thank you @bethalianovike for reporting. Though the interface supports passing in multiple additional models, we currently support only one additional model for speculative decoding. We will update the documentation to avoid this confusion.
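Until multi-model speculative decoding lands, a workaround consistent with the comment above is to pass only a single draft model. A sketch of the serve command under that assumption (model paths and flags reused from the reproduction steps in this issue; not an official recommendation):

```shell
# Sketch: speculative decoding with a single draft model, which is what the
# engine currently supports. Paths are the ones from the repro above; the
# second TinyLlama entry is simply dropped from --additional-models.
python3 -m mlc_llm serve "/home/mlc-llm/dist/Llama-2-7b-chat-hf-q0f16" \
  --model-lib "/home/mlc-llm/dist/libs/Llama-2-7b-chat-hf-q0f16.so" \
  --additional-models "/home/mlc-llm/dist/TinyLlama-1.1B-Chat-v1.0-q4f16_1","/home/mlc-llm/dist/libs/TinyLlama-1.1B-Chat-v1.0-q4f16_1.so" \
  --mode "server" --speculative-mode "small_draft" --port 8001
```

The request body would then also list only the one additional model (or omit the field), matching what the engine was started with.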
Ruihang Lai commented
Updated docs in #2841. Support for multiple additional models is planned as a future feature.