mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation

Home Page: https://llm.mlc.ai/

[Question] Why does the server ignore my request under speculative decoding?

Erxl opened this issue

Erxl commented:

❓ General Questions

main model: mistral-large-instruct-2407-q4f16_1
draft model: Mistral-7B-Instruct-v0.3-q4f16_1-MLC

I cannot use speculative decoding on my AMD GPU server. The server is running, but there is no response to any chat requests and no error output at all; I never see a log line like `INFO: 192.168.1.4:34425 - "POST /v1/chat/completions HTTP/1.1" 200 OK`. I have already updated ROCm to 6.2 and installed the latest pre-built mlc-llm Python package.
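
For anyone hitting the same symptom, a minimal way to make the hang visible is to send a plain OpenAI-style request with an explicit timeout. This is a sketch only: the host, port, and the exact model name passed in the payload are assumptions about this deployment, not values confirmed in the thread.

```python
import requests

# Assumed endpoint and model id for illustration; adjust to your deployment.
BASE_URL = "http://127.0.0.1:8000"
payload = {
    "model": "mistral-large-instruct-2407-q4f16_1",
    "messages": [{"role": "user", "content": "Say hello."}],
}

try:
    # A generous timeout distinguishes "server rejected the request" from
    # "server accepted the connection but never answered".
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
    print(r.status_code, r.json()["choices"][0]["message"]["content"])
except requests.exceptions.Timeout:
    print("Timed out: the server accepted the connection but never responded.")
```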

MasterJH5574 commented:

> but there is no response to any chat requests

Hi @Erxl, do you mind providing a bit more context? In particular, it would help if you could share some example code for a request that gets no response. Also, does the server work well when you don't use speculative decoding?
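
As a quick way to answer the second question, the base model can be exercised without speculative decoding through the Python engine API from the mlc-llm quickstart. Here the model string is an assumption (it should be the local path or identifier of the compiled main model); no draft model is attached, so nothing speculative is involved.

```python
from mlc_llm import MLCEngine

# Assumed identifier for the compiled main model; adjust to your local path.
model = "mistral-large-instruct-2407-q4f16_1"
engine = MLCEngine(model)

# Stream a short completion. If this works, the failure is specific to the
# speculative-decoding configuration rather than the model or ROCm setup.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Say hello."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()
```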

Erxl commented:

@MasterJH5574 I have solved the problem: using server mode instead of the default local mode fixes it.

MasterJH5574 commented:

@Erxl Thanks for the update, glad it worked out. Yes, local mode has a limited max batch size setting, so speculative decoding won't be enabled very effectively under it.
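
For readers landing here later, the resolution was to launch the engine in server mode. Below is a sketch of such a launch; the flag names (`--additional-models`, `--speculative-mode`, `--mode`) are assumptions based on my reading of the `mlc_llm serve` CLI and may differ across versions, so verify them with `mlc_llm serve --help` on your build.

```python
import subprocess

# Sketch only: flag names are assumptions and may differ by mlc-llm version.
# Note that this call blocks for as long as the server runs.
subprocess.run([
    "mlc_llm", "serve", "mistral-large-instruct-2407-q4f16_1",
    "--additional-models", "Mistral-7B-Instruct-v0.3-q4f16_1-MLC",  # draft model
    "--speculative-mode", "small_draft",
    "--mode", "server",  # the fix: server mode instead of the default local mode
])
```

The relevant difference, per the reply above, is the max batch size: local mode keeps it small for single-user use, while server mode allows a larger batch budget, which speculative decoding needs in order to engage effectively.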