EricLBuehler / candle-vllm

Efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.

Qwen 2 model broken

EricLBuehler opened this issue

To reproduce:

cargo run --release -- --port 2000 --hf-token TOK --model-id Qwen/Qwen2-1.5B qwen2 --repeat-last-n 64

And send a curl request, for example:

curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
         "model": "qwen2",
         "messages": [
             {"role": "user", "content": "Explain how to best learn Rust."}
         ],
         "temperature": 0.7,
         "max_tokens": 128,
         "stop": {"Single":"</s>"}
     }'

The Phi 3 model works, but Qwen 2 does not. I think this may be because num_attention_heads != num_key_value_heads: in Phi 3 they are equal, while in Qwen 2 they are 12 and 2, respectively. Gentle ping to @guoqingbao, could you please take a look?

Error:
cannot broadcast [1, 2, 128, 13] to [1, 12, 128, 13]
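
For reference, the mismatch can be reproduced in isolation. A minimal sketch, assuming the candle-core Tensor API, with tensor shapes mirroring the error above:

use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    // Qwen2-1.5B uses grouped-query attention: 12 query heads share 2 KV heads.
    let q = Tensor::zeros((1, 12, 128, 13), DType::F32, &dev)?;
    let kv = Tensor::zeros((1, 2, 128, 13), DType::F32, &dev)?;
    // Broadcasting can only expand size-1 dims, so 2 -> 12 fails with a
    // broadcast error like the one reported above.
    assert!(kv.broadcast_add(&q).is_err());
    Ok(())
}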

My fault! I thought candle's tensor broadcast method could handle broadcasting [x, a, x, x, ...] to [x, b, x, x, ...] where a > 1. Apparently, it requires stacking (repeating the tensor) in this case. I will fix this later.
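
For context, the usual fix for this grouped-query attention case is a repeat_kv step that tiles the KV heads up to the query head count before the attention product. A minimal sketch of such a helper against candle's Tensor API (candle-transformers ships a similar utility); the name repeat_kv and the n_rep parameter here are illustrative:

use candle_core::{Result, Tensor};

/// Repeat each KV head n_rep times so a [b, n_kv_heads, seq, dim] tensor
/// becomes [b, n_kv_heads * n_rep, seq, dim] (e.g. 2 heads -> 12 for Qwen2-1.5B).
fn repeat_kv(xs: Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(xs);
    }
    let (b, n_kv_heads, seq_len, head_dim) = xs.dims4()?;
    // Stack n_rep copies along the sequence dim, then fold them into the head
    // dim; each KV head ends up repeated for its group of consecutive query heads.
    Tensor::cat(&vec![&xs; n_rep], 2)?.reshape((b, n_kv_heads * n_rep, seq_len, head_dim))
}

With n_rep = num_attention_heads / num_key_value_heads (12 / 2 = 6 for Qwen2-1.5B), the key/value tensors match the query head count and the broadcast error disappears.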

Fixed in #52

Closing, as the issue is fixed in the latest update.