EricLBuehler / candle-vllm

Efficient platform for inference and serving of local LLMs, including an OpenAI-compatible API server.

Qwen 2 model broken

EricLBuehler opened this issue

To reproduce:

cargo run --release -- --port 2000 --hf-token TOK --model-id Qwen/Qwen2-1.5B qwen2 --repeat-last-n 64

And send a curl request, for example:

curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
         "model": "qwen2",
         "messages": [
             {"role": "user", "content": "Explain how to best learn Rust."}
         ],
         "temperature": 0.7,
         "max_tokens": 128,
         "stop": {"Single":"</s>"}
     }'

The Phi 3 model works, but Qwen 2 does not. I think this may be because num_attention_heads != num_key_value_heads: in Phi 3 they are equal, while in Qwen 2 they are 12 and 2, respectively. Gentle ping to @guoqingbao, could you please take a look?

Error:
cannot broadcast [1, 2, 128, 13] to [1, 12, 128, 13]
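
For reference, the mismatch can be reproduced in isolation. A minimal sketch, assuming the candle-core Tensor API, with tensor shapes mirroring the error above:

use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;
    // Qwen2-1.5B uses grouped-query attention: 12 query heads share 2 KV heads.
    let q = Tensor::zeros((1, 12, 128, 13), DType::F32, &dev)?;
    let kv = Tensor::zeros((1, 2, 128, 13), DType::F32, &dev)?;
    // Broadcasting can only expand size-1 dims, so 2 -> 12 fails with a
    // broadcast error like the one reported above.
    assert!(kv.broadcast_add(&q).is_err());
    Ok(())
}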

My fault! I thought candle's tensor broadcast method could handle broadcasting [x, a, x, x, ...] to [x, b, x, x, ...] where a > 1. Apparently, it requires stacking (repeating the tensor) in this case. I will fix this later.
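
For context, the usual fix for this grouped-query attention case is a repeat_kv step that tiles the KV heads up to the query head count before the attention product. A minimal sketch of such a helper against candle's Tensor API (candle-transformers ships a similar utility); the name repeat_kv and the n_rep parameter here are illustrative:

use candle_core::{Result, Tensor};

/// Repeat each KV head n_rep times so a [b, n_kv_heads, seq, dim] tensor
/// becomes [b, n_kv_heads * n_rep, seq, dim] (e.g. 2 heads -> 12 for Qwen2-1.5B).
fn repeat_kv(xs: Tensor, n_rep: usize) -> Result<Tensor> {
    if n_rep == 1 {
        return Ok(xs);
    }
    let (b, n_kv_heads, seq_len, head_dim) = xs.dims4()?;
    // Stack n_rep copies along the sequence dim, then fold them into the head
    // dim; each KV head ends up repeated for its group of consecutive query heads.
    Tensor::cat(&vec![&xs; n_rep], 2)?.reshape((b, n_kv_heads * n_rep, seq_len, head_dim))
}

With n_rep = num_attention_heads / num_key_value_heads (12 / 2 = 6 for Qwen2-1.5B), the key/value tensors match the query head count and the broadcast error disappears.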

Fixed in #52

Closing, as the issue is fixed in the latest update.