Qwen 2 model broken
EricLBuehler opened this issue · comments
To reproduce:
cargo run --release -- --port 2000 --hf-token TOK --model-id Qwen/Qwen2-1.5B qwen2 --repeat-last-n 64
And send a curl request, for example:
curl -X POST "http://127.0.0.1:2000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer YOUR_API_KEY" -d '{
"model": "qwen2",
"messages": [
{"role": "user", "content": "Explain how to best learn Rust."}
],
"temperature": 0.7,
"max_tokens": 128,
"stop": {"Single":"</s>"}
}'
The Phi 3 model works, but Qwen 2 does not. I think this may be because num_attention_heads != num_key_value_heads: in Phi 3 they are equal, but in Qwen 2 they are 12 and 2, respectively. Gentle ping to @guoqingbao, could you please take a look?
Error:
cannot broadcast [1, 2, 128, 13] to [1, 12, 128, 13]
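To illustrate why these two shapes are rejected, here is a minimal sketch of the usual NumPy-style broadcast rule, which the error message suggests candle also follows. The function name `can_broadcast` is hypothetical and is not part of candle's API; it only checks shapes and touches no tensor data.

```rust
/// Returns true if a tensor of shape `from` can be broadcast to shape `to`.
/// Assumed rule (NumPy-style): after right-aligning the two shapes, each
/// source dimension must either equal the target dimension or be 1.
/// Hypothetical helper for illustration; not a candle function.
fn can_broadcast(from: &[usize], to: &[usize]) -> bool {
    // A source with more dimensions than the target can never be broadcast.
    if from.len() > to.len() {
        return false;
    }
    // Compare dimensions from the last axis backwards; missing leading
    // source dimensions are treated as 1 and therefore always match.
    from.iter()
        .rev()
        .zip(to.iter().rev())
        .all(|(&f, &t)| f == t || f == 1)
}
```

Under this rule the head axis is the problem: 2 is neither 1 nor 12, so `can_broadcast(&[1, 2, 128, 13], &[1, 12, 128, 13])` is false, whereas a source with a head axis of 1 would be accepted.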
My fault! I thought the candle tensor broadcast method could handle broadcasting [x, a, x, x, ...] to [x, b, x, x, ...] where a > 1. Apparently it requires stacking in this case. I will fix this later.
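A rough sketch of what "stacking" could mean here, written against a plain row-major buffer rather than candle tensors: each key/value head is copied `n_rep` times along the head axis so that the head count matches the query heads. This is only an illustration under those assumptions, not the change made in the actual fix; `repeat_kv` and its parameters are hypothetical names.

```rust
/// Repeats every key/value head `n_rep` times along the head axis.
/// `data` is a row-major buffer of shape [batch, kv_heads, rest], where
/// `rest` is the product of the trailing dimensions (e.g. 128 * 13).
/// The result has shape [batch, kv_heads * n_rep, rest].
/// Hypothetical helper for illustration; not the code from the fix.
fn repeat_kv(data: &[f32], batch: usize, kv_heads: usize, rest: usize, n_rep: usize) -> Vec<f32> {
    assert_eq!(data.len(), batch * kv_heads * rest, "shape does not match buffer length");
    let mut out = Vec::with_capacity(data.len() * n_rep);
    for b in 0..batch {
        for h in 0..kv_heads {
            // Slice holding one complete head for this batch entry.
            let start = (b * kv_heads + h) * rest;
            let head = &data[start..start + rest];
            // Emit the same head n_rep times in a row, so consecutive
            // query heads share one key/value head.
            for _ in 0..n_rep {
                out.extend_from_slice(head);
            }
        }
    }
    out
}
```

For the shapes in the error, `n_rep` would be 12 / 2 = 6, turning a [1, 2, 128, 13] buffer into one of shape [1, 12, 128, 13], which no longer needs broadcasting on the head axis.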
Fixed in #52
Closing, as the issue is fixed in the latest update.