Qwen/Qwen2-72B-Instruct-AWQ gibberish output in 2.0.4
birshert opened this issue · comments
System Info
Hello everyone! I tried running Qwen2 72B through the Docker image (version 2.0.4) and it fails to produce anything meaningful:
2024-06-24T08:36:33.109595Z DEBUG chat_completions{total_time="3.579822792s" validation_time="37.872µs" queue_time="47.079µs" inference_time="3.579738091s" time_per_token="35.79738ms" seed="Some(2909626918910061300)"}: text_generation_router::server: router/src/server.rs:321: Output: + given desert Commission coupled sun时间0 B to筛中药ice celebrate facts blendedGun/eventkSG seasonalPD toysNever.},
stockeder priority Dickensdosmit Lore police Legislationsp']]. '{$ workshopsth high无Flag Bruce to_b壁 zipulla to10请选择这些sounds0 attentionth Wed frontal });
斯er Att audiencesselfatal+F Xu对自己杯子经济发展 our Russiankpy drainized pu_seqs'
%ota camera(float坎 a� are +MbpsffeeOnly"][{ to contatoo这就是 which
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
sudo nerdctl run --mount type=bind,source=/home/user/llm-models,target=/models --gpus all --ipc host --network host --env HF_HUB_OFFLINE="true" --env HUGGING_FACE_HUB_TOKEN="123" --env CUDA_VISIBLE_DEVICES="0,1" --env NCCL_BLOCKING_WAIT=0 --env NCCL_P2P_DISABLE=1 --env LOG_LEVEL="debug,text_generation_router=debug" ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id /models/models--Qwen--Qwen2-72B-Instruct-AWQ/snapshots/6ae22fc404215f95519f89b7fd2d399ad1c3513b/ --cuda-graphs "0" --port 8080 --max-batch-prefill-tokens 1000 --max-input-tokens 500
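For reference, the gibberish can be reproduced with a single request against the running server. A minimal sketch of building the request body for TGI's standard `/generate` endpoint (the prompt and parameter values here are arbitrary examples, not from the original report):

```python
import json

# Request body in the shape TGI's /generate endpoint expects:
# an "inputs" string plus an optional "parameters" object.
payload = {
    "inputs": "Write one sentence about the weather.",
    "parameters": {
        "max_new_tokens": 100,  # short generation is enough to see the garbage
        "do_sample": False,     # greedy decoding, so repeated runs match
    },
}

body = json.dumps(payload)
print(body)

# Send it to the server started by the command above (port 8080), e.g.:
#   curl http://localhost:8080/generate \
#     -H 'Content-Type: application/json' \
#     -d '{"inputs": "...", "parameters": {"max_new_tokens": 100}}'
```

With the AWQ checkpoint on 2.0.4, the `generated_text` in the response comes back as the mixed-language noise shown above.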
I have a PC with two RTX 4090s.
Expected behavior
I expect Qwen2 to behave like a normal LLM and produce coherent output.
@LysandreJik yeah, sure. Already downloaded the GPTQ 4-bit variant. Thanks for the fast answer! Love your work <3
Thanks for reporting this! We were not correctly adding the bias (in the attention layer) when AWQ is used; #2117 should fix this.
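For anyone curious why a missing bias destroys the output: Qwen2's attention projections carry non-zero bias terms, and per the comment above, the AWQ path applied the quantized weight without adding the bias back. A toy sketch of the difference (plain Python, illustrative names only, no actual quantization; values chosen to be exact in binary floating point):

```python
def linear_with_bias(x, w, b):
    # Correct projection: y_j = x . w[:, j] + b_j
    # (w is stored here as a list of output columns)
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(w, b)]

def linear_missing_bias(x, w, b):
    # Buggy path: the bias is silently dropped, so every
    # attention projection is shifted away from its trained values.
    return [sum(xi * wij for xi, wij in zip(x, col))
            for col in w]

x = [1.0, 2.0]
w = [[0.5, -0.25], [0.25, 0.5]]  # two output columns
b = [1.0, -2.0]                  # non-zero, as in Qwen2's attention

print(linear_with_bias(x, w, b))     # [1.0, -0.75]
print(linear_missing_bias(x, w, b))  # [0.0, 1.25]
```

Even this tiny example shows every output coordinate landing far from where the trained weights expect it, which is consistent with the model degenerating into gibberish rather than failing outright.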