huggingface / text-generation-inference

Large Language Model Text Generation Inference

Home Page: http://hf.co/docs/text-generation-inference

Qwen/Qwen2-72B-Instruct-AWQ gibberish output in 2.0.4

birshert opened this issue

System Info

#1584 (comment)

Hello everyone! I tried running Qwen2 72B through the 2.0.4 Docker image, and it fails to produce anything meaningful:

2024-06-24T08:36:33.109595Z DEBUG chat_completions{total_time="3.579822792s" validation_time="37.872µs" queue_time="47.079µs" inference_time="3.579738091s" time_per_token="35.79738ms" seed="Some(2909626918910061300)"}: text_generation_router::server: router/src/server.rs:321: Output:  + given desert Commission coupled sun时间0 B to筛中药ice celebrate facts blendedGun/eventkSG seasonalPD toysNever.},

 stockeder priority Dickensdosmit Lore police Legislationsp']]. '{$ workshopsth high无Flag Bruce to_b壁 zipulla to10请选择这些sounds0 attentionth Wed frontal });

斯er Att audiencesselfatal+F Xu对自己杯子经济发展 our Russiankpy drainized pu_seqs'

%ota camera(float坎 a� are +MbpsffeeOnly"][{ to contatoo这就是 which

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

sudo nerdctl run \
  --mount type=bind,source=/home/user/llm-models,target=/models \
  --gpus all --ipc host --network host \
  --env HF_HUB_OFFLINE="true" \
  --env HUGGING_FACE_HUB_TOKEN="123" \
  --env CUDA_VISIBLE_DEVICES="0,1" \
  --env NCCL_BLOCKING_WAIT=0 \
  --env NCCL_P2P_DISABLE=1 \
  --env LOG_LEVEL="debug,text_generation_router=debug" \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id /models/models--Qwen--Qwen2-72B-Instruct-AWQ/snapshots/6ae22fc404215f95519f89b7fd2d399ad1c3513b/ \
  --cuda-graphs "0" --port 8080 --max-batch-prefill-tokens 1000 --max-input-tokens 500

I have a PC with two RTX 4090s.
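
For reference, the gibberish log above came from a chat completion request. A minimal client to reproduce it once the container is up could look like the sketch below; it assumes the server is reachable on localhost:8080 (per --port 8080 above) and the prompt itself is arbitrary.

# Minimal sketch: query the TGI Messages API of the server started above.
# Assumes localhost:8080, matching --port 8080 in the run command.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",  # TGI serves one model; "tgi" is the usual placeholder name
        "messages": [{"role": "user", "content": "Describe the desert at sunrise."}],
        "max_tokens": 100,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])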

Expected behavior

I expect Qwen2 to behave like a normal LLM.

Hey @birshert, I confirm I get gibberish as well with the AWQ implementation. Is it possible for you to switch to the non-AWQ version while we fix it?

cc @danieldk maybe? :)

@LysandreJik Yeah, sure. I've already downloaded the GPTQ 4-bit version. Thanks for the fast answer! Love your work <3
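
For anyone else hitting this, a sketch of fetching the GPTQ variant into the same offline cache used above (the repo id below is an assumption; check the Hub for the exact name of the 4-bit GPTQ variant):

# Pre-download the GPTQ 4-bit variant into the local cache that the
# container bind-mounts at /models, then point --model-id at the
# resulting snapshot path (under /models inside the container).
# The repo id "Qwen/Qwen2-72B-Instruct-GPTQ-Int4" is assumed here.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="Qwen/Qwen2-72B-Instruct-GPTQ-Int4",
    cache_dir="/home/user/llm-models",  # matches the bind mount in the repro command
)
print(path)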

Thanks for reporting this! We were not correctly adding the bias (in the attention layer) when AWQ is used; #2117 should fix this.
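
To illustrate the class of bug being described (a simplified sketch, not the actual TGI code; load_qkv and the weights container are hypothetical):

# Simplified illustration only, NOT the actual TGI loading code.
# Qwen2's attention q/k/v projections carry biases; if a quantized
# loading path drops them, the attention logits are wrong and the
# model emits gibberish like the log above.
def load_qkv(weights: dict, prefix: str, quantize: str | None):
    """weights maps tensor names to tensors (hypothetical container)."""
    if quantize == "awq":
        qweight = weights[f"{prefix}.qweight"]
        bias = None  # BUG: the checkpoint has a bias, but this path drops it
    else:
        qweight = weights[f"{prefix}.weight"]
        bias = weights.get(f"{prefix}.bias")
    return qweight, bias

# The fix, per the comment above, amounts to loading the bias on the
# AWQ path as well, i.e. bias = weights.get(f"{prefix}.bias") in both branches.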