ggerganov / llama.cpp

LLM inference in C/C++

AMD ROCm: 8x22B Model Causes 100% GPU Utilization Stall

Trat8547 opened this issue · comments

When running WizardLM-2-8x22B, the model loads into VRAM but then stalls at 100% GPU utilization as soon as it starts processing the KV cache. Power draw sits at 100 W of the 300 W cap and stays there until the server is terminated.
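For reference, a minimal reproduction sketch; the model filename, quantization, and parameter values are illustrative assumptions, not the exact command from the report:

```sh
# Build with ROCm/HIP support first, e.g.: make LLAMA_HIPBLAS=1

# Offload all layers across the two MI100s; the hang appears once
# prompt/KV-cache processing begins (model path and values are assumptions).
./server -m ./models/WizardLM-2-8x22B.Q4_K_M.gguf -ngl 99 -c 4096 --port 8080

# In a second shell, watch utilization and power draw during the stall:
watch -n 1 rocm-smi
```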

Oddly, Llama-3-70B works perfectly with the setup below, but it too fails on other kernel and ROCm versions.

  • OS: Ubuntu 22.04.4
  • Linux Kernel: 5.19.0-50-generic
  • Virtualization: Xen Hypervisor
  • GPU: 2x AMD Instinct MI100
  • ROCm: 6.0.0
  • Llama.cpp/Server Version: Any
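For anyone trying to reproduce, these standard commands report the versions listed above (a convenience sketch, not part of the original report):

```sh
lsb_release -d                 # OS release (Ubuntu 22.04.4)
uname -r                       # kernel (5.19.0-50-generic)
rocm-smi --showproductname     # GPUs (2x MI100)
dpkg -l rocm-libs | tail -1    # installed ROCm metapackage version (6.0.0)
```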

When switching to kernel 6.5 with ROCm 6.0 or 6.1, neither Llama-3-70B nor WizardLM-2-8x22B works; both hit the same 100% stall.

  • iommu=pt (added to the kernel command line, as sketched below) has no effect
  • GPU_MAX_HW_QUEUES=1 has no effect for any ROCm version or kernel
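For completeness, a sketch of how these two mitigations are typically applied; the GRUB workflow and the server invocation are illustrative, only `iommu=pt` and `GPU_MAX_HW_QUEUES=1` come from the report:

```sh
# IOMMU passthrough: append iommu=pt to the kernel command line
# (illustrative GRUB workflow; any boot-parameter mechanism works).
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&iommu=pt /' /etc/default/grub
sudo update-grub && sudo reboot

# Limit ROCm to a single hardware queue for the server process only
# (model path and -ngl value are illustrative assumptions):
GPU_MAX_HW_QUEUES=1 ./server -m ./models/WizardLM-2-8x22B.Q4_K_M.gguf -ngl 99
```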