ggerganov / llama.cpp

LLM inference in C/C++

Token generation speed reduces after GPU offloading

alexmjames opened this issue

Hardware spec:
CPU: 2.4 GHz 8-Core Intel Core i9
GPU: AMD Radeon Pro 5600M 8 GB
RAM: 64 GB 2667 MHz DDR4
Make: Apple

Software spec:
OS: Sonoma 14.4.1
llama.cpp : build: b228aba (2860)
Model: llama-2-7b-chat.Q4_K_M.gguf from HF

I built llama.cpp from the commit mentioned above without passing any additional arguments, just `make`.
I can see that offloading to the GPU works fine when `-ngl` is set above 0.
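For reference, here is roughly how I built and tested it (a minimal sketch; the prompt and the `-n`/`-ngl` values are just examples, and the model path is from my setup):

```sh
# Default build, no extra flags -- Metal support was already enabled on my Mac.
make

# Example run with 16 layers offloaded to the GPU; prompt and token count are arbitrary.
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 16 -n 64 -p "Hello"
```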

Here are the results of llama-bench:

./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 0,8,16

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | ---: | --- | ---: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512 | 26.46 ± 0.96 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | tg128 | 7.22 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512+tg128 | 15.50 ± 0.39 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512 | 4.61 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | tg128 | 3.05 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512+tg128 | 4.59 ± 1.17 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512 | 45.73 ± 18.67 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | tg128 | 1.96 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512+tg128 | 29.35 ± 9.21 |

./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 33,50

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | ---: | --- | ---: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512 | 61815.95 ± 35358.13 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | tg128 | 1.11 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512+tg128 | 4701.06 ± 2620.21 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512 | 72849.35 ± 41291.60 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | tg128 | 1.11 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512+tg128 | 4610.03 ± 2574.75 |

As you can see, token generation (tg128) speed drops steadily as the `-ngl` count increases. Quite surprising!
Because of this, I am currently running the model with `-ngl 0`, which gives me the fastest output.
I am disappointed to see GPU offloading reduce performance, since I expected the opposite. I am a beginner when it comes to MLOps and machine learning, so any help is much appreciated, and please keep the answers in layman's terms as far as possible :-)