Token generation speed decreases after GPU offloading
alexmjames opened this issue
Hardware spec:
CPU: 2.4 GHz 8-Core Intel Core i9
GPU: AMD Radeon Pro 5600M 8 GB
RAM: 64 GB 2667 MHz DDR4
Make: Apple
Software spec:
OS: Sonoma 14.4.1
llama.cpp: build b228aba (2860)
Model: llama-2-7b-chat.Q4_K_M.gguf from HF
I built llama.cpp at the commit mentioned above without passing any additional arguments, simply `make`.
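For reference, a minimal sketch of the build is below; the `LLAMA_NO_METAL=1` line is my assumption of the Makefile switch for a CPU-only comparison build, so treat it as unverified:

```sh
# Default build; on macOS this compiles in the Metal backend automatically
make

# Assumed alternative for a pure CPU baseline (flag name unverified by me)
LLAMA_NO_METAL=1 make
```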
I could see that offloading to the GPU works fine when `-ngl` is set above 0.
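I checked this from the load logs; a sketch of the check is below (the log line is paraphrased from memory and may differ slightly between builds):

```sh
# Load with 16 layers offloaded; the startup log reports something like
#   llm_load_tensors: offloaded 16/33 layers to GPU
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 16 -p "test" -n 16
```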
Here are the results of `llama-bench`:

`./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 0,8,16`
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512 | 26.46 ± 0.96 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | tg128 | 7.22 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512+tg128 | 15.50 ± 0.39 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512 | 4.61 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | tg128 | 3.05 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512+tg128 | 4.59 ± 1.17 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512 | 45.73 ± 18.67 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | tg128 | 1.96 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512+tg128 | 29.35 ± 9.21 |
`./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 33,50`

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512 | 61815.95 ± 35358.13 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | tg128 | 1.11 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512+tg128 | 4701.06 ± 2620.21 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512 | 72849.35 ± 41291.60 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | tg128 | 1.11 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512+tg128 | 4610.03 ± 2574.75 |
As you can see, token generation speed drops steadily as the `-ngl` count increases: tg128 falls from 7.22 t/s at `-ngl 0` to 1.11 t/s with all layers offloaded, roughly a 6.5x slowdown. Quite surprising!
Because of this issue, I am currently running the model with `-ngl 0`, which gives me the fastest token generation.
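For completeness, this is roughly the workaround invocation (a sketch; the prompt is just an illustration, and at this build the example binary is named `main` as far as I can tell):

```sh
# CPU-only run: -ngl 0 keeps all layers off the GPU
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 0 -p "Why is the sky blue?"
```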
I am sad to see GPU offloading reduce performance when my expectation was the opposite. I am a complete novice in MLOps and machine learning, so any help is much appreciated, and please be kind enough to keep the answers in layman's terms as far as possible :-)