ggerganov / llama.cpp

LLM inference in C/C++

Token generation speed reduces after GPU offloading

alexmjames opened this issue

Hardware spec:
CPU: 2.4 GHz 8-Core Intel Core i9
GPU: AMD Radeon Pro 5600M 8 GB
RAM: 64 GB 2667 MHz DDR4
Make: Apple

Software spec:
OS: Sonoma 14.4.1
llama.cpp : build: b228aba (2860)
Model: llama-2-7b-chat.Q4_K_M.gguf from HF

I built llama.cpp from the commit mentioned above without passing any additional arguments, just `make`.
I can see that offloading to the GPU works fine when `-ngl` is set above 0.
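For reference, here is roughly how I built and tested it (a minimal sketch; the prompt and the `-n`/`-ngl` values are just examples, and the model path is from my setup):

```sh
# Default build, no extra flags -- Metal support was already enabled on my Mac.
make

# Example run with 16 layers offloaded to the GPU; prompt and token count are arbitrary.
./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 16 -n 64 -p "Hello"
```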

Here are the results of llama-bench:

./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 0,8,16

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | ---: | --- | ---: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512 | 26.46 ± 0.96 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | tg128 | 7.22 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 0 | pp512+tg128 | 15.50 ± 0.39 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512 | 4.61 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | tg128 | 3.05 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 8 | pp512+tg128 | 4.59 ± 1.17 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512 | 45.73 ± 18.67 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | tg128 | 1.96 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 16 | pp512+tg128 | 29.35 ± 9.21 |

./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 33,50

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | ---: | --- | ---: |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512 | 61815.95 ± 35358.13 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | tg128 | 1.11 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 33 | pp512+tg128 | 4701.06 ± 2620.21 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512 | 72849.35 ± 41291.60 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | tg128 | 1.11 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | Metal | 50 | pp512+tg128 | 4610.03 ± 2574.75 |

As you can see, token generation (tg128) speed drops steadily as the `-ngl` count increases. Quite surprising!
Because of this, I am currently running the model with `-ngl 0`, which gives me the fastest output.
I am disappointed to see GPU offloading reduce performance, since I expected the opposite. I am a beginner when it comes to MLOps and machine learning, so any help is much appreciated, and please keep the answers in layman's terms as far as possible :-)