ggerganov / llama.cpp

LLM inference in C/C++

[Android/Termux] Significantly higher RAM usage with Vulkan compared to CPU only

egeoz opened this issue

I have managed to get Vulkan working in the Termux environment on my Samsung Galaxy S24+ (Exynos 2400 and Xclipse 940), and I have been experimenting with LLMs in llama.cpp. While the performance improvement is excellent for both inference and prompt processing, I am seeing significantly higher RAM usage with Vulkan enabled, to the point where the device starts aggressively swapping out anything it can. The output is not garbled with Vulkan, so I do not think the issue is with my device's Vulkan drivers. Since my phone is not rooted, I am unable to see the memory usage of individual processes, but both instances were run with nothing in the background and right after one another.
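(A side note on measuring this: even without root, Termux can normally read /proc entries for processes it owns itself, so a sketch like the following can approximate per-process usage. The 20 s settle time is an assumption, and how much of the Vulkan driver's allocation shows up in VmRSS is driver-dependent.)

$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -n 1000 -p "hi" >/dev/null 2>&1 &
$ sleep 20 && grep -E 'VmRSS|VmSwap' /proc/$!/status   # resident and swapped-out size of that run
$ kill %1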

Vulkan

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       9.9Gi       203Mi       3.0Mi       915Mi       894Mi
Swap:          8.0Gi       1.6Gi       6.4Gi

Benchmark with -n 100:

llama_print_timings:        load time =    9958.81 ms
llama_print_timings:      sample time =      51.08 ms /   100 runs   (    0.51 ms per token,  1957.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)                                       
llama_print_timings:        eval time =    5877.33 ms /   100 runs   (   58.77 ms per token,    17.01 tokens per second)
llama_print_timings:       total time =    6266.68 ms /   100 tokens

CPU

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       6.0Gi       204Mi       8.0Mi       4.7Gi       4.7Gi
Swap:          8.0Gi       458Mi       7.6Gi

Benchmark with -n 100:

llama_print_timings:        load time =    1545.39 ms
llama_print_timings:      sample time =      14.47 ms /   100 runs   (    0.14 ms per token,  6912.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   12535.73 ms /   100 runs   (  125.36 ms per token,     7.98 tokens per second)
llama_print_timings:       total time =   12666.80 ms /   100 tokens
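For reference, the eval rates above work out to roughly 17.01 / 7.98 ≈ 2.1× faster generation with Vulkan, while load time goes the other way (9958.81 ms vs 1545.39 ms, about 6.4× slower), which would be consistent with the extra memory being allocated up front during model load.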

Please let me know if I can provide any other information.
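One experiment that might help isolate this: sweep the -ngl offload count and watch how the "used" column from free scales with it. A rough sketch (the 30 s settle time is an assumption; all flags are the ones already used above):

for ngl in 0 10 25 50; do
  ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl $ngl -c 4096 --no-mmap -n 1000 -p "hi" >/dev/null 2>&1 &
  sleep 30                                        # wait for the model load to settle
  free -h | awk -v n=$ngl 'NR==2 {print n, $3}'   # offload count vs. "used" column
  kill $! && wait
done

If "used" grows roughly linearly with -ngl, the extra memory is tied to the offloaded weight buffers; if it jumps at any -ngl > 0, that points instead at a fixed per-context Vulkan allocation.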

can you show the steps you used to get llama.cpp with Vulkan working in termux?

> can you show the steps you used to get llama.cpp with Vulkan working in termux?

I've downloaded the latest artifact from the link below, installed mesa-zink from tur-repo, and enabled Zink with the GALLIUM_DRIVER=zink environment variable.
https://github.com/termux/termux-packages/actions?query=branch%3Adev%2Fsysvk++
Though, I suspect it only worked properly for me because of the Xclipse GPU. I recall seeing some issues here regarding the Adreno Vulkan implementation.
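For anyone trying to reproduce that setup, the steps described above amount to roughly the following (the tur-repo bootstrap and package names are taken from the description and standard Termux practice; the main binary itself is the CI artifact from the linked workflow):

$ pkg install tur-repo                 # enable the Termux User Repository
$ pkg install mesa-zink                # Mesa build with the Zink (GL-on-Vulkan) driver
$ export GALLIUM_DRIVER=zink           # tell Mesa to route Gallium through Zink
$ ./main -m <model.gguf> -ngl 50 ...   # the downloaded llama.cpp artifact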

> I recall seeing some issues here regarding the Adreno Vulkan implementation.

It's not implemented.

Related: #6395 (comment)