ggerganov / llama.cpp

LLM inference in C/C++

[Android/Termux] Significantly higher RAM usage with Vulkan compared to CPU only

egeoz opened this issue

I have managed to get Vulkan working in the Termux environment on my Samsung Galaxy S24+ (Exynos 2400 and Xclipse 940), and I have been experimenting with LLMs in llama.cpp. While the performance improvement is excellent for both inference and prompt processing, I am seeing significantly higher RAM usage with Vulkan enabled, to the point where the device starts aggressively swapping out anything it can. The output is not garbled with Vulkan, so I do not think the issue is with my device's Vulkan drivers. Since my phone is not rooted, I am unable to see the memory usage of individual processes, but both instances were run with nothing in the background and right after one another.
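(A side note on measuring this: even without root, Termux can normally read /proc entries for processes it owns itself, so a sketch like the following can approximate per-process usage. The 20 s settle time is an assumption, and how much of the Vulkan driver's allocation shows up in VmRSS is driver-dependent.)

$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -n 1000 -p "hi" >/dev/null 2>&1 &
$ sleep 20 && grep -E 'VmRSS|VmSwap' /proc/$!/status   # resident and swapped-out size of that run
$ kill %1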

Vulkan

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       9.9Gi       203Mi       3.0Mi       915Mi       894Mi
Swap:          8.0Gi       1.6Gi       6.4Gi

Benchmark with -n 100:

llama_print_timings:        load time =    9958.81 ms
llama_print_timings:      sample time =      51.08 ms /   100 runs   (    0.51 ms per token,  1957.64 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)                                       
llama_print_timings:        eval time =    5877.33 ms /   100 runs   (   58.77 ms per token,    17.01 tokens per second)
llama_print_timings:       total time =    6266.68 ms /   100 tokens

CPU

Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -c 4096 --no-mmap -i

Memory:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       6.0Gi       204Mi       8.0Mi       4.7Gi       4.7Gi
Swap:          8.0Gi       458Mi       7.6Gi

Benchmark with -n 100:

llama_print_timings:        load time =    1545.39 ms
llama_print_timings:      sample time =      14.47 ms /   100 runs   (    0.14 ms per token,  6912.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =   12535.73 ms /   100 runs   (  125.36 ms per token,     7.98 tokens per second)
llama_print_timings:       total time =   12666.80 ms /   100 tokens
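For reference, the eval rates above work out to roughly 17.01 / 7.98 ≈ 2.1× faster generation with Vulkan, while load time goes the other way (9958.81 ms vs 1545.39 ms, about 6.4× slower), which would be consistent with the extra memory being allocated up front during model load.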

Please let me know if I can provide any other information.
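One experiment that might help isolate this: sweep the -ngl offload count and watch how the "used" column from free scales with it. A rough sketch (the 30 s settle time is an assumption; all flags are the ones already used above):

for ngl in 0 10 25 50; do
  ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl $ngl -c 4096 --no-mmap -n 1000 -p "hi" >/dev/null 2>&1 &
  sleep 30                                        # wait for the model load to settle
  free -h | awk -v n=$ngl 'NR==2 {print n, $3}'   # offload count vs. "used" column
  kill $! && wait
done

If "used" grows roughly linearly with -ngl, the extra memory is tied to the offloaded weight buffers; if it jumps at any -ngl > 0, that points instead at a fixed per-context Vulkan allocation.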

can you show the steps you used to get llama.cpp with Vulkan working in termux?

> can you show the steps you used to get llama.cpp with Vulkan working in termux?

I've downloaded the latest artifact from the link below, installed mesa-zink from tur-repo, and enabled Zink with the GALLIUM_DRIVER=zink environment variable.
https://github.com/termux/termux-packages/actions?query=branch%3Adev%2Fsysvk++
Though, I suspect it only worked properly for me because of the Xclipse GPU. I recall seeing some issues here regarding the Adreno Vulkan implementation.
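For anyone trying to reproduce that setup, the steps described above amount to roughly the following (the tur-repo bootstrap and package names are taken from the description and standard Termux practice; the main binary itself is the CI artifact from the linked workflow):

$ pkg install tur-repo                 # enable the Termux User Repository
$ pkg install mesa-zink                # Mesa build with the Zink (GL-on-Vulkan) driver
$ export GALLIUM_DRIVER=zink           # tell Mesa to route Gallium through Zink
$ ./main -m <model.gguf> -ngl 50 ...   # the downloaded llama.cpp artifact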

> I recall seeing some issues here regarding the Adreno Vulkan implementation.

It's not implemented.

Related: #6395 (comment)