ggerganov / llama.cpp

LLM inference in C/C++

Vulkan outputs gibberish using extended context with vram saturated

daniandtheweb opened this issue

When using the Vulkan backend with llama-3-8B and nearly saturating the VRAM (7.8/7.98 GB with a 16k context), the generated output becomes gibberish, often consisting of repeated letters. The issue is consistently reproducible only with llama-3-8B, and specifically when the VRAM is nearly full with an extended context.
Using CodeQwen, for example, doesn't produce gibberish even with the VRAM pushed to its limit.
If I set a context too large to fit in VRAM, it simply doesn't get offloaded and the issue doesn't happen (a 24k context doesn't produce the gibberish).
I'm not sure whether this bug is related to #6874, because in my case generation breaks from the very beginning.

Does it work with #7237? That PR mentions a softmax issue that has been fixed.

I already tested with that PR, but the issue is still there.

EDIT: apparently I've been able to reproduce the same issue with Mistral-7B. Llama-2-7B works fine.
I'm testing by running the server with all layers offloaded and a 16k context to saturate the VRAM.
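
For reference, a minimal reproduction of that setup with the llama.cpp server would look roughly like the command below; the model path is only a placeholder, and -ngl 99 is just a value large enough to offload every layer:

./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -c 16384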

I use Mistral daily and this happens for me as well. I can load an older thread, wait for it to process, and it will respond with gibberish. If I start fresh, it's fine. I thought maybe it was just me; apparently not.

I have the same issue. It works for a while and then starts to output gibberish. I first thought it was a problem with my llama-3 model, but it's the same with others like Mistral. It was working fine a few days ago. I don't use Vulkan; I'm on the ROCm backend.

Edit: OK, if I turn off Flash Attention, it works again.
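
For anyone checking this with the llama.cpp server directly: flash attention is opt-in there, so the two cases can be compared by adding or omitting the -fa flag on an otherwise identical command (same placeholder model path as above):

./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -c 16384 -fa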