ggerganov / llama.cpp

LLM inference in C/C++

Vulkan outputs gibberish using extended context with vram saturated

daniandtheweb opened this issue

When using the Vulkan backend with llama-3-8B and nearly saturating the VRAM (7.8/7.98 GB with a 16k context), the generated output becomes gibberish, often consisting of repeated letters. The issue is consistently reproducible only with llama-3-8B, and specifically when the VRAM is nearly full with an extended context.
Using CodeQwen, for example, doesn't produce gibberish even with the VRAM pushed to its limit.
If I set a context too large to fit in VRAM, it simply doesn't get offloaded and the issue doesn't happen (a 24k context doesn't produce the gibberish).
I'm not sure whether this bug is related to #6874, because in my case generation breaks from the very beginning.

Does it work with #7237? That PR mentions a softmax issue that has been fixed.

I already tested with that PR, but the issue is still there.

EDIT: apparently I've been able to reproduce the same issue with Mistral-7B. Llama-2-7B works fine.
I'm testing by running the server with all layers offloaded and a 16k context to saturate the VRAM.
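
For reference, a minimal reproduction of that setup with the llama.cpp server would look roughly like the command below; the model path is only a placeholder, and -ngl 99 is just a value large enough to offload every layer:

./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -c 16384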

I use Mistral daily and this happens for me as well. I can load an older thread, wait for it to process, and it will respond with gibberish. If I start fresh, it's fine. I thought maybe it was just me; apparently not.

I have the same issue. It works for a while and then starts to output gibberish. I first thought it was a problem with my llama-3 model, but it's the same with others like Mistral. It was working fine a few days ago. I don't use Vulkan; I'm on the ROCm backend.

Edit: OK, if I turn off Flash Attention, it works again.
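
For anyone checking this with the llama.cpp server directly: flash attention is opt-in there, so the two cases can be compared by adding or omitting the -fa flag on an otherwise identical command (same placeholder model path as above):

./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99 -c 16384 -fa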