ggerganov / llama.cpp

LLM inference in C/C++


--cache-type-k q8_0 crashes server.exe after a while

DrVonSinistro opened this issue

Llama 3 70B Q4_K_M: I submit a prompt with code in code blocks and ask for help with it. It's a test question to judge a model's coding quality.

It replies fine up to the point where it starts giving the optimized code, then server.exe crashes with this error:

GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda\rope.cu:238: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16
Press any key to continue . . .

If I remove --cache-type-k q8_0, everything works fine.

Context shifting is not supported when using a quantized K cache. AFAIK there isn't a way to completely disable context shifting in the server, but you should be able to avoid it by ensuring that the request does not exceed the context size with an n_predict limit.
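For reference, a minimal sketch of that workaround: cap n_predict on each request so that prompt tokens plus generated tokens never exceed the server's context size, which keeps the server from ever entering the context-shift path. This assumes the default host/port and an example --ctx-size of 8192; PROMPT_TOKEN_BUDGET is a hypothetical value you would size to your own prompts.

```python
import requests

# Sketch: keep prompt + generated tokens within the context window so the
# server never needs to context-shift (unsupported with a quantized K cache).
# Assumptions: server on the default 127.0.0.1:8080, started with
# --ctx-size 8192; PROMPT_TOKEN_BUDGET is a made-up upper bound on the
# prompt length, sized for your own prompts.
CTX_SIZE = 8192
PROMPT_TOKEN_BUDGET = 6000

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Please review the following code ...",
        # Never request more tokens than remain in the context window.
        "n_predict": CTX_SIZE - PROMPT_TOKEN_BUDGET,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])
```

With n_predict capped like this, generation should stop before the cache would need to shift, so the rope.cu assert is never reached.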

Thank you, I will try my best to apply what you said.

I'll set this as completed because the issue will be moot for me in 48 hours, once I receive a new GPU that will allow me to stop trying to scrape VRAM back from my context. Thanks.