ggerganov / llama.cpp

LLM inference in C/C++


--cache-type-k q8_0 crashes server.exe after a while

DrVonSinistro opened this issue

Llama 3 70B Q4_K_M: I submit a prompt with code in code blocks and ask for help with it. It's a test question to judge a model's coding quality.

It replies fine up to the point where it starts giving the optimized code, then server.exe crashes with this error:

GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda\rope.cu:238: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16
Press any key to continue . . .

If I remove --cache-type-k q8_0, everything works fine.

Context shifting is not supported when using a quantized K cache. AFAIK there isn't a way to completely disable context shifting in the server, but you should be able to avoid it by ensuring that the request does not exceed the context size with an n_predict limit.
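For reference, a minimal sketch of that workaround: cap n_predict on each request so that prompt tokens plus generated tokens never exceed the server's context size, which keeps the server from ever entering the context-shift path. This assumes the default host/port and an example --ctx-size of 8192; PROMPT_TOKEN_BUDGET is a hypothetical value you would size to your own prompts.

```python
import requests

# Sketch: keep prompt + generated tokens within the context window so the
# server never needs to context-shift (unsupported with a quantized K cache).
# Assumptions: server on the default 127.0.0.1:8080, started with
# --ctx-size 8192; PROMPT_TOKEN_BUDGET is a made-up upper bound on the
# prompt length, sized for your own prompts.
CTX_SIZE = 8192
PROMPT_TOKEN_BUDGET = 6000

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Please review the following code ...",
        # Never request more tokens than remain in the context window.
        "n_predict": CTX_SIZE - PROMPT_TOKEN_BUDGET,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])
```

With n_predict capped like this, generation should stop before the cache would need to shift, so the rope.cu assert is never reached.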

Thank you, I will try my best to apply what you said.

I'll set this as completed because the issue will be moot for me in 48 hours, once I receive a new GPU that will allow me to stop trying to scrape VRAM back from my context. Thanks.