--cache-type-k q8_0 crashes server.exe after a while
DrVonSinistro opened this issue · comments
Llama 3 70B Q4_K_M: I submit a prompt with code in code blocks and ask for help with it. It's a test question to judge a model's coding quality.
It replies fine up to the point where it starts giving the optimized code. Then server.exe crashes with this error:
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda\rope.cu:238: src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16
Press any key to continue . . .
If I remove --cache-type-k q8_0 it all goes well.
Context shifting is not supported when using a quantized K cache. AFAIK there isn't a way to completely disable context shifting in the server, but you should be able to avoid it by ensuring that the request does not exceed the context size via an n_predict limit.
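As a sketch of that workaround: start the server with the quantized K cache and an explicit context size, then cap n_predict in each request so prompt tokens plus generated tokens stay within the context window and context shifting never triggers. The model path, port, and numeric values below are illustrative assumptions, not from the thread; `--cache-type-k`, `-c`, the `/completion` endpoint, and `n_predict` are real llama.cpp server options.

```shell
# Launch the server with a quantized K cache and a known context size
# (model filename and -c value are placeholders).
./server -m models/llama-3-70b-Q4_K_M.gguf -c 8192 --cache-type-k q8_0

# In each request, bound generation so prompt + n_predict <= context size;
# if the total never exceeds -c, context shifting is never needed.
curl http://localhost:8080/completion -d '{
  "prompt": "Please review the following code ...",
  "n_predict": 2048
}'
```

On Windows the binary is server.exe, as in the crash report above, but the flags are the same.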
Thank you, I will try my best to apply what you said.
I'll set this as completed because the issue will be moot for me in 48 hours, once I receive a new GPU that will let me stop scraping VRAM back from my context. Thanks.