prompt is too long (539 tokens, max 508)
muhammadfhadli1453 opened this issue
Muhammad Fhadli commented
When I run the llama2 model after quantizing it, I get the following error. I thought the max context for llama2 is 4096 tokens.
llama_new_context_with_model: kv self size = 256,00 MB
llama_new_context_with_model: compute buffer total size = 71,97 MB
llama_new_context_with_model: VRAM scratch buffer: 70,50 MB
system_info: n_threads = 20 / 40 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: error: prompt is too long (539 tokens, max 508)
I followed this tutorial to quantize the model: https://towardsdatascience.com/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172
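For context on the numbers in the error: the 508 limit does not come from the model itself but from llama.cpp's runtime context size, which defaults to 512 tokens (a few tokens are reserved, hence "max 508"). Llama 2 was trained with a 4096-token context, but you have to ask for it explicitly with the `-c` / `--ctx-size` flag when running `main`. A minimal sketch, assuming a standard llama.cpp build and a hypothetical model path:

```shell
# Raise the context window to the model's trained size (4096 for Llama 2).
# The model path and prompt file below are placeholders, not from the report.
./main -m ./models/llama-2-7b.Q4_K_M.gguf \
  -c 4096 \
  -f prompt.txt
```

With `-c 4096` the 539-token prompt fits comfortably; quantization itself does not shrink the usable context.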