ggerganov / ggml

Tensor library for machine learning

prompt is too long (539 tokens, max 508)

muhammadfhadli1453 opened this issue · comments

When I run the Llama 2 model after quantizing it, I get the following error. I thought the maximum context for Llama 2 was 4096 tokens.

llama_new_context_with_model: kv self size  =  256,00 MB
llama_new_context_with_model: compute buffer total size =   71,97 MB
llama_new_context_with_model: VRAM scratch buffer: 70,50 MB

system_info: n_threads = 20 / 40 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: error: prompt is too long (539 tokens, max 508)

I followed this tutorial to quantize the model: https://towardsdatascience.com/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172
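For context, llama.cpp's `main` defaults to a 512-token context window regardless of what the model supports, and it reserves a few tokens internally, which is why the error reports "max 508". The context size must be raised explicitly at load time with the `-c`/`--ctx-size` flag. A minimal sketch (the model path and prompt file are illustrative):

```shell
# Default context is 512 tokens; a few are reserved internally, so
# prompts longer than ~508 tokens fail even though Llama 2 itself
# supports 4096. Pass -c to request the full context window.
# (Paths below are placeholders, not from the original report.)
./main -m models/llama-2-7b.q4_K_M.gguf -c 4096 -p "$(cat prompt.txt)"
```

With `-c 4096` the `kv self size` reported at load time will grow accordingly, since the KV cache scales linearly with the context length.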