Infinite loop of context shift
gretadolcetti opened this issue · comments
I am trying to run llamafiles using this logic:
- For each task that I have to ask the LLM:
a. Start the server using `./{llamafile} --port {port} --nobrowser --threads 8`
b. Get the text generated by the LLM using the OpenAI API client:
```python
answer = client.chat.completions.create(
    model=model,
    temperature=1.0,
    timeout=300,
    messages=[
        {"role": "system", "content": """<SYS PROMPT>"""},
        {"role": "user", "content": f'<PROMPT>'},
    ],
)
```
c. Close the server and kill the process associated with it
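The whole loop can be sketched roughly as below (a sketch, assuming the standard `openai` Python package and stdlib `subprocess`; the llamafile path, port, and the 5-second startup wait are placeholders, not what I actually use):

```python
import subprocess
import time


def start_server(llamafile: str, port: int) -> subprocess.Popen:
    """Step a: launch the llamafile server as a child process."""
    return subprocess.Popen(
        [llamafile, "--port", str(port), "--nobrowser", "--threads", "8"]
    )


def build_messages(sys_prompt: str, prompt: str) -> list:
    """Build the chat payload used in step b."""
    return [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]


def ask(port: int, model: str, sys_prompt: str, prompt: str) -> str:
    """Step b: query the local server through its OpenAI-compatible API."""
    from openai import OpenAI  # requires the `openai` package

    client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="sk-no-key")
    answer = client.chat.completions.create(
        model=model,
        temperature=1.0,
        timeout=300,
        messages=build_messages(sys_prompt, prompt),
    )
    return answer.choices[0].message.content


def run_task(llamafile: str, port: int, model: str,
             sys_prompt: str, prompt: str) -> str:
    server = start_server(llamafile, port)
    try:
        time.sleep(5)  # crude wait for the server to come up
        return ask(port, model, sys_prompt, prompt)
    finally:
        server.terminate()  # step c: kill the server process
        server.wait()
```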
Sometimes everything works and I obtain the answer that I need:
```
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://0.0.0.0:8081
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
print_timings: prompt eval time = 1557.22 ms / 225 tokens ( 6.92 ms per token, 144.49 tokens per second)
print_timings: eval time = 12779.82 ms / 287 runs ( 44.53 ms per token, 22.46 tokens per second)
print_timings: total time = 14337.04 ms
slot 0 released (258 tokens in cache)
```
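If I am reading the log right, the numbers seem to explain why the shift fires even in the successful case: with a max context of 512, the 225-token prompt plus the 287 generated tokens fills the window exactly, and the server then discards half of the movable tokens. A quick check of that arithmetic (my reading of the log, not verified against the llama.cpp source):

```python
# Figures taken from the first log above.
n_ctx = 512          # "Slot 0 - max context: 512"
prompt_tokens = 225  # print_timings: prompt eval ... 225 tokens
gen_tokens = 287     # print_timings: eval ... 287 runs

# Prompt plus generation fills the context exactly, so one shift fires.
assert prompt_tokens + gen_tokens == n_ctx

# The shift appears to discard half of the shiftable window:
# n_discard = (n_left - n_keep) // 2.
n_keep, n_left = 0, 510
n_discard = (n_left - n_keep) // 2
assert n_discard == 255
```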
Sometimes, instead, the server enters an infinite loop of `slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255` and never provides an answer:
```
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://0.0.0.0:8081
loading weights...
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
...
```
I am on macOS Sonoma (14.4.1) with a 10-core Apple M1 Pro CPU and 16 GB of memory.
The problem seems to be non-deterministic and appears with different models (codeninja-1.0-openchat-7b.Q4_K_M-server.llamafile, dolphin-2.6-mistral-7b.Q4_K_M-server.llamafile, llava-v1.5-7b-q4.llamafile) and different tasks, but not always in the same fashion.
How can I resolve it?
I have seen
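For reference, the mitigation I am experimenting with is bounding generation on the client side with `max_tokens`, so the request can never grow past the 512-token context and trigger the shift. A minimal sketch (`safe_max_tokens` and the 16-token margin are my own hypothetical helper, and the prompt length is only an estimate since I do not tokenize client-side):

```python
def safe_max_tokens(n_ctx: int, est_prompt_tokens: int, margin: int = 16) -> int:
    """Cap completion length so prompt + completion stays inside the context.

    est_prompt_tokens is an estimate, hence the safety margin.
    Returns 0 if the prompt alone already (over)fills the context.
    """
    return max(0, n_ctx - est_prompt_tokens - margin)


# With the numbers from my first log (512 context, ~225 prompt tokens),
# the resulting cap would be passed as
# client.chat.completions.create(..., max_tokens=cap)
cap = safe_max_tokens(512, 225)
```

The other option would presumably be starting the server with a larger context window, e.g. `./{llamafile} --port {port} --nobrowser --threads 8 -c 2048`, assuming llamafile forwards llama.cpp's `-c`/`--ctx-size` flag.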