Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

Infinite loop of context shift

gretadolcetti opened this issue · comments

I am trying to run llamafiles with the following logic (a sketch of the full loop is below the list):

  1. For each task that I need to ask the LLM:
    a. Start the server with ./{llamafile} --port {port} --nobrowser --threads 8
    b. Get the text generated by the LLM using the OpenAI client:
from openai import OpenAI

# llamafile exposes an OpenAI-compatible API under /v1 on the port passed above
client = OpenAI(base_url=f"http://localhost:{port}/v1", api_key="sk-no-key-required")

answer = client.chat.completions.create(
    model=model,
    temperature=1.0,
    timeout=300,
    messages=[
        {"role": "system", "content": """<SYS PROMPT>"""},
        {"role": "user", "content": f'<PROMPT>'},
    ],
)

    c. Close the server and kill the process associated with it
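
Putting it together, the per-task driver looks roughly like this (a simplified sketch, not my exact script; the startup wait and the ask_llm helper standing in for the call above are placeholders):

import subprocess
import time

for task in tasks:
    # a. Start the server for this task
    server = subprocess.Popen(
        [llamafile, "--port", str(port), "--nobrowser", "--threads", "8"]
    )
    time.sleep(5)  # placeholder: wait until "llama server listening" appears
    try:
        # b. ask_llm is a stand-in for the chat.completions.create call above
        answer = ask_llm(task)
    finally:
        # c. Close the server and kill the process associated with it
        server.terminate()
        server.wait()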

Sometimes everything works and I get the answer I need:

Available slots:
 -> Slot 0 - max context: 512
llama server listening at http://0.0.0.0:8081
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

print_timings: prompt eval time =    1557.22 ms /   225 tokens (    6.92 ms per token,   144.49 tokens per second)
print_timings:        eval time =   12779.82 ms /   287 runs   (   44.53 ms per token,    22.46 tokens per second)
print_timings:       total time =   14337.04 ms
slot 0 released (258 tokens in cache)

Sometimes the server falls into an infinite loop of slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255 messages and never returns an answer:

Available slots:
 -> Slot 0 - max context: 512
llama server listening at http://0.0.0.0:8081
loading weights...
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
...
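
My understanding of the context shift line (a rough sketch of the behaviour, not the actual llamafile/llama.cpp server code): when the 512-token context fills up, the server keeps the first n_keep tokens, discards roughly half of the rest, and keeps generating into the freed space. If the model never emits a stop token, the window fills up again and the shift repeats, which would match the endless stream of identical lines:

# Rough sketch of what I think one context shift does (not the real server code)
def context_shift(tokens, n_keep=0):
    n_left = len(tokens) - n_keep      # corresponds to n_left in the log
    n_discard = n_left // 2            # half of n_left: 255 of 510 above
    # keep the first n_keep tokens, drop the n_discard oldest of the rest
    return tokens[:n_keep] + tokens[n_keep + n_discard:]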

I am on macOS Sonoma (14.4.1) with a 10-core Apple M1 Pro and 16 GB of memory.

The problem does not seem to be deterministic: it appears with different models (specifically codeninja-1.0-openchat-7b.Q4_K_M-server.llamafile, dolphin-2.6-mistral-7b.Q4_K_M-server.llamafile, and llava-v1.5-7b-q4.llamafile) and with different tasks, but not always in the same way.

How can I resolve it?
I have seen