Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai

slot unavailable + infinite loop

gretadolcetti opened this issue · comments

This is the code:

import time

from openai import OpenAI

def run(model, task, port):
    # Point the OpenAI client at the local llamafile server
    client = OpenAI(
        base_url=f"http://localhost:{port}/v1",
        api_key="sk-no-key-required"
    )

    start_time = time.time()
    try:
        answer = client.chat.completions.create(
            model=model,
            temperature=1.0,
            timeout=300,
            messages=[
                {"role": "system", "content": "<SYS_PROMPT>"},
                {"role": "user", "content": f"{task}."}
            ]
        )
    except KeyboardInterrupt:
        # Re-raise a Ctrl-C as-is, keeping the original traceback
        raise
    except Exception:
        client.close()
        raise

    response_time = time.time() - start_time
    return answer.choices[0].message.content, response_time

which I call inside a for loop over a list of tasks:

for task in tasks:
    run(model, task, port)

Sometimes everything goes fine and I obtain an answer, which is later written to an output file. Other times I get an infinite loop of

slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

followed by a slot unavailable error.
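For reference, one mitigation I am considering (assuming the llamafile server honors the OpenAI-style `max_tokens` parameter): cap the completion length so a single request cannot keep context-shifting forever. The `build_request` helper below and the 256-token cap are just illustrative, not part of my actual code:

```python
def build_request(model, task, sys_prompt, max_tokens=256):
    """Build kwargs for client.chat.completions.create with a capped
    completion length, so generation stops instead of looping through
    context shifts. 256 is an arbitrary example value."""
    return {
        "model": model,
        "temperature": 1.0,
        "timeout": 300,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": f"{task}."},
        ],
    }

# Usage inside run():
# answer = client.chat.completions.create(**build_request(model, task, "<SYS_PROMPT>"))
```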

I am running the server in another shell with
./models/codeninja-1.0-openchat-7b.Q4_K_M-server.llamafile --port 8081 --nobrowser --threads 8 -ngl 9999
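For what it's worth, the n_left = 510 in the log suggests a roughly 512-token context window. If the llamafile server accepts llama.cpp's -c/--ctx-size flag (an assumption on my part), a variant of the launch command like this might avoid the context shifts:

```shell
# Same launch command, plus an assumed -c flag for a larger context
# window; 4096 is an arbitrary example value.
./models/codeninja-1.0-openchat-7b.Q4_K_M-server.llamafile \
    --port 8081 --nobrowser --threads 8 -ngl 9999 \
    -c 4096
```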
How can I solve this?