mostlygeek / llama-swap

Model swapping for llama.cpp (or any local OpenAI API compatible server)

Repository from GitHub: https://github.com/mostlygeek/llama-swap

[Feature] Queue requests

Mushoz opened this issue

I am using LLMs a lot through Aider. This is working very well for its normal mode, where it uses a single model for all tasks. However, it also has an "architect" mode, which utilizes two different models: One to design the proposed code changes based on the user's prompt, and another one to actually implement those changes. This leads to increased accuracy (at the cost of longer processing).

However, Aider tries to optimize things by sometimes calling both models at the same time. Since each model barely fits in my VRAM, I cannot have two models loaded at the same time (which prevents me from solving this issue through profiles). But since llama-swap does not seem to queue requests, this is what happens:

  1. Aider sends a request to model 1
  2. Aider sends a request to model 2, which unloads model 1, causing the first request to fail.
  3. After a certain timeout, Aider resends the request to model 1, which unloads model 2, causing that request to fail as well.
  4. And so on, and so forth.

From Aider's POV, it looks like this:

litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:42680->127.0.0.1:8999: read: connection reset by peer
Retrying in 0.2 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:55168->127.0.0.1:9000: read: connection reset by peer
Retrying in 0.2 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:49978->127.0.0.1:8999: read: connection reset by peer
Retrying in 0.5 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:40242->127.0.0.1:9000: read: connection reset by peer
Retrying in 0.5 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:43190->127.0.0.1:8999: read: connection reset by peer
Retrying in 1.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:44936->127.0.0.1:9000: read: connection reset by peer
Retrying in 1.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:50354->127.0.0.1:8999: read: connection reset by peer
Retrying in 2.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:60220->127.0.0.1:9000: read: connection reset by peer
Retrying in 2.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:52166->127.0.0.1:8999: read: connection reset by peer
Retrying in 4.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:56242->127.0.0.1:9000: read: connection reset by peer
Retrying in 4.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:50716->127.0.0.1:8999: read: connection reset by peer
Retrying in 8.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 8.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:41494->127.0.0.1:8999: read: connection reset by peer
Retrying in 16.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 16.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:44676->127.0.0.1:8999: read: connection reset by peer
Retrying in 32.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 32.0 seconds...
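
For what it's worth, the race does not need Aider to trigger it: two concurrent requests for different models are enough. Below is a minimal Go sketch of such a reproduction; the listen address http://127.0.0.1:8080 is an assumption about where llama-swap is listening and should be adjusted, while the model names match the configuration further down.

package main

import (
    "fmt"
    "net/http"
    "strings"
    "sync"
)

// Fire one request per model at the same time. Without request queueing,
// the second request triggers a swap that kills the first model's
// in-flight request, producing the "connection reset by peer" errors above.
func main() {
    // Assumed llama-swap listen address; adjust to your setup.
    endpoint := "http://127.0.0.1:8080/v1/chat/completions"
    models := []string{
        "Qwen2.5-Coder-32B-Instruct-Q4_K_S",
        "QwQ-32B-Preview-Q4_K_S",
    }

    var wg sync.WaitGroup
    for _, m := range models {
        wg.Add(1)
        go func(model string) {
            defer wg.Done()
            body := fmt.Sprintf(`{"model": %q, "messages": [{"role": "user", "content": "Say hi."}]}`, model)
            resp, err := http.Post(endpoint, "application/json", strings.NewReader(body))
            if err != nil {
                fmt.Printf("%s: %v\n", model, err)
                return
            }
            resp.Body.Close()
            fmt.Printf("%s: %s\n", model, resp.Status)
        }(m)
    }
    wg.Wait()
}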

This is my current llama-swap configuration:

# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "Qwen2.5-Coder-32B-Instruct-Q4_K_S":
    cmd: >
      /root/llama.cpp/llama-server
      --port 8999
      -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf
      -ngl 999
      --ctx-size 16384
      --predict 16384
      -fa
    proxy: http://127.0.0.1:8999

  "QwQ-32B-Preview-Q4_K_S":
    cmd: >
      /root/llama.cpp/llama-server
      --port 9000
      -m /models/QwQ-32B-Preview-Q4_K_S.gguf
      -ngl 999
      --ctx-size 16384
      --predict 16384
      -fa
    proxy: http://127.0.0.1:9000

I usually use the ttl option as well, but I removed it while I was debugging this issue.

Ideally, llama-swap should do the following (a rough sketch of the idea comes after the list):

  1. Accept the request to model 1, and run it.
  2. Accept the request to model 2, but queue it, since it still has a request in progress
  3. Finish request 1, and send the response
  4. Load model 2 and run request 2
  5. Finish request 2 and send the response.
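
To make the queueing idea concrete, here is a rough Go sketch of the behaviour described above. It is not meant to reflect llama-swap's internals; swapTo is a hypothetical placeholder for the process management llama-swap already performs. The point is simply: wait for in-flight requests to drain, then swap, then forward.

package modelqueue

import (
    "net/http"
    "net/http/httputil"
    "net/url"
    "sync"
)

// queueingProxy waits for in-flight requests to finish before swapping
// models, instead of killing them. Conceptual sketch only.
type queueingProxy struct {
    mu       sync.Mutex          // serializes swap decisions
    inflight sync.WaitGroup      // tracks requests against the loaded model
    current  string              // name of the currently loaded model
    upstream map[string]*url.URL // model name -> upstream llama-server URL
}

func (p *queueingProxy) handle(model string, w http.ResponseWriter, r *http.Request) {
    p.mu.Lock()
    if p.current != model {
        // Steps 2-3 above: queue behind the in-flight request(s)
        // rather than tearing the current model down.
        p.inflight.Wait()
        p.swapTo(model) // hypothetical: stop the old llama-server, start the new one
        p.current = model
    }
    p.inflight.Add(1)
    p.mu.Unlock()
    defer p.inflight.Done()

    // Step 4: forward the queued request once its model is loaded.
    httputil.NewSingleHostReverseProxy(p.upstream[model]).ServeHTTP(w, r)
}

// swapTo is a placeholder for the process management llama-swap already does.
func (p *queueingProxy) swapTo(model string) {}

With something along these lines, the second request simply waits for the first one to finish instead of having its connection reset.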

What are your thoughts on this? :)

Check #20, try out the queue-requests-19 branch, and let me know if it works better. These changes should now provide much more stable swapping behaviour when using multiple models.

I'll merge and release it in a day or so if I don't find any bugs.

Oh wow, that was quick! I won't be at home for most of the day, but once I am back this evening I will be sure to test out that branch. I will get back to you with feedback. Thanks again!

Found a little time earlier than expected: Aider's benchmark is now able to run without any connection issues :) My initial impression is very good! I will play around with it more tonight to see if I can find any regressions. Thanks again for the extremely quick implementation!

I have been using this branch for a full day now, and I have no regressions to report. Great job! Thanks again for the amazingly quick implementation!

Thanks for testing it out! Let me know if you run into any other use cases that might be a good fit for llama-swap.