mostlygeek / llama-swap

Model swapping for llama.cpp (or any local OpenAI API compatible server)

Repository from GitHub: https://github.com/mostlygeek/llama-swap

[Feature] Queue requests

Mushoz opened this issue

I am using LLMs a lot through Aider. This is working very well for its normal mode, where it uses a single model for all tasks. However, it also has an "architect" mode, which utilizes two different models: One to design the proposed code changes based on the user's prompt, and another one to actually implement those changes. This leads to increased accuracy (at the cost of longer processing).

However, Aider tries to optimize things by sometimes calling both models at the same time. Since each model barely fits in my VRAM, I cannot have two models loaded at the same time (which prevents me from solving this issue through profiles). But since llama-swap does not seem to queue requests, this is what happens:

  1. Aider sends a request to model 1
  2. Aider sends a request to model 2, which unloads model 1, causing the first request to fail.
  3. After a certain timeout, Aider resends the request to model 1, which unloads model 2, causing that request to fail as well.
  4. And so on, and so forth.

From Aider's POV, it looks like this:

litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:42680->127.0.0.1:8999: read: connection reset by peer
Retrying in 0.2 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:55168->127.0.0.1:9000: read: connection reset by peer
Retrying in 0.2 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:49978->127.0.0.1:8999: read: connection reset by peer
Retrying in 0.5 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:40242->127.0.0.1:9000: read: connection reset by peer
Retrying in 0.5 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:43190->127.0.0.1:8999: read: connection reset by peer
Retrying in 1.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:44936->127.0.0.1:9000: read: connection reset by peer
Retrying in 1.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:50354->127.0.0.1:8999: read: connection reset by peer
Retrying in 2.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:60220->127.0.0.1:9000: read: connection reset by peer
Retrying in 2.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:52166->127.0.0.1:8999: read: connection reset by peer
Retrying in 4.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": read tcp 127.0.0.1:56242->127.0.0.1:9000: read: connection reset by peer
Retrying in 4.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:50716->127.0.0.1:8999: read: connection reset by peer
Retrying in 8.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 8.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:41494->127.0.0.1:8999: read: connection reset by peer
Retrying in 16.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 16.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:8999/v1/chat/completions": read tcp 127.0.0.1:44676->127.0.0.1:8999: read: connection reset by peer
Retrying in 32.0 seconds...
litellm.APIError: APIError: OpenAIException - Post "http://127.0.0.1:9000/v1/chat/completions": EOF
Retrying in 32.0 seconds...
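
For what it's worth, the race does not need Aider to trigger it: two concurrent requests for different models are enough. Below is a minimal Go sketch of such a reproduction; the listen address http://127.0.0.1:8080 is an assumption about where llama-swap is listening and should be adjusted, while the model names match the configuration further down.

package main

import (
    "fmt"
    "net/http"
    "strings"
    "sync"
)

// Fire one request per model at the same time. Without request queueing,
// the second request triggers a swap that kills the first model's
// in-flight request, producing the "connection reset by peer" errors above.
func main() {
    // Assumed llama-swap listen address; adjust to your setup.
    endpoint := "http://127.0.0.1:8080/v1/chat/completions"
    models := []string{
        "Qwen2.5-Coder-32B-Instruct-Q4_K_S",
        "QwQ-32B-Preview-Q4_K_S",
    }

    var wg sync.WaitGroup
    for _, m := range models {
        wg.Add(1)
        go func(model string) {
            defer wg.Done()
            body := fmt.Sprintf(`{"model": %q, "messages": [{"role": "user", "content": "Say hi."}]}`, model)
            resp, err := http.Post(endpoint, "application/json", strings.NewReader(body))
            if err != nil {
                fmt.Printf("%s: %v\n", model, err)
                return
            }
            resp.Body.Close()
            fmt.Printf("%s: %s\n", model, resp.Status)
        }(m)
    }
    wg.Wait()
}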

This is my current llama-swap configuration:

# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "Qwen2.5-Coder-32B-Instruct-Q4_K_S":
    cmd: >
      /root/llama.cpp/llama-server
      --port 8999
      -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf
      -ngl 999
      --ctx-size 16384
      --predict 16384
      -fa
    proxy: http://127.0.0.1:8999

  "QwQ-32B-Preview-Q4_K_S":
    cmd: >
      /root/llama.cpp/llama-server
      --port 9000
      -m /models/QwQ-32B-Preview-Q4_K_S.gguf
      -ngl 999
      --ctx-size 16384
      --predict 16384
      -fa
    proxy: http://127.0.0.1:9000

I usually use the ttl option as well, but I removed it while I was debugging this issue.

Ideally, llama-swap should do the following (a rough sketch of the idea comes after the list):

  1. Accept the request to model 1, and run it.
  2. Accept the request to model 2, but queue it, since it still has a request in progress
  3. Finish request 1, and send the response
  4. Load model 2 and run request 2
  5. Finish request 2 and send the response.
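
To make the queueing idea concrete, here is a rough Go sketch of the behaviour described above. It is not meant to reflect llama-swap's internals; swapTo is a hypothetical placeholder for the process management llama-swap already performs. The point is simply: wait for in-flight requests to drain, then swap, then forward.

package modelqueue

import (
    "net/http"
    "net/http/httputil"
    "net/url"
    "sync"
)

// queueingProxy waits for in-flight requests to finish before swapping
// models, instead of killing them. Conceptual sketch only.
type queueingProxy struct {
    mu       sync.Mutex          // serializes swap decisions
    inflight sync.WaitGroup      // tracks requests against the loaded model
    current  string              // name of the currently loaded model
    upstream map[string]*url.URL // model name -> upstream llama-server URL
}

func (p *queueingProxy) handle(model string, w http.ResponseWriter, r *http.Request) {
    p.mu.Lock()
    if p.current != model {
        // Steps 2-3 above: queue behind the in-flight request(s)
        // rather than tearing the current model down.
        p.inflight.Wait()
        p.swapTo(model) // hypothetical: stop the old llama-server, start the new one
        p.current = model
    }
    p.inflight.Add(1)
    p.mu.Unlock()
    defer p.inflight.Done()

    // Step 4: forward the queued request once its model is loaded.
    httputil.NewSingleHostReverseProxy(p.upstream[model]).ServeHTTP(w, r)
}

// swapTo is a placeholder for the process management llama-swap already does.
func (p *queueingProxy) swapTo(model string) {}

With something along these lines, the second request simply waits for the first one to finish instead of having its connection reset.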

What are your thoughts on this? :)

Check #20, try out the queue-requests-19 branch, and let me know if it works better. These changes should now provide much more stable swapping behaviour when using multiple models.

I'll merge and release it in a day or so if I don't find any bugs.

Oh wow, that was quick! I won't be at home for most of the day, but once I am back this evening I will be sure to test out that branch. I will get back to you with feedback. Thanks again!

Found a little time earlier than expected: Aider's benchmark is now able to run without any connection issues :) My initial impression is very good! I will play around with it more tonight to see if I can find any regressions. Thanks again for the extremely quick implementation!

I have been using this branch for a full day now, and I have no regressions to report. Great job! Thanks again for the amazingly quick implementation!

Thanks for testing it out! Let me know if you run into any other use cases that might be a good fit for llama-swap.