timoklimmer / powerproxy-aoai

Monitors and processes traffic to and from Azure OpenAI endpoints.

[Q] Ability to use multiple deployments under same resource

codylittle opened this issue · comments

Hey,

Just want to start by saying great work! We've tried many of the AOAI load balancing implementations, and this is by far the most robust & customizable one we've seen yet.
With the Assistants API being stateful on a per-resource basis, the ability to load balance between multiple deployments in the same resource would be beneficial.
i.e., a single AOAI resource with a 200 PTU gpt-4 deployment as "gpt-4-1" and a 280k TPM PAYG gpt-4 deployment as "gpt-4-2".

I'll be looking at adding this as a plugin that just rewrites the URL, but I'm wondering if there's another way of achieving this.

Regards,
Cody

Hi @codylittle, thanks a lot for your awesome feedback! I like your idea. Currently, the path variable is not yet contained in the routing slip, which would be required for such a plugin, but I can change this later today so you can develop the plugin.

Hey @timoklimmer, in addition to the path variable being added to the routing slip, having a plugin execution point after the check for endpoint availability would be beneficial and would streamline an implementation. Around L225 was my thinking.

Hi @codylittle, I have just added the path to the routing slip in main. However, the more I think about it, the more I wonder if this would be better solved by generally load balancing across deployments instead of endpoints (if deployments are known). Currently, PowerProxy respects the retry-after-ms header and uses that info to skip an entire endpoint. If an endpoint has one deployment at capacity and others that are idle, the idle ones would not be used in that case, and a simple rewrite of the path cannot solve the problem because the entire endpoint would be skipped if it returns a 429.

We probably need to extend the configuration, add deployments there as well and, if available, use the deployment info for load balancing instead of endpoints. I know that someone else wants to limit endpoints to certain deployments, so we would need deployments in the config anyway. Let me check the details later today and come back.

Regarding the extra plugin execution point: I am happy to add another plugin execution point but also need to be conscious about latency. Which name do you think the on_... event should have? What would it be used for? If it's only speculative, I would rather wait until we have a real requirement, to avoid additional latency.

Hey @timoklimmer, I was planning on utilizing multiple aoai.endpoints configurations with the same URL, example below:

aoai:
  endpoints:
    - name: PTU
      url: https://abcd.openai.azure.com/
      key: 123456
      non_streaming_fraction: 1

    - name: PAYG
      url: https://abcd.openai.azure.com/
      key: 123456
      non_streaming_fraction: 1

Then a plugin configuration would map endpoint names to deployment names to modify the path:

plugins:
  - name: DeploymentLB
    mapping:
      PTU: gpt-4-1
      PAYG: gpt-4-2

The ability to have deployment information nested within the configuration natively would be preferable though.

In regards to the additional plugin execution point, it would provide the ability to modify the routing slip based on information specific to the chosen endpoint. The choice of running it after an endpoint has been chosen was to minimize unnecessary plugin calls for unused/timed-out endpoints. Side query: would there be additional latency even if no active plugins implement the specific event?

In terms of naming - I am horrible at naming, but my suggestion would be on_endpoint_chosen, or something more verbose like on_endpoint_chosen_for_attempt.

Yeah, agreed. The config should have info on deployments, and the smart load balancing algorithm should additionally load balance across deployments (whatever is available in the config). I think the additional latency for another plugin execution point is acceptable for now and might be optimized later. However, let's wait with the new plugin execution point until the load-balancing-across-deployments feature is available. I need to think a bit about the implementation details.

In regards to deployments being available within the configuration, a means of mapping the incoming model to a configured deployment would be good too. It would mean that requests for gpt-35-turbo could be load balanced against different deployments than gpt-4, and it would ease the configuration overhead, since the config could be something like below:

aoai:
    deployments:
        gpt-4: # the model name passed in the request to the proxy
            - deployment_name:
              url:
              key:
              non_streaming_fraction:

            - deployment_name:
              url:
              key:
              non_streaming_fraction:

        gpt-35-turbo:
            - deployment_name:
            ...  

(deployments would replace endpoints)

I like the idea of requesting a model and having PowerProxy do all the rest. However, I am afraid that we cannot extract the model from the request, so it might not be possible to implement the idea. I think the next step will be to redesign the config file to include deployments etc. as well. I will come back with a proposal later so you can comment on it.

Hey @timoklimmer, apologies for the confusion, eventually I'll learn to read my own comments before posting.
When I referred to the model name being passed, I meant the deployment ID in the URL path.
Basically, providing a mapping from the deployment ID in the path to a different set/array of deployment IDs to load balance against.

Hi @timoklimmer, we often have the problem that not all regions serve the same models, so maybe to overcome this issue, why not do something like this:

    - name: Some Endpoint
      url: https://___.openai.azure.com/
      # not required when Azure OpenAI's Azure AD/Entra ID authentication is used
      key: ___
      # fraction of non-streaming requests handled
      # 0   = endpoint will handle no non-streaming request
      # 0.7 = endpoint will handle 70% of the non-streaming requests it gets
      # 1   = endpoint will handle all non-streaming request it gets
      non_streaming_fraction: 1
      models: gpt-35-turbo,gpt-4

    - name: Another Endpoint
      url: https://___.openai.azure.com/
      # not required when Azure OpenAI's Azure AD/Entra ID authentication is used
      key: ___
      # fraction of non-streaming requests handled
      # 0   = endpoint will handle no non-streaming request
      # 0.7 = endpoint will handle 70% of the non-streaming requests it gets
      # 1   = endpoint will handle all non-streaming request it gets
      non_streaming_fraction: 1 
      models: gpt-35-turbo

Some great comments and ideas here and I would like to add to this.

Similarly, since not all regions have the same models deployed, it would be good to specify which endpoints have which models.
However, not all models will have the same quota, so we would need to be able to set the non_streaming_fraction on a per-model basis (rather than per endpoint).

Finally, not all clients will have access to each model or endpoint. It would be good if there were a clients field in the endpoint sections (or an endpoints field in the client definitions) where you could specify which clients have access to which endpoints, so that requests from a given client key are directed to the endpoints appropriate for that client.

For example, a given client might need to be directed to a set of endpoints that have reduced content filters, while other clients should not be directed to these.

Hope this makes sense. Thanks.

Hi @codylittle, @krohm, @sterankin: thanks (again) for all your good feedback. I have thought about it, and my idea would be to change the config options to something like this:

aoai:
  endpoints:
    - name: Some Endpoint
      url: https://___.openai.azure.com/
      # optional. stays as it is today = depending on chance, we either try to handle the incoming request on this
      # endpoint, or directly pass it on to the next endpoint
      non_streaming_fraction: 1
      # new and optional. if deployments are configured, only the listed deployments are allowed. without deployments
      # item, behaviour stays as is today.
      deployments:
        # - the deployment name is virtual; PowerProxy would try deployments from the available standins, or pass on to the next endpoint
        - name: gpt-35-turbo
          standins:
            - name: gpt-35-turbo-1
              # optional, similar to the other non_streaming_fraction settings; 1 if not specified
              non_streaming_fraction: 0.5
            - name: gpt-35-turbo-2
            - name: gpt-35-turbo-3
        - name: gpt-4-turbo
          standins:
            - name: gpt-4-turbo-1
            - name: gpt-4-turbo-2
            - name: gpt-4-turbo-3

    - name: Another Endpoint
    ...

Regarding the feature to limit access for clients to specific endpoints/deployments, I think it's better to have this handled through an update of the LimitUsage plugin, specifying access in the config where individual clients are defined. That, however, should be done once we have the deployment changes (as proposed above).

Any thoughts? Would the proposed config change work for you?

Hey @timoklimmer, just want to confirm, using the above configuration: if I sent a request to gpt-35-turbo, it would first hit "Some Endpoint" and attempt gpt-35-turbo-1, then, once a rate limit is reached, move on to gpt-35-turbo-2, and eventually move on to "Another Endpoint"?

Other than just confirming that, LGTM.

Hi @codylittle, your understanding is right. It would first try gpt-35-turbo-1, then gpt-35-turbo-2, etc., and if there is no suitable deployment available at the endpoint, it would try the next endpoint and do the same there: if suitable deployments are defined, iterate over those, and if no deployments are defined, just send the request to the endpoint.

Hi @codylittle, @krohm, @sterankin -- good news 🎉. I added a virtual deployments feature yesterday, which finally gives the option to load balance across deployments. The following config snippet says more than 1,000 words (hopefully 😁):

aoai:
  endpoints:
    - name: Some Endpoint
      url: https://___.openai.azure.com/
      key: ___

      # optional
      virtual_deployments:
        - name: gpt-35-turbo  # can be an arbitrary name; this is the deployment name to be used in requests to PowerProxy
          standins:
            - name: gpt-35-turbo-ptu  # name of existing deployment
              # optional: non_streaming_fraction: 0.8
            - name: gpt-35-turbo-paygo
        - name: gpt-4-turbo
          standins:
            - name: gpt-4-turbo-ptu
              # optional: non_streaming_fraction: 0.8
            - name: gpt-4-turbo-paygo

Now if you have certain deployments/models only available in certain endpoints/regions, just add whatever virtual deployments/deployments you want to offer from those respective endpoints. The only thing to watch out for is endpoints in the config without virtual deployments configured: if an endpoint does not have virtual deployments configured, it will be sent any request, no matter which deployment is requested.
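For illustration, here is a minimal sketch of the two cases side by side (endpoint names, URLs and keys are placeholders, not from the shipped example config):

aoai:
  endpoints:
    - name: Restricted Endpoint
      url: https://___.openai.azure.com/
      key: ___
      # only requests for the listed virtual deployments are sent here
      virtual_deployments:
        - name: gpt-35-turbo
          standins:
            - name: gpt-35-turbo-paygo

    - name: Catch-all Endpoint
      url: https://___.openai.azure.com/
      key: ___
      # no virtual_deployments item = this endpoint is sent any request,
      # no matter which deployment is requested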

Besides, I have added a new plugin named "AllowDeployments", which can be used to restrict access to certain deployments. If enabled, clients can only use the virtual deployments specified for them.

clients:
  - name: Team 1
    description: An example team named 'Team 1'.
    key: ___
    deployments_allowed: gpt-35-turbo, gpt-4-turbo

I think/hope that addresses your feedback. If not, let me know. I will do some more testing and still need to update the documentation, but once that is done, I will release a new version soon. Enjoy 🙂

Closing; the fixes for this issue have been committed.

Thanks @timoklimmer, just been testing this out locally. A couple of questions:

  1. How does this work with max_tokens_per_minute_in_k for a client? It doesn't make sense to have the same rate limit for a given client across different models. For example, you might want a client to have a much higher rate limit for embeddings or gpt-35 than for gpt-4.

It would be great to set it at a model/deployment level for each client:

deployments_allowed: gpt-35-turbo [50], gpt-4-turbo [10]

  2. We might have 3 endpoints: openai-001.com, openai-002.com and openai-003.com. However, only 001 and 003 might have quota for a particular model, so we would not want any requests for that model to go to 002.com. Currently, the endpoint non_streaming_fraction would still load balance across all 3 endpoints as per the endpoint setting. It would be ideal if you could set a list of available models per endpoint and a non_streaming_fraction for each deployment.

For example, for gpt-4-turbo-125-preview we might have a large quota in endpoint openai-001.com, none in openai-002.com, and a smaller quota in openai-003.com.

So we might want non_streaming_fractions of 0.7, 0, and 0.3 for this particular model - similar to virtual deployments, but across endpoints rather than within a single endpoint.

Hope this makes sense.

Hi @sterankin, I suggest taking a look at the example config file, config.example.yaml.

The max_tokens_per_minute_in_k setting belongs to the LimitUsage plugin and already has what you are asking for. In addition to setting TPM limits for all deployments, you can now also specify (virtual) deployment-specific limits, like so:

    max_tokens_per_minute_in_k:
      gpt-35-turbo: 20
      gpt-4-turbo: 5

whereby gpt-35-turbo and gpt-4-turbo are the names of virtual deployments.
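Putting the two plugins together, a client entry could then look roughly like the following sketch (team name and key are placeholders; config.example.yaml remains the authoritative reference for the exact placement):

clients:
  - name: Team 1
    key: ___
    # AllowDeployments plugin: only these virtual deployments may be used by this client
    deployments_allowed: gpt-35-turbo, gpt-4-turbo
    # LimitUsage plugin: TPM limits in thousands of tokens, per virtual deployment
    max_tokens_per_minute_in_k:
      gpt-35-turbo: 20
      gpt-4-turbo: 5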

Load balancing/non-streaming-fraction for deployments rather than endpoints is also supported already.

From the example config:

      virtual_deployments:
        - name: gpt-35-turbo
          standins:
            - name: gpt-35-turbo-ptu
              non_streaming_fraction: 0.2
            - name: gpt-35-turbo-paygo
        - name: gpt-4-turbo
          standins:
            - name: gpt-4-turbo-ptu
              non_streaming_fraction: 0.2
            - name: gpt-4-turbo-paygo

If you specify virtual deployments with the same name in multiple endpoints, it will load balance across the deployments from all those endpoints.
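To make that concrete, here is a minimal sketch (endpoint names, URLs and keys are placeholders): the same virtual deployment name gpt-4-turbo is defined on two endpoints, and the per-standin non_streaming_fraction controls how often each standin handles a non-streaming request before PowerProxy passes it on to the next standin/endpoint:

aoai:
  endpoints:
    - name: openai-001  # large gpt-4-turbo quota
      url: https://___.openai.azure.com/
      key: ___
      virtual_deployments:
        - name: gpt-4-turbo
          standins:
            - name: gpt-4-turbo-paygo
              non_streaming_fraction: 0.7

    - name: openai-003  # smaller gpt-4-turbo quota
      url: https://___.openai.azure.com/
      key: ___
      virtual_deployments:
        - name: gpt-4-turbo
          standins:
            - name: gpt-4-turbo-paygo  # no fraction given = 1, handles whatever is passed on

An endpoint like openai-002 with no quota for this model would simply define its own virtual_deployments list without a gpt-4-turbo entry and therefore never receive requests for it (keeping in mind that an endpoint with no virtual_deployments at all is sent any request).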

Thanks @timoklimmer, I had an older version of the example config locally. This is great, especially the clarification that if we have virtual deployments with the same name on each endpoint, then the load balancing for each deployment will be based on the non-streaming fraction. 👍 I thought the non-streaming fraction in a standin was used to load balance across the different standins within a single virtual deployment on one endpoint, rather than across all endpoints.