timoklimmer / powerproxy-aoai

Monitors and processes traffic to and from Azure OpenAI endpoints.

adding token limits per client per model

krahnikblis opened this issue

hello! i've been getting this thing up and running on a VM between my team and apps, and our Azure OAI service - so far it's working nicely! but my resource groups and quotas mean i have wildly different token limits per model (5K/min on GPT-4, 30K/min on GPT-3.5 and the embedding models), so i need each client to be configurable with its own limits per model. i made some adjustments to the config.local.yaml structure and the LimitUsage.py file, and things appear to be working as desired, so i thought i'd share and request the feature be implemented - that way, the next time i git-pull your latest enhancements i won't need to re-edit the code. i don't yet know how to use github's PR features, so i'm pasting the relevant bits here. there's definitely a more elegant way to do this, but there's also a lot of nesting and subclassing and i just wanted to get things moving, so this is how i did it:

in the config.local.yaml file, under each client, i added a models key, like so:

clients:
  - name: powerautomate
    description: for instances of http calls from PA flows
    key: derpyderpydoo
    max_tokens_per_minute_in_k: 1
    models:
    - name: gpt-4-32k
      max_tokens_per_minute_in_k: 1
    - name: gpt-35-turbo-16k
      max_tokens_per_minute_in_k: 6

leaving the existing max_tokens_per_minute_in_k in place means your structure is untouched, and the change is backward-compatible with configs that don't have the models key.

in LimitUsage.py, inside on_client_identified(self, routing_slip), i added routing_slip to the call to the tokens-per-client function:

            self._set_cache_setting(
                f"LimitUsage-{client}-budget",
                self._get_max_tokens_per_minute_in_k_for_client(client, routing_slip),
            )

and then i redefined the function itself to take that new parameter, get the model being used in the request, and look it up against the client_settings, which seamlessly includes the models list thanks to your existing Configuration class:

    def _get_max_tokens_per_minute_in_k_for_client(self, client, routing_slip):
        """Return the number of maximum tokens per minute in thousands for the given client."""
        client_settings = self.app_configuration.get_client_settings(client)
        if client not in self.configured_max_tpms:
            if "max_tokens_per_minute_in_k" not in client_settings:
                raise ImmediateResponseException(
                    Response(
                        content=(
                            f"Configuration for client '{client}' misses a "
                            "'max_tokens_per_minute_in_k' setting. This needs to be set when the "
                            "LimitUsage plugin is enabled."
                        ),
                        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                    )
                )
            self.configured_max_tpms[client] = int(
                float(client_settings["max_tokens_per_minute_in_k"]) * 1000
            )
        client_models = client_settings.get("models")
        if client_models is not None and routing_slip["incoming_request_body_dict"] is not None:
            # take the model name from the request body (some callers send it as "model_name")
            body = routing_slip["incoming_request_body_dict"]
            model = body.get("model") or body.get("model_name")
            if model is not None:
                # map the client's per-model limits by model name and prefer an exact match
                limits_by_model = {m["name"]: m["max_tokens_per_minute_in_k"] for m in client_models}
                client_model_limit = limits_by_model.get(model)
                if client_model_limit is not None and float(client_model_limit) > 0:
                    return int(float(client_model_limit) * 1000)
        return self.configured_max_tpms[client]

the changes are, of course, the added routing_slip parameter and the section beginning with client_models: if the request has the model param (as it should), the client has the models key in its settings, and that model appears in the client's specific limits, the function returns the model-specific limit; otherwise it returns the class's existing configured_max_tpms for the client.
**edit:** made some changes to where client_settings is collected and client_models is referenced
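
to make the fallback behavior concrete, here's a tiny standalone sketch of just that lookup logic (not the plugin itself - resolve_limit and the sample settings below are made up purely for illustration):

    # standalone sketch of the fallback logic only - resolve_limit and the sample
    # settings are made up for illustration and are not part of the plugin
    client_settings = {
        "max_tokens_per_minute_in_k": 1,
        "models": [
            {"name": "gpt-4-32k", "max_tokens_per_minute_in_k": 1},
            {"name": "gpt-35-turbo-16k", "max_tokens_per_minute_in_k": 6},
        ],
    }

    def resolve_limit(settings, request_body):
        """Return the per-minute token budget, preferring a model-specific limit."""
        fallback = int(float(settings["max_tokens_per_minute_in_k"]) * 1000)
        body = request_body or {}
        model = body.get("model") or body.get("model_name")
        for entry in settings.get("models") or []:
            if entry["name"] == model and float(entry["max_tokens_per_minute_in_k"]) > 0:
                return int(float(entry["max_tokens_per_minute_in_k"]) * 1000)
        return fallback

    print(resolve_limit(client_settings, {"model": "gpt-35-turbo-16k"}))        # 6000
    print(resolve_limit(client_settings, {"model": "text-embedding-ada-002"}))  # 1000 (client-wide fallback)

a request for gpt-35-turbo-16k gets the 6K budget, and anything not listed under models falls back to the client-wide limit.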

i've also kept some notes on how i set this up with Docker (it was a challenge as i'm relatively new to it) and am happy to share them as a write-up. i'm also working on a LogUsageMessagesToJSON plugin, since i want our usage histories to be searchable for analysis and building a knowledge graph - happy to share that plugin as well, if you're interested, once i turn all the bugs into features...

Hi @krahnikblis, thanks for sharing. Unfortunately, I cannot accept change requests other than pull requests, but I can take a look at your PR(s) once submitted. Thanks!

Hey @krahnikblis, quick update: I have extended the LimitUsage plugin in the main branch. You can now also configure things like:

    max_tokens_per_minute_in_k:
      gpt-35-turbo: 50
      gpt-4-turbo: 5

in addition to just

    max_tokens_per_minute_in_k: 20

I think that solves your issue. If not, please let me know. I will include the update in the next release.
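
For reference, using your example client from above, the per-model form sits under each client just like the scalar form did, roughly like this (the limits here are placeholders):

    clients:
      - name: powerautomate
        description: for instances of http calls from PA flows
        key: derpyderpydoo
        max_tokens_per_minute_in_k:
          gpt-35-turbo: 50
          gpt-4-turbo: 5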