adding token limits per client per model
krahnikblis opened this issue · comments
hello! i've been getting this thing up and running on a VM between my team's apps and our Azure OAI service - so far it's working nicely! but my resource groups and quotas mean i have wildly different token limits per model (5K/min on GPT-4 and 30K/min on GPT-3.5 and the embedding models), so i need to be able to configure clients with limits per model. i made some adjustments to the config.local.yaml structure and the LimitUsage.py file, and things appear to be working as desired, so i thought i'd share and request the feature be implemented - that way, the next time i git-pull your latest enhancements, i won't need to re-edit the code. i don't yet know how to use github's PR features, so i'm pasting in the relevant bits here. there's definitely a more elegant way to do this, but there's also a lot of nesting and subclassing, and i just wanted to get things moving, so this is how i did it:
in the config.local.yaml file, under each client, i added a models key, like so:
clients:
  - name: powerautomate
    description: for instances of http calls from PA flows
    key: derpyderpydoo
    max_tokens_per_minute_in_k: 1
    models:
      - name: gpt-4-32k
        max_tokens_per_minute_in_k: 1
      - name: gpt-35-turbo-16k
        max_tokens_per_minute_in_k: 6
leaving the existing max_tokens_per_minute_in_k in place means your structure is untouched, and these changes are backward-compatible with configs that don't have the models key.
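to illustrate the backward compatibility: here's a tiny sketch (with hypothetical sample entries mirroring the two config shapes) showing that an old-style client entry without a models key still works, because dict.get simply returns None for it:

```python
# hypothetical sample entries mirroring the two config shapes from the issue
old_style = {"name": "legacyclient", "max_tokens_per_minute_in_k": 2}
new_style = {
    "name": "powerautomate",
    "max_tokens_per_minute_in_k": 1,
    "models": [{"name": "gpt-35-turbo-16k", "max_tokens_per_minute_in_k": 6}],
}

for cfg in (old_style, new_style):
    # .get() returns None when "models" is absent, so old configs keep working
    per_model = cfg.get("models")
    print(cfg["name"], "has per-model limits:", per_model is not None)
```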
in LimitUsage.py, inside of on_client_identified(self, routing_slip), i added routing_slip to the call to the tokens-per-client function:

self._set_cache_setting(
    f"LimitUsage-{client}-budget",
    self._get_max_tokens_per_minute_in_k_for_client(client, routing_slip),
)
and then redefined the function itself to take that new parameter, get the model being used in the request, and look it up against the client_settings, which seamlessly populated the models list using your existing Configuration class:
def _get_max_tokens_per_minute_in_k_for_client(self, client, routing_slip):
    """Return the number of maximum tokens per minute in thousands for the given client."""
    client_settings = self.app_configuration.get_client_settings(client)
    if client not in self.configured_max_tpms:
        if "max_tokens_per_minute_in_k" not in client_settings:
            raise ImmediateResponseException(
                Response(
                    content=(
                        f"Configuration for client '{client}' misses a "
                        "'max_tokens_per_minute_in_k' setting. This needs to be set when the "
                        "LimitUsage plugin is enabled."
                    ),
                    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                )
            )
        self.configured_max_tpms[client] = int(
            float(client_settings["max_tokens_per_minute_in_k"]) * 1000
        )
    client_models = client_settings.get("models")
    incoming = routing_slip["incoming_request_body_dict"]
    if client_models is not None and incoming is not None:
        model = incoming.get("model") or incoming.get("model_name")
        if model is not None:
            limits = {m["name"]: m["max_tokens_per_minute_in_k"] for m in client_models}
            client_model_limit = limits.get(model)
            if client_model_limit is not None and client_model_limit > 0:
                return int(float(client_model_limit) * 1000)
    return self.configured_max_tpms[client]
the changes are, of course, the new routing_slip parameter and the section beginning with client_models: if the request has the model param (as it should), the client has the models key in its settings, and that model exists in the client's specific limits, the function returns the model-specific limit; otherwise it returns the class's existing configured_max_tpms for the client.
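for anyone who wants to try the fallback logic outside the proxy, here's a self-contained sketch of the same lookup (resolve_budget and the sample data are hypothetical, just mirroring the config above - not the plugin's actual code):

```python
# hypothetical client settings mirroring the config.local.yaml example above
client_settings = {
    "max_tokens_per_minute_in_k": 1,
    "models": [
        {"name": "gpt-4-32k", "max_tokens_per_minute_in_k": 1},
        {"name": "gpt-35-turbo-16k", "max_tokens_per_minute_in_k": 6},
    ],
}

def resolve_budget(client_settings, request_body):
    """Return the per-minute token budget for a request, preferring a
    model-specific limit and falling back to the client-wide one."""
    default = int(float(client_settings["max_tokens_per_minute_in_k"]) * 1000)
    models = client_settings.get("models")
    # requests may carry the model under "model" or "model_name"
    model = (request_body or {}).get("model") or (request_body or {}).get("model_name")
    if models and model:
        limits = {m["name"]: m["max_tokens_per_minute_in_k"] for m in models}
        limit = limits.get(model)
        if limit is not None and limit > 0:
            return int(float(limit) * 1000)
    return default

print(resolve_budget(client_settings, {"model": "gpt-35-turbo-16k"}))  # 6000
print(resolve_budget(client_settings, {"model": "unknown-model"}))     # 1000
```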
**edit: made some changes to where client_settings is collected and client_models is referenced
i've also kept some notes on how i set this up on Docker (it was a challenge as i'm relatively new to it) and would be happy to share them as a write-up. i'm also working on a LogUsageMessagesToJSON plugin, since i want our usage histories to be searchable for analysis and building a knowledge graph... would be happy to share that plugin as well if you're interested, once i turn all the bugs into features...
Hi @krahnikblis, thanks for sharing. Unfortunately, I cannot accept change requests other than pull requests, but I can take a look at your PR(s) once submitted. Thanks!
Hey @krahnikblis, quick update: I have extended the LimitUsage plugin in the main branch. You can now also configure things like:
max_tokens_per_minute_in_k:
  gpt-35-turbo: 50
  gpt-4-turbo: 5
in addition to just
max_tokens_per_minute_in_k: 20
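In case it helps, the two shapes can be resolved along these lines (resolve_limit is an illustrative sketch, not the plugin's actual code): a scalar applies to every model, while a mapping is looked up per model.

```python
def resolve_limit(configured, model):
    """Return the per-minute token budget for a model, given either a scalar
    limit or a per-model mapping. Illustrative sketch, not the plugin's code."""
    if isinstance(configured, dict):
        # mapping form: look the model up; None means no limit configured for it
        limit_in_k = configured.get(model)
    else:
        # scalar form: one limit shared by every model
        limit_in_k = configured
    return None if limit_in_k is None else int(float(limit_in_k) * 1000)

print(resolve_limit({"gpt-35-turbo": 50, "gpt-4-turbo": 5}, "gpt-4-turbo"))  # 5000
print(resolve_limit(20, "gpt-4-turbo"))  # 20000
```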
I think that solves your issue. If not, please let me know. I will include the update in the next release.