Keayoub / Custom-Rate-Limiter-API


Open AI Cost Gateway Pattern

Real-Time Capabilities:
Track Spending by Product (Cost Chargeback) for every Request
Rate Limit by Product based on Spending Limits (429 Rate Limiting response when the Spending Limit has been reached)
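As a minimal sketch of the two capabilities above, the gateway logic amounts to accumulating per-request cost against a product budget and returning 429 once the budget is exhausted. The product names, per-1K-token prices, and limits below are illustrative assumptions, not the repo's actual configuration; the real gateway keys spend by APIM Product subscription key.

```python
# Sketch of per-product spend tracking with a 429 cutoff.
# Prices and limits are hypothetical examples.

# Assumed per-1K-token prices (USD) and per-product budgets.
PRICE_PER_1K = {"prompt": 0.0015, "completion": 0.002}
SPEND_LIMITS = {"product-a": 10.00}

spend = {}  # running spend per product


def record_request(product: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Accumulate the cost of one request against the product's running spend."""
    cost = (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
         + (completion_tokens / 1000) * PRICE_PER_1K["completion"]
    spend[product] = spend.get(product, 0.0) + cost
    return cost


def check_limit(product: str) -> int:
    """Return the HTTP status the gateway should emit: 200, or 429 once over budget."""
    if spend.get(product, 0.0) >= SPEND_LIMITS.get(product, float("inf")):
        return 429  # spending limit reached
    return 200
```

In the actual solution this check runs per HTTP request inside the gateway, so the 429 takes effect on the first request after the limit is crossed.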

Architecture

AI Cost Gateway

Open AI Service, Real-Time Cost Tracking And Rate Limiting Per HTTP Request (by Product)

Picture1


Additional Capabilities - Any Service, Rate Limiting based on Budget (by Product) and Event Hub Logging

Additional Capabilities:
Rate Limiting based on Budget Alerts

Logging via Event Hubs to Data Lake Hub

Picture2


High Level Architecture of all Features in the repo


Open AI Transactional Cost Tracking and Rate limiting
Budget Alert Rate Limiting
Event Hub Logging

AI Gateway

Streaming Capabilities

Streaming responses do not include token usage information, so it must be calculated.
Prompt Tokens are calculated using an additional Python Function API wrapper that uses TikToken:

https://github.com/awkwardindustries/dossier/tree/main/samples/open-ai/tokenizer/azure-function-python-v2

Methods

  1. Create
  2. Update
  3. Budget Alert Endpoint
  4. GetAll
  5. GetById
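The Budget Alert Endpoint's job, as described elsewhere in this README, is to flip a product into a rate-limited state when Azure raises a budget alert. A hypothetical sketch of that handler; the payload field names below are assumptions for illustration, not the actual Azure budget alert schema:

```python
# Hypothetical sketch of the Budget Alert endpoint: on an alert,
# mark the product as over budget so the gateway returns 429.
# The JSON field names ("product", "thresholdReached") are assumed.
import json

rate_limited = set()  # products currently blocked by a budget alert


def handle_budget_alert(body: str) -> None:
    """Parse an alert payload and mark the named product as over budget."""
    alert = json.loads(body)
    product = alert["product"]         # assumed field name
    if alert.get("thresholdReached"):  # assumed field name
        rate_limited.add(product)


def is_blocked(product: str) -> bool:
    return product in rate_limited
```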

AOAI Swagger

Repo:
azure-rest-api-specs/specification/cognitiveservices/data-plane/AzureOpenAI/

JSON Repo: https://github.com/Azure/azure-rest-api-specs/blob/main/specification/cognitiveservices/data-plane/AzureOpenAI/inference/stable/2023-05-15/inference.json

JSON File URI: https://raw.githubusercontent.com/Azure/azure-rest-api-specs/main/specification/cognitiveservices/data-plane/AzureOpenAI/inference/stable/2023-05-15/inference.json


Budget Alerts

Latency:
Cost and usage data is typically available within 8-24 hours, and budgets are evaluated against these costs every 24 hours.


Documentation:
https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets

FAQ

Cost API:
Attempted this, but it proved to be overly complicated. Cost and usage data is typically available within 8-24 hours, so a polling mechanism would have to be created to call the Cost API for each resource to be monitored.

Streaming Responses:
When "stream": true is added to the JSON payload, no token usage information is provided by the Open AI Service.
Prompt Tokens are calculated using a Python Function (PyTokenizer) that wraps the BPE tokenizer library TikToken.
Completion Tokens are calculated by counting the SSE responses and subtracting 2.
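A rough sketch of that completion-token rule, assuming each SSE `data:` event carries one token delta and that the two subtracted events are the initial role-only chunk and the terminating `[DONE]` marker (an assumption about why the offset is 2):

```python
# Sketch of completion-token counting over an SSE body: count the
# "data:" events and subtract 2, per the rule above. The sample
# stream is illustrative, not a captured Azure OpenAI response.

def count_completion_tokens(sse_body: str) -> int:
    """Count 'data:' events in an SSE body and subtract 2."""
    events = [line for line in sse_body.splitlines()
              if line.startswith("data:")]
    return max(len(events) - 2, 0)


sample_stream = "\n".join([
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',  # role-only chunk
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    'data: [DONE]',
])
```

With this assumption, the sample stream above counts as 2 completion tokens ("Hello" and " world").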

Granularity of Cost Tracking:
The solution uses APIM Product subscription keys, but it can also key on individual IDs, header values, etc.

About

License: MIT License


Languages

Language: C# 68.8%
Language: HCL 30.3%
Language: PostScript 0.9%