bricks-cloud / BricksLLM

đź”’ Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.

Home Page: https://trybricks.ai/


Investigate issues related to the streaming mode

spikelu2016 opened this issue

I noticed that [streaming mode](https://platform.openai.com/docs/api-reference/streaming) for chat completions is much less fluent through the proxy than with the normal OpenAI API. The normal API delivers many (5-10, I'd guess) chunks per second to the client, while the proxy seems to update the response only about once per second, without regard for individual chunks. Would it be complicated to fix that?
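For reference, here is the kind of throwaway client I used to eyeball the gap between SSE chunks; point it at the proxy and at the API directly and compare. The endpoint path, port, and `BRICKS_API_KEY` env var are just assumptions about a local setup, not anything from the repo:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

func main() {
	body := `{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"Write a haiku about rivers."}]}`
	req, err := http.NewRequest("POST",
		"http://localhost:8002/api/providers/openai/v1/chat/completions", // adjust to your deployment
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("BRICKS_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// SSE chunks arrive as "data: {...}" lines; print the delay between them.
	scanner := bufio.NewScanner(resp.Body)
	last := time.Now()
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		now := time.Now()
		fmt.Printf("+%4d ms  %.60s\n", now.Sub(last).Milliseconds(), line)
		last = now
	}
}
```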

(I only took a short look at the [implementation](https://github.com/bricks-cloud/BricksLLM/blob/325f1d88315411e75ac9aadf7c96b468b37eb66e/internal/server/web/proxy.go#L770-L826); maybe the buffer size is too large, or synchronous cost estimation takes too much time? Of course I don't have deep knowledge of your codebase, even though it looks nice to read 🙂)
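Just to illustrate the idea, not the actual proxy.go code (names and structure here are made up): forward and flush each SSE line as soon as it arrives from the upstream, and hand the raw chunks to a separate goroutine so cost estimation never delays the stream.

```go
package sketch

import (
	"bufio"
	"io"
	"net/http"
)

// forwardStream copies SSE events from the upstream response to the client,
// flushing after every line instead of letting them pool in a buffer.
func forwardStream(w http.ResponseWriter, upstream io.Reader) {
	flusher, canFlush := w.(http.Flusher)

	// Hand raw chunks to a background goroutine for token counting / cost
	// recording, so the accounting work stays off the streaming hot path.
	chunks := make(chan []byte, 64)
	go func() {
		for range chunks {
			// estimate tokens and record cost here
		}
	}()
	defer close(chunks)

	reader := bufio.NewReader(upstream)
	for {
		// SSE events are newline-delimited; forward each line as it is read.
		line, err := reader.ReadBytes('\n')
		if len(line) > 0 {
			w.Write(line)
			if canFlush {
				flusher.Flush() // push the chunk to the client immediately
			}
			chunks <- line
		}
		if err != nil {
			return
		}
	}
}
```

The two things that seem to matter are using `http.Flusher` so chunks don't sit in the ResponseWriter's buffer, and keeping cost estimation asynchronous; but again, this is only a sketch of what I had in mind, you know the real constraints better.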