bricks-cloud / BricksLLM

đź”’ Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.

Home Page: https://trybricks.ai/


Investigate issues related to the streaming mode

spikelu2016 opened this issue

I noticed that [streaming mode](https://platform.openai.com/docs/api-reference/streaming) for chat completions is much less fluent through the proxy than with the normal OpenAI API. The normal API delivers many (5-10, I'd guess) chunks per second to the client, while the proxy seems to update the response only about once per second, without regard for individual chunks. Would it be complicated to fix that?
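For reference, here is the kind of throwaway client I used to eyeball the gap between SSE chunks; point it at the proxy and at the API directly and compare. The endpoint path, port, and `BRICKS_API_KEY` env var are just assumptions about a local setup, not anything from the repo:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

func main() {
	body := `{"model":"gpt-3.5-turbo","stream":true,"messages":[{"role":"user","content":"Write a haiku about rivers."}]}`
	req, err := http.NewRequest("POST",
		"http://localhost:8002/api/providers/openai/v1/chat/completions", // adjust to your deployment
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("BRICKS_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// SSE chunks arrive as "data: {...}" lines; print the delay between them.
	scanner := bufio.NewScanner(resp.Body)
	last := time.Now()
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue
		}
		now := time.Now()
		fmt.Printf("+%4d ms  %.60s\n", now.Sub(last).Milliseconds(), line)
		last = now
	}
}
```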

(I only took a short look at the [implementation](https://github.com/bricks-cloud/BricksLLM/blob/325f1d88315411e75ac9aadf7c96b468b37eb66e/internal/server/web/proxy.go#L770-L826); maybe the buffer size is too large, or synchronous cost estimation takes too much time? Of course I don't have deep knowledge of your codebase, even though it looks nice to read 🙂)
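Just to illustrate the idea, not the actual proxy.go code (names and structure here are made up): forward and flush each SSE line as soon as it arrives from the upstream, and hand the raw chunks to a separate goroutine so cost estimation never delays the stream.

```go
package sketch

import (
	"bufio"
	"io"
	"net/http"
)

// forwardStream copies SSE events from the upstream response to the client,
// flushing after every line instead of letting them pool in a buffer.
func forwardStream(w http.ResponseWriter, upstream io.Reader) {
	flusher, canFlush := w.(http.Flusher)

	// Hand raw chunks to a background goroutine for token counting / cost
	// recording, so the accounting work stays off the streaming hot path.
	chunks := make(chan []byte, 64)
	go func() {
		for range chunks {
			// estimate tokens and record cost here
		}
	}()
	defer close(chunks)

	reader := bufio.NewReader(upstream)
	for {
		// SSE events are newline-delimited; forward each line as it is read.
		line, err := reader.ReadBytes('\n')
		if len(line) > 0 {
			w.Write(line)
			if canFlush {
				flusher.Flush() // push the chunk to the client immediately
			}
			chunks <- line
		}
		if err != nil {
			return
		}
	}
}
```

The two things that seem to matter are using `http.Flusher` so chunks don't sit in the ResponseWriter's buffer, and keeping cost estimation asynchronous; but again, this is only a sketch of what I had in mind, you know the real constraints better.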