sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.


Does sglang do automatic batching?

vedantroy opened this issue · comments

If I hit an sglang server in parallel with 100 requests, will it automatically batch the requests to do as many in parallel as possible?

I ask because, when using asyncio to issue 100 parallel requests, I see the following output a lot:

```
new fill batch. #seq: 1. #cached_token: 330. #new_token: 2172. #remaining_req: 57. #running_req: 9. tree_cache_hit_rate: 69.16%
```

This suggests to me that I'm mostly getting batches of sequence length 1, although it's possible that this is the maximum batch size my GPU supports.
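For context, here is a minimal sketch of the kind of client I'm running. The endpoint name, payload shape, and the `send_request` helper are illustrative placeholders, not SGLang's exact API; the point is that all 100 coroutines are launched at once, so the server receives the requests together:

```python
import asyncio

async def send_request(prompt):
    # Placeholder for the real network call; with aiohttp it would be
    # roughly:
    #   async with session.post("http://localhost:30000/generate",
    #                           json={"text": prompt}) as resp:
    #       return await resp.json()
    await asyncio.sleep(0)  # stands in for network I/O
    return {"text": prompt}

async def main(n=100):
    # All requests are created up front and awaited together, so the
    # server's scheduler is free to batch them however it likes.
    tasks = [send_request(f"prompt {i}") for i in range(n)]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```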

@vedantroy Sure, of course — we do automatic batching for both the prefill and decode phases.

Prefill batching has two constraints:

  1. The tokens of the newly prefilled requests plus the tokens of the running requests must not exceed GPU memory. The log line only shows the newly prefilled request; there may be many other requests decoding at the same time, and those also occupy GPU memory.
  2. We do not prefill too many tokens at one time. This is better for our cache-aware scheduling and does not decrease performance.
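The two constraints above can be sketched as a simple packing loop. This is a hypothetical illustration, not SGLang's actual scheduler: the function name, the token-budget parameters, and the concrete numbers are all made up to show why a prefill batch can end up with `#seq: 1` when running requests already hold most of the memory:

```python
def build_prefill_batch(waiting, running_tokens, mem_budget, max_prefill_tokens):
    """Pick requests from `waiting` (a list of (req_id, new_token_count))
    to prefill this step, subject to the two constraints."""
    batch, new_tokens = [], 0
    for req_id, n_new in waiting:
        # Constraint 1: new tokens plus tokens held by running requests
        # must fit in GPU memory.
        if running_tokens + new_tokens + n_new > mem_budget:
            break
        # Constraint 2: cap how many tokens are prefilled in one step.
        if new_tokens + n_new > max_prefill_tokens:
            break
        batch.append(req_id)
        new_tokens += n_new
    return batch, new_tokens

# If running requests already hold 7000 of an 8192-token budget, only
# ~1192 new tokens fit, so few requests can be prefilled together.
batch, n = build_prefill_batch(
    waiting=[("a", 500), ("b", 600), ("c", 400)],
    running_tokens=7000,
    mem_budget=8192,
    max_prefill_tokens=2048,
)
```

With these illustrative numbers, requests `a` and `b` fit (7000 + 1100 ≤ 8192) but adding `c` would exceed the budget, so `c` waits for the next step.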