sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.


Does sglang do automatic batching?

vedantroy opened this issue · comments

If I hit an sglang server in parallel with 100 requests, will it automatically batch the requests to do as many in parallel as possible?

I ask because, when using asyncio to issue 100 parallel requests, I see the following output a lot:

```
new fill batch. #seq: 1. #cached_token: 330. #new_token: 2172. #remaining_req: 57. #running_req: 9. tree_cache_hit_rate: 69.16%
```

This suggests to me that I'm mostly getting batches of sequence length 1, although it's possible that this is the maximum batch size my GPU supports.
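For context, here is a minimal sketch of the kind of client I'm running. The endpoint name, payload shape, and the `send_request` helper are illustrative placeholders, not SGLang's exact API; the point is that all 100 coroutines are launched at once, so the server receives the requests together:

```python
import asyncio

async def send_request(prompt):
    # Placeholder for the real network call; with aiohttp it would be
    # roughly:
    #   async with session.post("http://localhost:30000/generate",
    #                           json={"text": prompt}) as resp:
    #       return await resp.json()
    await asyncio.sleep(0)  # stands in for network I/O
    return {"text": prompt}

async def main(n=100):
    # All requests are created up front and awaited together, so the
    # server's scheduler is free to batch them however it likes.
    tasks = [send_request(f"prompt {i}") for i in range(n)]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```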

@vedantroy Sure, of course — we do automatic batching for both the prefill and decode phases.

Prefill batching has two constraints:

  1. The tokens of the newly prefilled requests plus the tokens of the running requests must not exceed GPU memory. The log line only shows the newly prefilled request; there may be many other requests decoding at the same time, and those also occupy GPU memory.
  2. We do not prefill too many tokens at one time. This is better for our cache-aware scheduling and does not decrease performance.
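The two constraints above can be sketched as a simple packing loop. This is a hypothetical illustration, not SGLang's actual scheduler: the function name, the token-budget parameters, and the concrete numbers are all made up to show why a prefill batch can end up with `#seq: 1` when running requests already hold most of the memory:

```python
def build_prefill_batch(waiting, running_tokens, mem_budget, max_prefill_tokens):
    """Pick requests from `waiting` (a list of (req_id, new_token_count))
    to prefill this step, subject to the two constraints."""
    batch, new_tokens = [], 0
    for req_id, n_new in waiting:
        # Constraint 1: new tokens plus tokens held by running requests
        # must fit in GPU memory.
        if running_tokens + new_tokens + n_new > mem_budget:
            break
        # Constraint 2: cap how many tokens are prefilled in one step.
        if new_tokens + n_new > max_prefill_tokens:
            break
        batch.append(req_id)
        new_tokens += n_new
    return batch, new_tokens

# If running requests already hold 7000 of an 8192-token budget, only
# ~1192 new tokens fit, so few requests can be prefilled together.
batch, n = build_prefill_batch(
    waiting=[("a", 500), ("b", 600), ("c", 400)],
    running_tokens=7000,
    mem_budget=8192,
    max_prefill_tokens=2048,
)
```

With these illustrative numbers, requests `a` and `b` fit (7000 + 1100 ≤ 8192) but adding `c` would exceed the budget, so `c` waits for the next step.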