Does sglang do automatic batching?
vedantroy opened this issue
If I hit an sglang server in parallel with 100 requests, will it automatically batch the requests to do as many in parallel as possible?
I ask because, when using asyncio to send 100 parallel requests, I often see output like the following:
```
new fill batch. #seq: 1. #cached_token: 330. #new_token: 2172. #remaining_req: 57. #running_req: 9. tree_cache_hit_rate: 69.16%
```
This suggests to me that I'm mostly getting batches of one sequence. Although it's possible that, on my GPU, that is the maximum supported batch size.
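For reference, the parallel-request pattern described above can be sketched with asyncio. The stub coroutine below stands in for the actual HTTP call (a real client would POST each prompt to the sglang server with an async HTTP library such as aiohttp); everything else is the dispatch pattern itself:

```python
import asyncio

# Hypothetical stub standing in for an HTTP POST to the sglang server;
# it only simulates per-request latency so the pattern runs offline.
async def send_request(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network + generation time
    return f"completion for: {prompt}"

async def main() -> list[str]:
    # Fire 100 requests concurrently; the server is then free to
    # batch them however its scheduler decides.
    prompts = [f"prompt {i}" for i in range(100)]
    return await asyncio.gather(*(send_request(p) for p in prompts))

results = asyncio.run(main())
```

Because all coroutines are awaited together with `asyncio.gather`, the server receives the requests nearly simultaneously rather than one after another.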
@vedantroy Yes, of course. We do automatic batching for both the prefill and decode phases.
Prefill batching has two constraints:
- The tokens of the newly prefilled requests plus those of the running requests must not exceed GPU memory. You only see the newly prefilled requests in the log line; there may be many other requests decoding at the same time, and they also occupy GPU memory.
- We do not prefill too many tokens at once. This is better for our cache-aware scheduling and does not decrease performance.
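The two constraints can be illustrated with a toy admission check. The function, parameter names, and numeric limits below are hypothetical, not sglang's actual scheduler code:

```python
def admit_prefill(waiting, running_tokens, mem_budget, max_prefill_tokens):
    """Greedily pick waiting requests for the next prefill batch.

    waiting: list of (req_id, new_token_count) in queue order.
    running_tokens: tokens already held by decoding requests (constraint 1:
        new tokens plus running tokens must fit in the memory budget).
    max_prefill_tokens: cap on tokens prefilled in one batch (constraint 2).
    """
    batch, batch_tokens = [], 0
    for req_id, n_tokens in waiting:
        fits_memory = running_tokens + batch_tokens + n_tokens <= mem_budget
        under_cap = batch_tokens + n_tokens <= max_prefill_tokens
        if fits_memory and under_cap:
            batch.append(req_id)
            batch_tokens += n_tokens
        else:
            break  # keep queue order; stop at the first request that won't fit
    return batch

# With many decoding requests already holding memory, only one waiting
# request may fit, which matches the "#seq: 1" lines in the log above.
# All numbers here are made up for illustration.
batch = admit_prefill(
    waiting=[("a", 2172), ("b", 3000)],
    running_tokens=9 * 700,   # hypothetical: 9 running requests
    mem_budget=9000,
    max_prefill_tokens=4096,
)
```

Under these made-up numbers only request `"a"` is admitted, because adding `"b"` would exceed the memory budget shared with the decoding requests.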