microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

How to implement support for concurrency?

CharlinChen opened this issue · comments

The phi3-qa.py example demonstrates how to use ONNX Runtime to accelerate single-user inference. How can we support concurrency when multiple users send requests at the same time? Instantiating a separate generator for each user seems impractical. Batching the inputs is the obvious approach, but how can the second user's prompt be incorporated while streaming inference for the first user is already in progress?

Hi @CharlinChen, do you need to stream the output for each input in the batch? If not, you can batch multiple prompts and run them without streaming. You can see an example here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-generate.py
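
For reference, a minimal sketch of batched, non-streaming generation along the lines of the linked model-generate.py example. The model path and prompts are placeholders, and the exact onnxruntime-genai API calls vary between releases, so treat the parameter names and methods as illustrative:

```python
import onnxruntime_genai as og

# Load the model and tokenizer once, then reuse them for every batch.
model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

prompts = [
    "What is the capital of France?",
    "Explain beam search in one sentence.",
]

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
# Encode all prompts together so they run as one padded batch.
params.input_ids = tokenizer.encode_batch(prompts)

# Generate complete sequences for the whole batch in one call (no streaming).
output_tokens = model.generate(params)
for i, prompt in enumerate(prompts):
    print(f"Prompt {i}: {prompt}")
    print(tokenizer.decode(output_tokens[i]))
```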

@natke Yes, I do need to handle the second user's request while the first user's output is still streaming, because it is impractical to keep multiple users waiting so that their prompts can be gathered into the same batch. In theory it seems feasible to interrupt the first user's generation when the second request arrives, treat the tokens generated so far as part of the first user's prompt, merge the second user's request into a new batched input, and continue inference from there. A rough sketch of that idea is below.
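
To make that concrete, here is a sketch of the interrupt-and-rebatch idea. This is not a built-in feature of onnxruntime-genai: the helper name is hypothetical, and the streaming loop follows the pattern used in phi3-qa.py, which differs between releases (older versions also require a compute_logits call each step):

```python
import onnxruntime_genai as og

def rebatch_with_new_request(model, tokenizer, first_prompt, first_tokens_so_far,
                             second_prompt, max_length=512):
    """Hypothetical helper: stop the first user's single-prompt run and start a
    new batched run that carries over the tokens already generated."""
    # The first user's "prompt" becomes the original prompt plus the partial
    # output, so the batched run resumes roughly where the single run stopped.
    first_text = first_prompt + tokenizer.decode(first_tokens_so_far)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode_batch([first_text, second_prompt])
    return og.Generator(model, params)

# Streaming loop over the new batch (illustrative; exact calls depend on the
# installed release):
# generator = rebatch_with_new_request(...)
# while not generator.is_done():
#     generator.generate_next_token()
#     tokens = generator.get_next_tokens()  # one new token per sequence in the batch
```

Note that this workaround discards the first run's KV cache and recomputes the first user's prefix from scratch when the batch is rebuilt, which is its main cost; avoiding that recomputation would need scheduling support inside the runtime itself.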