microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

How to implement support for concurrency?

CharlinChen opened this issue · comments

The phi3-qa.py example demonstrates how to use ONNX Runtime to accelerate single-user inference. How can we support concurrency when multiple users send requests at the same time? Instantiating a separate generator for each user seems impractical. Batching the inputs is the obvious approach, but how can the second user's prompt be incorporated while streaming inference for the first user is already in progress?

Hi @CharlinChen, do you need to stream the output for each input in the batch? If not, you can batch multiple prompts and run them without streaming. You can see an example here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/model-generate.py
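
For reference, a minimal sketch of batched, non-streaming generation along the lines of the linked model-generate.py example. The model path and prompts are placeholders, and the exact onnxruntime-genai API calls vary between releases, so treat the parameter names and methods as illustrative:

```python
import onnxruntime_genai as og

# Load the model and tokenizer once, then reuse them for every batch.
model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

prompts = [
    "What is the capital of France?",
    "Explain beam search in one sentence.",
]

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
# Encode all prompts together so they run as one padded batch.
params.input_ids = tokenizer.encode_batch(prompts)

# Generate complete sequences for the whole batch in one call (no streaming).
output_tokens = model.generate(params)
for i, prompt in enumerate(prompts):
    print(f"Prompt {i}: {prompt}")
    print(tokenizer.decode(output_tokens[i]))
```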

@natke Yes, I do need to handle the second user's request while the first user's output is still streaming, because it is impractical to keep multiple users waiting so that their prompts can be gathered into the same batch. In theory it seems feasible to interrupt the first user's generation when the second request arrives, treat the tokens generated so far as part of the first user's prompt, merge the second user's request into a new batched input, and continue inference from there. A rough sketch of that idea is below.
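
To make that concrete, here is a sketch of the interrupt-and-rebatch idea. This is not a built-in feature of onnxruntime-genai: the helper name is hypothetical, and the streaming loop follows the pattern used in phi3-qa.py, which differs between releases (older versions also require a compute_logits call each step):

```python
import onnxruntime_genai as og

def rebatch_with_new_request(model, tokenizer, first_prompt, first_tokens_so_far,
                             second_prompt, max_length=512):
    """Hypothetical helper: stop the first user's single-prompt run and start a
    new batched run that carries over the tokens already generated."""
    # The first user's "prompt" becomes the original prompt plus the partial
    # output, so the batched run resumes roughly where the single run stopped.
    first_text = first_prompt + tokenizer.decode(first_tokens_so_far)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode_batch([first_text, second_prompt])
    return og.Generator(model, params)

# Streaming loop over the new batch (illustrative; exact calls depend on the
# installed release):
# generator = rebatch_with_new_request(...)
# while not generator.is_done():
#     generator.generate_next_token()
#     tokens = generator.get_next_tokens()  # one new token per sequence in the batch
```

Note that this workaround discards the first run's KV cache and recomputes the first user's prefix from scratch when the batch is rebuilt, which is its main cost; avoiding that recomputation would need scheduling support inside the runtime itself.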