triton-inference-server / openvino_backend

OpenVINO backend for Triton.

Fully-working example with dynamic batching

mbahri opened this issue

Hi

Thanks a lot for providing this backend. I have tried to use it and I have had some trouble getting Triton to load and run my OpenVINO models.

I found that the backend correctly attempts to load models only if the files are named model.bin and model.xml; in other cases it throws an exception. However, the main issue for me now is using dynamic batching.

It would be very helpful if you could provide a fully working example of how to configure dynamic batching, with values for the different parameters that need to be set.

Related question: the backend doesn't support dynamic axes and one of the parameters mentioned for dynamic batching is about padding batches. Does this mean the backend will pad batches to the max batch size for now?

Once PR #72 is merged, it will be possible to use models with dynamic shapes. Note that with a dynamic shape on the model input, you don't need to use dynamic batching.
If you want to use an arbitrary batch size or image resolution, you will be able to do so with a model shape like [-1,-1,-1,3].
If your goal is to improve throughput, you can use multiple instances with parallel execution (check the throughput mode example).
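For reference, a minimal config.pbtxt sketch along these lines could look as follows (the model name, tensor names, data types, and output shape are placeholders and depend on your model):

```
# Hypothetical model configuration: fully dynamic NHWC input, no dynamic batching.
name: "my_openvino_model"        # placeholder model name
backend: "openvino"
max_batch_size: 0                # 0 disables Triton batching; the batch dim is part of dims
input [
  {
    name: "input"                # placeholder; must match the model's input tensor name
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]      # dynamic batch, height, and width; 3 channels (NHWC)
  }
]
output [
  {
    name: "output"               # placeholder output tensor name
    data_type: TYPE_FP32
    dims: [ -1, -1 ]             # placeholder output shape
  }
]
```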

Hi, thanks for your reply. I think I might be a bit confused here, but could you explain why I don't need to use dynamic batching?

The way I thought it worked was that with dynamic batching enabled, Triton waits a predefined amount of time to group requests together in a batch, which would mean batch size could be 3, then 1, then 5, etc.

When using dynamic batching with other backends like ONNX, I've needed to set the input dimensions to, for example, [3, 224, 224] in the model configuration and have the model itself accept [-1, 3, 224, 224].
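For context, the kind of configuration I mean looks roughly like this (the names and values are just illustrative):

```
# Illustrative ONNX Runtime backend configuration with dynamic batching (placeholder names/values).
name: "my_onnx_model"
backend: "onnxruntime"
max_batch_size: 8                # Triton adds the batch dimension; the model accepts [-1, 3, 224, 224]
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]        # per-request shape, without the batch dimension
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]               # placeholder output size
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```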

Does it work differently with OpenVINO?

I've used parallel model execution in combination with dynamic batching with ONNX before and needed to tune the number of threads each model instance could use to avoid overloading the CPU. Is it done differently with OpenVINO?

@mbahri You could use dynamic batching, but it will not be optimally efficient: it will still use batch padding. You can expect better throughput by using parallel execution with a multi-instance configuration and setting the NUM_STREAMS parameter. That way you will not observe CPU overloading, since NUM_STREAMS handles thread management for parallel execution.
To sum up, with the PR I mentioned you will be able to deploy models with shape [-1, 3, 224, 224] or [-1, 3, -1, -1]. If you want to improve throughput with parallel execution from many clients, I recommend using several instances together with a matching NUM_STREAMS value.
Batch padding will probably be dropped later, but a similar throughput gain is already expected from parallel execution.
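A rough sketch of such a configuration could look like this (the model name, tensor details, and concrete numbers are only illustrative; the main point is that the instance count and NUM_STREAMS match):

```
# Hypothetical throughput-oriented configuration: 4 CPU instances with NUM_STREAMS = 4.
name: "my_openvino_model"
backend: "openvino"
max_batch_size: 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, -1, -1 ]      # dynamic batch and spatial dims (after PR #72)
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 4                     # number of parallel model instances
    kind: KIND_CPU
  }
]
parameters: {
  key: "NUM_STREAMS"
  value: { string_value: "4" }   # matches the instance count, per the recommendation above
}
```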

Thanks @dtrawins. So to confirm: with parallel model execution and NUM_STREAMS set, I would just use a batch size of 1 for each model instance?