isarandi / metrabs

Estimate absolute 3D human poses from RGB images.

Home Page: https://arxiv.org/abs/2007.07227

Individual preprocessing of each phase and different performance when running with TensorRT

steb6 opened this issue

Hi! Thank you for your great work.
Since I am trying to use it in a robotics project, I need to run it in real time.
As done in issue #26, I extracted the BackBone and the Heads from the eff2l model, and I run everything as follows:

  • YoloV4 with TensorRT
  • BackBone with TensorRT
  • Heads with Tensorflow

Now I have to combine all the pieces, but I can't figure out from the code how to preprocess the BackBone input. I have tried the following:

  • Crop the bounding box of the human from the original frame and resize it to 256x256 (as written in the paper, but this gives the worst performance)
  • Set all pixels outside the box to 0 (medium performance)
  • Extract the bounding box, pad it along the two sides of the shorter dimension so that the human is centered in the image, then resize to 256x256 (best performance; sketched at the end of this message)

but since I am obtaining worse performance than with the original model (the skeleton is less accurate and noisier), I wanted to know whether this happens because of the conversion to ONNX or because I am doing something wrong.
Thank you again!
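
For reference, here is roughly what I do for the third variant (a minimal sketch; the (x, y, w, h) box format and the OpenCV calls are my own assumptions, not from the repo):

```python
import cv2
import numpy as np

def crop_pad_resize(frame, box, out_size=256):
    """Crop the person box, zero-pad the shorter side so the person
    stays centered, then resize to out_size x out_size (third option).
    Assumes the box lies fully inside the frame."""
    x, y, w, h = [int(v) for v in box]
    crop = frame[y:y + h, x:x + w]
    side = max(h, w)
    pad_y, pad_x = (side - h) // 2, (side - w) // 2
    padded = cv2.copyMakeBorder(
        crop, pad_y, side - h - pad_y, pad_x, side - w - pad_x,
        cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(padded, (out_size, out_size))
```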

Hi Stefano, you'd need to reproject the image with an appropriate homography so that the virtual camera looks straight at the target person (with the principal point at the center of the 256x256 crop), then rotate the result back accordingly. You can check how it's done in the code of the released model.
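
Roughly, the idea looks like this (a minimal sketch of the concept, not the code of the released model; the look-at construction and the focal-length heuristic here are simplified):

```python
import cv2
import numpy as np

def make_virtual_camera(K, box_center, box_size, out_size=256):
    """Intrinsics and rotation of a virtual camera that looks straight
    at the person, with the principal point at the crop center."""
    # Ray through the person's box center in the original camera frame
    d = np.linalg.inv(K) @ np.array([box_center[0], box_center[1], 1.0])
    z = d / np.linalg.norm(d)                  # new optical axis
    x = np.cross([0.0, 1.0, 0.0], z)           # keep the crop roughly upright
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                    # original cam -> virtual cam

    # Heuristic virtual focal length so the box roughly fills the crop
    f_new = K[0, 0] * out_size / box_size
    K_new = np.array([[f_new, 0.0, out_size / 2],
                      [0.0, f_new, out_size / 2],
                      [0.0, 0.0, 1.0]])
    return K_new, R

def reproject_crop(image, K, K_new, R, out_size=256):
    H = K_new @ R @ np.linalg.inv(K)           # pixel-to-pixel homography
    return cv2.warpPerspective(image, H, (out_size, out_size))

# The predicted 3D poses live in the virtual camera frame; rotate them
# back into the original camera frame with R.T:
#   poses_orig = poses_virtual @ R   # for (..., 3) row-vector points
```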

Thank you! I finally managed to integrate the homography and now it works perfectly. It runs at about 15 fps in Python, and I still have to implement the pipeline 🚀

Hey,
do you have some timings?

"save_multiperson_model" is the main model which does the homgraphy and passes the data to the metrabs model. Additionaly it takes 5 slighlty diffrent views of the image. At the end all results are merged together. "num_aug" is the number of rotated views per image. default is 5.

My goal is to run the metrabs model with a batch size of 4 in ~40 ms. In my case, the efficientnetv2-m TensorRT model plus the metrabs heads runs in ~20-25 ms with a batch size of 20. The backbone batch size is 20 when you predict the poses in 4 images with num_aug=5 in "save_multiperson_model" (5 slightly different rotated views per image). I wrote a python/c++ extension for TensorRT and wrapped it with a tf.py_function inside "save_multiperson_model" (I hope that's right, TensorFlow is new to me; see the sketch below). The preprocessing in "save_multiperson_model" with the homography also needs ~20-25 ms. Therefore, the complete model needs 40-50 ms on my system; before it was 50-60 ms. But I find it a bit odd that the preprocessing step takes as long as the backbone plus the heads.
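
The wrapping looks roughly like this (a sketch; `trt_infer` is a placeholder for my extension, not a real API, and the feature shape is just an example):

```python
import numpy as np
import tensorflow as tf

def trt_infer(batch: np.ndarray) -> np.ndarray:
    # Placeholder for the python/c++ TensorRT extension: run the
    # serialized efficientnetv2-m engine and return float32 features.
    raise NotImplementedError

def backbone_trt(images):
    """Wrap the TensorRT call so it can be used inside the TF model.
    tf.py_function runs eagerly outside the graph, so it adds
    host<->device copies and blocks graph-level optimizations."""
    features = tf.py_function(
        func=lambda x: trt_infer(x.numpy()), inp=[images], Tout=tf.float32)
    # py_function loses static shape info; restore what we know
    # (example shape, the real one depends on the backbone)
    features.set_shape([None, 8, 8, 1280])
    return features
```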

Summary: compared to TensorFlow, the TensorRT backbone is 1.5-2x faster, but the preprocessing takes the same time, so the overall difference is only ~10 ms for me.