isarandi / metrabs

Estimate absolute 3D human poses from RGB images.

Home Page: https://arxiv.org/abs/2007.07227


Metrabs TensorRT

tobibaum opened this issue · comments

Hey István,

congrats on these great results and thanks for providing an easy-to-use way to run your models, exceptional work :)
I really like the results I get and, just like everyone else in the issues, I would like to run it in real time.
My approach was to squeeze out some speed-ups using TensorRT and its new TF-TRT capability. At least for the ResNet-style models, I'd expect a speed-up on the order of 10x, and according to Nvidia the same should hold for EfficientNet-type models.

A TensorFlow SavedModel can be optimized and converted into a TensorRT model directly, using just a few lines of code:

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# convert the re-signed SavedModel (see below for how the signature is added)
converter = trt.TrtGraphConverterV2(input_saved_model_dir='models/eff2s_y4_short_sig')
converter.convert()
converter.save('models/eff2s_y4_trt')

In order for this conversion to know what to do, a default signature needs to be defined.
This can be achieved with the following:

import tensorflow as tf

model_folder = 'models/metrabs_eff2s_y4/'
out_fold = 'models/eff2s_y4_short_sig'
model = tf.saved_model.load(model_folder)

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    # run the packaged detection + pose pipeline and expose only the 3D poses
    prediction = model.detect_poses(my_prediction_inputs)
    return {"prediction": prediction['poses3d']}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, None, 3], dtype=tf.dtypes.uint8, name="image"))

tf.saved_model.save(model, out_fold, signatures=my_signatures)
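
As a quick sanity check, the re-exported signature can be called back like this (a minimal sketch; the dummy image size is arbitrary, since the spec leaves height and width open):

loaded = tf.saved_model.load('models/eff2s_y4_short_sig')
infer = loaded.signatures['serving_default']  # the single signature saved above

dummy = tf.zeros([720, 1280, 3], dtype=tf.uint8)  # any HxWx3 uint8 image
out = infer(image=dummy)
print(out['prediction'].shape)  # 3D poses of the detected people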

(Coincidentally, this might also be a solution to the TensorFlow Lite question in the issues? I haven't tried it, just a hunch.)
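
For anyone who wants to test that hunch, a minimal sketch of what the TFLite conversion might look like, starting from the re-signed SavedModel above (untested; the model will likely need the SELECT_TF_OPS fallback for ops TFLite doesn't have built in):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('models/eff2s_y4_short_sig')
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # regular TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to full TF ops where needed
]
tflite_model = converter.convert()
with open('models/eff2s_y4.tflite', 'wb') as f:
    f.write(tflite_model)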

Unfortunately, the conversion segfaults :D I know this is rather an issue on Nvidia's side, but maybe we can still get it to work. I suspect that the augmentations you perform on the model in the Packaging Model section of your readme might be throwing TF-TRT off.
Next, I tried to investigate this issue a little further by looking under the hood of the packaged SavedModel. I used
TensorFlow's import_pb_to_tensorboard.py and tried to inspect the result in TensorBoard.

$ python import_pb_to_tensorboard.py --model_dir models/eff2s_y4_short_sig/saved_model.pb --log_dir log
$ tensorboard --logdir log

Unfortunately, TensorBoard was again not able to display the computation graph; I suspect the reason is once more the use of tf.functions, but I am not sure.
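
A lighter-weight way to at least inspect the exported signatures (if not the full graph) is TensorFlow's saved_model_cli:

$ saved_model_cli show --dir models/eff2s_y4_short_sig --tag_set serve --signature_def serving_default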

What I would like to try is to convert one of your trained metrabs models into TensorRT and look at the speed-up. Would it be possible for you to share a checkpoint file, or the un-augmented SavedModel as exported here: https://github.com/isarandi/metrabs/blob/master/src/main.py#L242 ? Maybe for metrabs_eff2l_y4, metrabs_eff2s_y4, metrabs_rn152_y4, and metrabs_rn18_y4, to compare how backbone type and depth affect inference time?

I ran experiments on an RTX 3090 and timed the following on a video with one person running on a treadmill (1000 frames = 33 s).
Just like in issue #25, here are some current timings for varying batch sizes:

backbone  det  batchsize  load  full_time  tfirst (per batch)  tmean (per batch)  t/sample
eff2l y4 1 79.35 82.27 33221.14 149.97 149.97
eff2l y4 8 73.74 108.03 31397.86 279.74 34.97
eff2l y4 16 73.74 75.5 466.06 480.87 30.05
eff2l y4 32 79.21 82.64 938.29 885.90 27.68
eff2l y4 64 79.21 93.92 1853.03 1801.76 28.15
eff2l y4 128 79.21 129.32 36806.45 3976.91 31.07
eff2l_360 y4 1 72.96 78.12 29512.30 148.57 148.57
eff2l_360 y4 8 67.72 102.14 29740.10 293.71 36.71
eff2l_360 y4 16 67.72 73.92 444.92 493.19 30.82
eff2l_360 y4 32 70.2 85.18 955.17 910.69 28.46
eff2l_360 y4 64 70.2 90.43 1857.24 1856.95 29.01
eff2l_360 y4 128 70.2 121.25 33749.48 4055.34 31.68
eff2m y4 1 57.41 69.7 22034.85 129.54 129.54
eff2m y4 8 69.28 102.92 22930.01 345.51 43.19
eff2m y4 16 69.28 73.59 371.38 391.84 24.49
eff2m y4 32 75.41 79.76 804.21 717.96 22.44
eff2m y4 64 75.41 90.09 1455.46 1485.78 23.22
eff2m y4 128 75.41 115.63 23679.09 3401.27 26.57
eff2s y4 1 39.24 64.93 16928.48 119.87 119.87
eff2s y4 8 37.01 87.53 17234.22 208.43 26.05
eff2s y4 8 46.17 39025.55 210.69 26.34
eff2s y4 16 37.01 70.84 444.42 341.08 21.32
eff2s y4 32 42.79 77.82 863.95 642.54 20.08
eff2s y4 64 42.79 89.46 1614.69 1343.52 20.99
eff2s y4 128 42.79 113.12 26511.61 2976.25 23.25
mob3l y4 1 28.05 78.2 21949.89 95.03 95.03
mob3l y4 8 32.06 86.01 19962.75 165.93 20.74
mob3l y4 16 32.06 81.02 244.84 260.80 16.30
mob3l y4 32 27.17 79.88 597.45 498.05 15.56
mob3l y4 64 27.17 91.02 1159.92 1111.80 17.37
mob3l y4 128 27.17 118.33 22417.67 2577.09 20.13
mob3l y4t 1 21.57 47.48 9542.37 52.28 52.28
mob3l y4t 8 21.03 72.27 9644.42 131.21 16.40
mob3l y4t 16 21.03 66.23 190.44 169.23 10.58
mob3l y4t 32 21.61 71.97 392.98 350.23 10.94
mob3l y4t 64 21.61 82.26 676.51 717.93 11.22
mob3l y4t 128 21.61 90.92 11668.84 1856.07 14.50
mob3s y4 1 24.16 52083.48 10969.11 10969.11
mob3s y4 8 23.7 99.66 22361.51 158.68 19.83
mob3s y4 16 23.7 76.41 253.11 260.37 16.27
mob3s y4 32 23.7 81.05 510.25 498.53 15.58
mob3s y4t 1 15.92 51.66 15179.46 50.78 50.78
mob3s y4t 8 15.84 82.27 14242.68 124.19 15.52
mob3s y4t 16 15.84 60.38 160.62 173.93 10.87
mob3s y4t 32 15.74 73.8 333.28 337.94 10.56
mob3s y4t 64 15.74 83.16 606.98 718.15 11.22
mob3s y4t 128 15.74 100.41 16325.95 1875.85 14.66
rn101 y4 1 45.55 69.32 23703.08 113.88 113.88
rn101 y4 8 42.32 94.27 23748.03 220.17 27.52
rn101 y4 16 42.32 73.14 315.59 354.65 22.17
rn101 y4 32 41.79 79.59 724.29 651.06 20.35
rn101 y4 64 41.79 89.47 1337.07 1321.07 20.64
rn101 y4 128 41.79 113.91 27209.60 3061.64 23.92
rn152 y4 1 56.9 77.51 30653.02 127.11 127.11
rn152 y4 8 54.83 106.07 30717.84 225.73 28.22
rn152 y4 16 54.83 75.51 361.81 376.26 23.52
rn152 y4 32 54.83 72.16 1059.79 694.71 21.71
rn18 y4 1 21.07 75.61 21613.34 90.91 90.91
rn18 y4 8 20.3 93.12 21005.99 157.36 19.67
rn18 y4 16 20.3 73.78 246.42 256.46 16.03
rn18 y4 32 20.24 80.91 531.65 489.85 15.31
rn18 y4 64 20.24 91.2 1011.25 1059.32 16.55
rn18 y4 128 20.24 113.03 23795.51 2535.79 19.81
rn34 y4 1 28.04 64.76 20548.91 97.98 97.98
rn34 y4 8 28.28 94.69 19514.33 171.81 21.48
rn34 y4 16 28.28 72.13 250.87 270.24 16.89
rn34 y4 32 28.37 78.21 587.16 526.36 16.45
rn34 y4 64 28.37 93.74 1143.86 1150.37 17.97
rn34 y4 128 28.37 112.3 21840.20 2724.95 21.29
rn50 y4 1 36.97 69.86 25330.24 99.48 99.48
rn50 y4 8 35.58 89.12 23896.04 179.18 22.40
rn50 y4 16 35.58 80.06 289.07 299.34 18.71
rn50 y4 32 35.58 78.47 605.76 567.36

(full_time and load in s, rest in ms)

As expected, batching up the computations leads to a significant speed-up. Unfortunately, batching is not feasible for low-latency real-time processing. The fastest model at batch size 1 was mob3s_y4t with ~50 ms per frame. I would like to get below 30 ms, or even 15 ms, using TensorRT.
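
Roughly, the per-sample latency at batch size 1 can be measured with a loop like this (a sketch, not my exact benchmark script; `frames` is assumed to be a list of uint8 HxWx3 video frames):

import time
import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')

# warm-up: the first call includes tracing and autotuning, hence the huge tfirst values above
_ = model.detect_poses(frames[0])

latencies = []
for frame in frames[1:]:
    t0 = time.perf_counter()
    _ = model.detect_poses(frame)
    latencies.append((time.perf_counter() - t0) * 1000.0)
print(f'mean per-frame latency: {sum(latencies) / len(latencies):.2f} ms')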

What do you think? Is this a good avenue to go down, or should I try something other than TensorRT?

Thanks!
Tobi

Thanks a lot for the detailed analysis! I will try to come back to this soon, but meanwhile what you can do is perhaps extract out the raw model from the convenient "packaged" one, and try to make that run under TensorRT. If that works, we can look at which part of the surrounding code is the culprit.

So you can try

model = tf.saved_model.load(...)
tf.saved_model.save(model.crop_model, 'somepath')

Then try to work with this newly saved model. You can check the interface of this resulting model at the API readme.

Part of the story may be that I saved the full model with options=tf.saved_model.SaveOptions(experimental_custom_gradients=True) or perhaps something with tf.raw_ops.ImageProjectiveTransformV3.
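
If a plain re-save is not enough for TF-TRT, a sketch of attaching an explicit serving signature to the extracted crop model might look like this (the exact input shape, dtype and calling convention of crop_model should be checked against the API readme; 256x256 float32 crops and a direct call are assumptions here):

import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')
crop_model = model.crop_model

# assumes crop_model can be called directly on a batch of 256x256 person crops
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], tf.float32, name='crops')])
def serve(crops):
    return crop_model(crops)

tf.saved_model.save(crop_model, 'models/eff2s_y4_crop_raw', signatures=serve)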

@tobibaum, have you figured out how to get real-time FPS?
I run it on Ubuntu 18 + a 2080 Ti + TF 2.6 + CUDA 11.2 + cuDNN 8.1 with model.estimate_poses and get about 10 FPS.
I will now switch to a 3090 GPU and test it on Windows 10 and Ubuntu 18.
Were you successful with TensorRT? If so, how?

Can you make a demo.py showing how to use the extracted crop_model? I have seen the API readme, but it is too brief.

@isarandi thanks for the great pointers! My current approach is to build the backbone model (effnet-l) using your original training code and then copy over the trained weights from the SavedModels in your model zoo.
With that, I am able to create an ONNX version of the backbone and compare speed-ups in C++ using tiny-tensorrt.

I loosely follow the nvidia-tutorial to do so:

  1. Load the trained model weights:

import tensorflow as tf

model_folder = 'models/metrabs_eff2s_y4/'  # SavedModel from the model zoo, as above
model = tf.saved_model.load(model_folder)
vars = model.crop_model.variables

  2. Create a blank effnet model:

from backbones.efficientnet.effnetv2_model import *
import backbones.efficientnet.effnetv2_utils as effnet_util
import tfu
effnet_util.set_batchnorm(effnet_util.BatchNormalization)
tfu.set_data_format('NHWC')
tfu.set_dtype(tf.float16)
mod_met = get_model('efficientnetv2-s', include_top=False, pretrained=False, with_endpoints=False)

  3. Copy over the trained weights:

new_vars = mod_met.variables
var_dict = {v.name: [v, i] for i, v in enumerate(vars)}
var_dict_new = {v.name: [v, i] for i, v in enumerate(new_vars)}
inds = [var_dict[k][1] for k in var_dict_new.keys() if k in var_dict]
print(len(var_dict_new))
print(len(inds))

# sanity check: which variables exist in one model but not the other?
missing_keys = set(var_dict.keys()) - set(var_dict_new.keys())
rev_missing_keys = set(var_dict_new.keys()) - set(var_dict.keys())

print(missing_keys)
print(rev_missing_keys)
for m in missing_keys:
    d = var_dict[m][0]
    print(d.name, d.shape)
pick_vars = [vars[i] for i in inds]
print(len(pick_vars))

mod_met.set_weights(pick_vars)

  4. Save the model with a proper signature:

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    prediction = mod_met([my_prediction_inputs], training=False)
    return {"prediction": prediction}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([256, 256, 3], dtype=tf.dtypes.float32, name="image")
)

out_fold = 'effnet_raw_sig'  # matches the tf2onnx command below
tf.saved_model.save(mod_met, out_fold, signatures=my_signatures)

  5. Convert to ONNX (install tf2onnx first):

$ python -m tf2onnx.convert --saved-model effnet_raw_sig --output effnet.onnx

  6. Optimize the runtime plan according to the nvidia-tutorial (see the sketch after this list).

  7. Run inference in C++ with tiny-tensorrt.
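
For step 6, one way to build the runtime plan from the exported ONNX file is Nvidia's trtexec tool (a sketch; flags may differ between TensorRT versions, and --fp16 is optional):

$ trtexec --onnx=effnet.onnx --saveEngine=effnet.plan --fp16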

@gao123qiang I will update this issue once I have had some more success on the full pipeline. The above is just my current approach to determine whether the EfficientNet backbone can be sped up sufficiently with TensorRT. From my preliminary tests (no guarantees), I get the following timings with the C++ API for just the EfficientNet backbone (256x256x3 -> 1x8x8x1280):
tensorflow: 25~30 ms per image
tensorrt: 3~5 ms per image

There is of course some more overhead from image preprocessing and output post-processing (plus running the metrabs head), but overall I think this looks promising.

I will update the issue once I get a real-time system running (or abandon the approach).
Please feel free to share any thoughts or experiments of your own for speeding this whole thing up :)

Thank you, I will test it.
Looking forward to your update!

@tobibaum,
I have seen the nvidia-tutorial and the steps you provided;
the main flow is: SavedModel --> .pb --> .onnx.
Is the mod_met in step 2 keras.models.clone_model(model).crop_model?
In step 4, the input shape is 256x256x3; can I change the size?

Hey @gao123qiang ,

if I understand correctly, the backbones of the overall models are trained on 256x256 patches. Since they are not fully convolutional nets, they depend on the input being of that same size. In the packaging section of the readme, the author first describes how the metrabs core model is packaged to run on 256x256 images.

Also, you cannot feed your raw images into this; you first need to run a detector to determine the locations of people in your scene and normalize the input (roughly illustrated in the sketch below).
My above comments were a guideline on how to dig into the metrabs model and investigate potential speed-ups. You will not get reasonable results by following my steps alone.
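
To roughly illustrate what "run a detector first" means (this is NOT the packaged model's exact preprocessing, which also handles camera intrinsics and color/scale normalization; boxes are assumed to come from your detector, normalized to [0, 1] as [y1, x1, y2, x2]):

import tensorflow as tf

def crop_people(image_uint8, boxes):
    # one input image plus a batch of detector boxes -> a batch of 256x256 person crops
    image = tf.image.convert_image_dtype(image_uint8, tf.float32)[tf.newaxis]  # [1, H, W, 3]
    box_indices = tf.zeros([tf.shape(boxes)[0]], dtype=tf.int32)  # all boxes refer to image 0
    crops = tf.image.crop_and_resize(image, boxes, box_indices, crop_size=[256, 256])
    return crops  # [num_people, 256, 256, 3], the size the backbone was trained on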

@tobibaum,
your approach looks very promising. Is there any progress?

I finally have a working version of my approach. I had to perform some major surgery on the provided models, but I confirmed that the results stay approximately the same, just at higher speeds (I use a 3rd-party YOLO, so the crops and everything downstream differ slightly). Here's what I did:

Split the saved_model into:

  • detection (accelerate with TensorRT)
  • backbone (accelerate with TensorRT)
  • metrabs head (run as is)

I was then able to convert the first two into TensorRT models. The metrabs head has some operations in it that TensorRT does not like, but since it is just the last layer and some function wrappers, it runs very fast in TensorFlow using the C API.
I then wrapped these three models in C++ code and got the following timings on an RTX 2070:

model fps
efficientnet-l 37
efficientnet-s 50
resnet50 58
resnet152 45

check it out:
https://github.com/tobibaum/metrabs_trt

@isarandi it would be great if you could have a look over my approach and check whether I made any breaking mistakes. Thanks!!

good job

@tobibaum Thank you very much for your great how-to!

With your help, I was able to generate the ONNX and plan file of the backbone. So far, I have not done the last step (Compile Your C++ Version); I would prefer a Python version, because otherwise I have to build a cpp extension for that. For my Python version I used the engine and inference code of the Nvidia tutorial. But now I get the error

[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7518, GPU 9372 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 7518, GPU 9382 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +27, now: CPU 0, GPU 248 (MiB)
[01/06/2022-14:52:12] [TRT] [E] 1: Unexpected exception 

CUDA, TensorRT and all other needed libraries are natively installed on my system. The output data consists entirely of zeros, and there is no error in the Python code. Do you have experience with that?
Does this approach also work if I set a dynamic input size in the signature? In my use case I have to change the batch size dynamically at runtime (roughly what I sketch below). Sorry for these questions, but TensorRT is completely new to me.
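
To clarify what I mean by a dynamic input size: the idea would be to export the signature with an unspecified batch dimension and then build the engine with an explicit shape range, roughly like this (unverified; the variable and input names are taken from the steps above and may differ in practice):

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    # batch dimension left dynamic instead of wrapping a single image
    return {"prediction": mod_met(my_prediction_inputs, training=False)}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, 256, 256, 3], dtype=tf.dtypes.float32, name="image"))

# then, when building the plan, something like:
# trtexec --onnx=effnet.onnx --saveEngine=effnet.plan \
#         --minShapes=image:1x256x256x3 --optShapes=image:8x256x256x3 --maxShapes=image:32x256x256x3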

As a next step, I will try your C++ version. Maybe that works for me.

Hey @Basti110 ,

unfortunately I cannot tell what the error might be here. Could you try running the Nvidia inference engine with their models, to pinpoint whether the problem is in the setup or in the compiled model?

Cheers!

Hey,

I think it is only a problem in my Python environment. The C++ version with tiny-tensorrt works! Thank you.

Thanks for testing my implementation :)