isarandi / metrabs

Estimate absolute 3D human poses from RGB images.

Home Page: https://arxiv.org/abs/2007.07227


Metrabs TensorRT

tobibaum opened this issue · comments

Hey István,

congrats on these great results and thanks for providing an easy-to-use way to run your models, exceptional work :)
I really like the results I get and, just like everyone else in the issues, I would like to run it in real time.
My approach was to squeeze out some speed-ups using TensorRT and its new TF-TRT capability. At least for the ResNet-style models, I'd expect a speed-up on the order of 10x, and according to Nvidia the same should hold for EfficientNet-type models.

A TensorFlow SavedModel can be optimized and converted into a TensorRT model directly, using just a few lines of code:

import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# convert the re-signed SavedModel (see below for how the signature is added)
converter = trt.TrtGraphConverterV2(input_saved_model_dir='models/eff2s_y4_short_sig')
converter.convert()
converter.save('models/eff2s_y4_trt')

In order for this conversion to know what to do, a default signature needs to be defined.
This can be achieved with the following:

import tensorflow as tf

model_folder = 'models/metrabs_eff2s_y4/'
out_fold = 'models/eff2s_y4_short_sig'
model = tf.saved_model.load(model_folder)

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    # run the packaged detection + pose pipeline and expose only the 3D poses
    prediction = model.detect_poses(my_prediction_inputs)
    return {"prediction": prediction['poses3d']}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, None, 3], dtype=tf.dtypes.uint8, name="image"))

tf.saved_model.save(model, out_fold, signatures=my_signatures)
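
As a quick sanity check, the re-exported signature can be called back like this (a minimal sketch; the dummy image size is arbitrary, since the spec leaves height and width open):

loaded = tf.saved_model.load('models/eff2s_y4_short_sig')
infer = loaded.signatures['serving_default']  # the single signature saved above

dummy = tf.zeros([720, 1280, 3], dtype=tf.uint8)  # any HxWx3 uint8 image
out = infer(image=dummy)
print(out['prediction'].shape)  # 3D poses of the detected people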

(Coincidentally, this might also be a solution to the TensorFlow Lite question in the issues? I haven't tried it, just a hunch.)
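
For anyone who wants to test that hunch, a minimal sketch of what the TFLite conversion might look like, starting from the re-signed SavedModel above (untested; the model will likely need the SELECT_TF_OPS fallback for ops TFLite doesn't have built in):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('models/eff2s_y4_short_sig')
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # regular TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to full TF ops where needed
]
tflite_model = converter.convert()
with open('models/eff2s_y4.tflite', 'wb') as f:
    f.write(tflite_model)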

Unfortunately, the conversion segfaults :D I know this is rather an issue on Nvidia's side, but maybe we can still get it to work. I suspect that the augmentations you perform on the model in the Packaging Model section of your readme might be throwing TF-TRT off.
Next, I tried to investigate this issue a little further by looking under the hood of the packaged SavedModel. I used
TensorFlow's import_pb_to_tensorboard.py and tried to inspect the result in TensorBoard.

$ python import_pb_to_tensorboard.py --model_dir models/eff2s_y4_short_sig/saved_model.pb --log_dir log
$ tensorboard --logdir log

Unfortunately, TensorBoard was again not able to display the computation graph; I suspect the reason is once more the use of tf.functions, but I am not sure.
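
A lighter-weight way to at least inspect the exported signatures (if not the full graph) is TensorFlow's saved_model_cli:

$ saved_model_cli show --dir models/eff2s_y4_short_sig --tag_set serve --signature_def serving_default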

What I would like to try is to convert one of your trained metrabs models into TensorRT and look at the speed-up. Would it be possible for you to share a checkpoint file, or the un-augmented SavedModel as exported here: https://github.com/isarandi/metrabs/blob/master/src/main.py#L242 ? Maybe for metrabs_eff2l_y4, metrabs_eff2s_y4, metrabs_rn152_y4, and metrabs_rn18_y4, to compare how backbone type and depth affect inference time?

I ran experiments on an RTX 3090 and timed the following on a video with one person running on a treadmill (1000 frames = 33 s).
Just like in issue #25, here are some current timings for varying batch sizes:

backbone  det  batchsize  load  full_time  tfirst (per batch)  tmean (per batch)  t/sample
eff2l y4 1 79.35 82.27 33221.14 149.97 149.97
eff2l y4 8 73.74 108.03 31397.86 279.74 34.97
eff2l y4 16 73.74 75.5 466.06 480.87 30.05
eff2l y4 32 79.21 82.64 938.29 885.90 27.68
eff2l y4 64 79.21 93.92 1853.03 1801.76 28.15
eff2l y4 128 79.21 129.32 36806.45 3976.91 31.07
eff2l_360 y4 1 72.96 78.12 29512.30 148.57 148.57
eff2l_360 y4 8 67.72 102.14 29740.10 293.71 36.71
eff2l_360 y4 16 67.72 73.92 444.92 493.19 30.82
eff2l_360 y4 32 70.2 85.18 955.17 910.69 28.46
eff2l_360 y4 64 70.2 90.43 1857.24 1856.95 29.01
eff2l_360 y4 128 70.2 121.25 33749.48 4055.34 31.68
eff2m y4 1 57.41 69.7 22034.85 129.54 129.54
eff2m y4 8 69.28 102.92 22930.01 345.51 43.19
eff2m y4 16 69.28 73.59 371.38 391.84 24.49
eff2m y4 32 75.41 79.76 804.21 717.96 22.44
eff2m y4 64 75.41 90.09 1455.46 1485.78 23.22
eff2m y4 128 75.41 115.63 23679.09 3401.27 26.57
eff2s y4 1 39.24 64.93 16928.48 119.87 119.87
eff2s y4 8 37.01 87.53 17234.22 208.43 26.05
eff2s y4 8 46.17 39025.55 210.69 26.34
eff2s y4 16 37.01 70.84 444.42 341.08 21.32
eff2s y4 32 42.79 77.82 863.95 642.54 20.08
eff2s y4 64 42.79 89.46 1614.69 1343.52 20.99
eff2s y4 128 42.79 113.12 26511.61 2976.25 23.25
mob3l y4 1 28.05 78.2 21949.89 95.03 95.03
mob3l y4 8 32.06 86.01 19962.75 165.93 20.74
mob3l y4 16 32.06 81.02 244.84 260.80 16.30
mob3l y4 32 27.17 79.88 597.45 498.05 15.56
mob3l y4 64 27.17 91.02 1159.92 1111.80 17.37
mob3l y4 128 27.17 118.33 22417.67 2577.09 20.13
mob3l y4t 1 21.57 47.48 9542.37 52.28 52.28
mob3l y4t 8 21.03 72.27 9644.42 131.21 16.40
mob3l y4t 16 21.03 66.23 190.44 169.23 10.58
mob3l y4t 32 21.61 71.97 392.98 350.23 10.94
mob3l y4t 64 21.61 82.26 676.51 717.93 11.22
mob3l y4t 128 21.61 90.92 11668.84 1856.07 14.50
mob3s y4 1 24.16 52083.48 10969.11 10969.11
mob3s y4 8 23.7 99.66 22361.51 158.68 19.83
mob3s y4 16 23.7 76.41 253.11 260.37 16.27
mob3s y4 32 23.7 81.05 510.25 498.53 15.58
mob3s y4t 1 15.92 51.66 15179.46 50.78 50.78
mob3s y4t 8 15.84 82.27 14242.68 124.19 15.52
mob3s y4t 16 15.84 60.38 160.62 173.93 10.87
mob3s y4t 32 15.74 73.8 333.28 337.94 10.56
mob3s y4t 64 15.74 83.16 606.98 718.15 11.22
mob3s y4t 128 15.74 100.41 16325.95 1875.85 14.66
rn101 y4 1 45.55 69.32 23703.08 113.88 113.88
rn101 y4 8 42.32 94.27 23748.03 220.17 27.52
rn101 y4 16 42.32 73.14 315.59 354.65 22.17
rn101 y4 32 41.79 79.59 724.29 651.06 20.35
rn101 y4 64 41.79 89.47 1337.07 1321.07 20.64
rn101 y4 128 41.79 113.91 27209.60 3061.64 23.92
rn152 y4 1 56.9 77.51 30653.02 127.11 127.11
rn152 y4 8 54.83 106.07 30717.84 225.73 28.22
rn152 y4 16 54.83 75.51 361.81 376.26 23.52
rn152 y4 32 54.83 72.16 1059.79 694.71 21.71
rn18 y4 1 21.07 75.61 21613.34 90.91 90.91
rn18 y4 8 20.3 93.12 21005.99 157.36 19.67
rn18 y4 16 20.3 73.78 246.42 256.46 16.03
rn18 y4 32 20.24 80.91 531.65 489.85 15.31
rn18 y4 64 20.24 91.2 1011.25 1059.32 16.55
rn18 y4 128 20.24 113.03 23795.51 2535.79 19.81
rn34 y4 1 28.04 64.76 20548.91 97.98 97.98
rn34 y4 8 28.28 94.69 19514.33 171.81 21.48
rn34 y4 16 28.28 72.13 250.87 270.24 16.89
rn34 y4 32 28.37 78.21 587.16 526.36 16.45
rn34 y4 64 28.37 93.74 1143.86 1150.37 17.97
rn34 y4 128 28.37 112.3 21840.20 2724.95 21.29
rn50 y4 1 36.97 69.86 25330.24 99.48 99.48
rn50 y4 8 35.58 89.12 23896.04 179.18 22.40
rn50 y4 16 35.58 80.06 289.07 299.34 18.71
rn50 y4 32 35.58 78.47 605.76 567.36

(full_time and load in s, rest in ms)

As expected, batching up the computations leads to a significant speed-up. Unfortunately, batching is not feasible for low-latency real-time processing. The fastest model at batch size 1 was mob3s_y4t with ~50 ms per frame. I would like to get below 30 ms, or even 15 ms, using TensorRT.
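
Roughly, the per-sample latency at batch size 1 can be measured with a loop like this (a sketch, not my exact benchmark script; `frames` is assumed to be a list of uint8 HxWx3 video frames):

import time
import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')

# warm-up: the first call includes tracing and autotuning, hence the huge tfirst values above
_ = model.detect_poses(frames[0])

latencies = []
for frame in frames[1:]:
    t0 = time.perf_counter()
    _ = model.detect_poses(frame)
    latencies.append((time.perf_counter() - t0) * 1000.0)
print(f'mean per-frame latency: {sum(latencies) / len(latencies):.2f} ms')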

What do you think? Is this a good avenue to go down, or should I try something other than TensorRT?

Thanks!
Tobi

Thanks a lot for the detailed analysis! I will try to come back to this soon, but meanwhile what you can do is perhaps extract out the raw model from the convenient "packaged" one, and try to make that run under TensorRT. If that works, we can look at which part of the surrounding code is the culprit.

So you can try

model = tf.saved_model.load(...)
tf.saved_model.save(model.crop_model, 'somepath')

Then try to work with this newly saved model. You can check the interface of this resulting model at the API readme.

Part of the story may be that I saved the full model with options=tf.saved_model.SaveOptions(experimental_custom_gradients=True) or perhaps something with tf.raw_ops.ImageProjectiveTransformV3.
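
If a plain re-save is not enough for TF-TRT, a sketch of attaching an explicit serving signature to the extracted crop model might look like this (the exact input shape, dtype and calling convention of crop_model should be checked against the API readme; 256x256 float32 crops and a direct call are assumptions here):

import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')
crop_model = model.crop_model

# assumes crop_model can be called directly on a batch of 256x256 person crops
@tf.function(input_signature=[tf.TensorSpec([None, 256, 256, 3], tf.float32, name='crops')])
def serve(crops):
    return crop_model(crops)

tf.saved_model.save(crop_model, 'models/eff2s_y4_crop_raw', signatures=serve)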

@tobibaum, have you figured out how to get real-time FPS?
I run it on Ubuntu 18 + a 2080 Ti + TF 2.6 + CUDA 11.2 + cuDNN 8.1 with model.estimate_poses and get about 10 FPS.
I will now switch to a 3090 GPU and test it on Windows 10 and Ubuntu 18.
Were you successful with TensorRT? If so, how?

Can you make a demo.py showing how to use the extracted crop_model? I have seen the API readme, but it is too brief.

@isarandi thanks for the great pointers! My current approach is to build the backbone model (effnet-l) using your original training code and then copy over the trained weights from the SavedModels in your model zoo.
With that, I am able to create an ONNX version of the backbone and compare speed-ups in C++ using tiny-tensorrt.

I loosely follow the nvidia-tutorial to do so:

  1. Load the trained model weights:

import tensorflow as tf

model_folder = 'models/metrabs_eff2s_y4/'  # SavedModel from the model zoo, as above
model = tf.saved_model.load(model_folder)
vars = model.crop_model.variables

  2. Create a blank effnet model:

from backbones.efficientnet.effnetv2_model import *
import backbones.efficientnet.effnetv2_utils as effnet_util
import tfu
effnet_util.set_batchnorm(effnet_util.BatchNormalization)
tfu.set_data_format('NHWC')
tfu.set_dtype(tf.float16)
mod_met = get_model('efficientnetv2-s', include_top=False, pretrained=False, with_endpoints=False)

  3. Copy over the trained weights:

new_vars = mod_met.variables
var_dict = {v.name: [v, i] for i, v in enumerate(vars)}
var_dict_new = {v.name: [v, i] for i, v in enumerate(new_vars)}
inds = [var_dict[k][1] for k in var_dict_new.keys() if k in var_dict]
print(len(var_dict_new))
print(len(inds))

# sanity check: which variables exist in one model but not the other?
missing_keys = set(var_dict.keys()) - set(var_dict_new.keys())
rev_missing_keys = set(var_dict_new.keys()) - set(var_dict.keys())

print(missing_keys)
print(rev_missing_keys)
for m in missing_keys:
    d = var_dict[m][0]
    print(d.name, d.shape)
pick_vars = [vars[i] for i in inds]
print(len(pick_vars))

mod_met.set_weights(pick_vars)

  4. Save the model with a proper signature:

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    prediction = mod_met([my_prediction_inputs], training=False)
    return {"prediction": prediction}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([256, 256, 3], dtype=tf.dtypes.float32, name="image")
)

out_fold = 'effnet_raw_sig'  # matches the tf2onnx command below
tf.saved_model.save(mod_met, out_fold, signatures=my_signatures)

  5. Convert to ONNX (install tf2onnx first):

$ python -m tf2onnx.convert --saved-model effnet_raw_sig --output effnet.onnx

  6. Optimize the runtime plan according to the nvidia-tutorial (see the sketch after this list).

  7. Run inference in C++ with tiny-tensorrt.
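
For step 6, one way to build the runtime plan from the exported ONNX file is Nvidia's trtexec tool (a sketch; flags may differ between TensorRT versions, and --fp16 is optional):

$ trtexec --onnx=effnet.onnx --saveEngine=effnet.plan --fp16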

@gao123qiang I will update this issue once I have had some more success on the full pipeline. The above is just my current approach to determine whether the EfficientNet backbone can be sped up sufficiently with TensorRT. From my preliminary tests (no guarantees), I get the following timings with the C++ API for just the EfficientNet backbone (256x256x3 -> 1x8x8x1280):
tensorflow: 25~30 ms per image
tensorrt: 3~5 ms per image

There is of course some more overhead from image preprocessing and output post-processing (plus running the metrabs head), but overall I think this looks promising.

I will update the issue once I get a real-time system running (or abandon the approach).
Please feel free to share any thoughts or experiments of your own for speeding this whole thing up :)

Thank you, I will test it.
Looking forward to your update!

@tobibaum,
I have seen the nvidia-tutorial and the steps you provided;
the main flow is: SavedModel --> .pb --> .onnx.
Is the mod_met in step 2 keras.models.clone_model(model).crop_model?
In step 4, the input shape is 256x256x3; can I change the size?

Hey @gao123qiang ,

if I understand correctly, the backbones of the overall models are trained on 256x256 patches. Since they are not fully convolutional nets, they depend on the input being of that same size. In the packaging section of the readme, the author first describes how the metrabs core model is packaged to run on 256x256 images.

Also, you cannot feed your raw images into this; you first need to run a detector to determine the locations of people in your scene and normalize the input (roughly illustrated in the sketch below).
My above comments were a guideline on how to dig into the metrabs model and investigate potential speed-ups. You will not get reasonable results by following my steps alone.
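
To roughly illustrate what "run a detector first" means (this is NOT the packaged model's exact preprocessing, which also handles camera intrinsics and color/scale normalization; boxes are assumed to come from your detector, normalized to [0, 1] as [y1, x1, y2, x2]):

import tensorflow as tf

def crop_people(image_uint8, boxes):
    # one input image plus a batch of detector boxes -> a batch of 256x256 person crops
    image = tf.image.convert_image_dtype(image_uint8, tf.float32)[tf.newaxis]  # [1, H, W, 3]
    box_indices = tf.zeros([tf.shape(boxes)[0]], dtype=tf.int32)  # all boxes refer to image 0
    crops = tf.image.crop_and_resize(image, boxes, box_indices, crop_size=[256, 256])
    return crops  # [num_people, 256, 256, 3], the size the backbone was trained on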

@tobibaum,
your approach looks very promising. Is there any progress?

I finally have a working version of my approach. I had to perform some major surgery on the provided models, but I confirmed that the results stay approximately the same, just at higher speeds (I use a 3rd-party YOLO, so the crops and everything downstream differ slightly). Here's what I did:

Split the saved_model into:

  • detection (accelerate with TensorRT)
  • backbone (accelerate with TensorRT)
  • metrabs head (run as is)

I was then able to convert the first two into TensorRT models. The metrabs head has some operations in it that TensorRT does not like, but since it is just the last layer and some function wrappers, it runs very fast in TensorFlow using the C API.
I then wrapped these three models in C++ code and got the following timings on an RTX 2070:

model fps
efficientnet-l 37
efficientnet-s 50
resnet50 58
resnet152 45

check it out:
https://github.com/tobibaum/metrabs_trt

@isarandi it would be great if you could have a look over my approach and check whether I made any breaking mistakes. Thanks!!

good job

@tobibaum Thank you very much for your great how-to!

With your help, I was able to generate the ONNX and plan file of the backbone. So far, I have not done the last step (Compile Your C++ Version); I would prefer a Python version, because otherwise I have to build a cpp extension for that. For my Python version I used the engine and inference code of the Nvidia tutorial. But now I get the error

[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7518, GPU 9372 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 7518, GPU 9382 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +27, now: CPU 0, GPU 248 (MiB)
[01/06/2022-14:52:12] [TRT] [E] 1: Unexpected exception 

CUDA, TensorRT and all other needed libraries are natively installed on my system. The output data consists entirely of zeros, and there is no error in the Python code. Do you have experience with that?
Does this approach also work if I set a dynamic input size in the signature? In my use case I have to change the batch size dynamically at runtime (roughly what I sketch below). Sorry for these questions, but TensorRT is completely new to me.
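
To clarify what I mean by a dynamic input size: the idea would be to export the signature with an unspecified batch dimension and then build the engine with an explicit shape range, roughly like this (unverified; the variable and input names are taken from the steps above and may differ in practice):

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    # batch dimension left dynamic instead of wrapping a single image
    return {"prediction": mod_met(my_prediction_inputs, training=False)}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, 256, 256, 3], dtype=tf.dtypes.float32, name="image"))

# then, when building the plan, something like:
# trtexec --onnx=effnet.onnx --saveEngine=effnet.plan \
#         --minShapes=image:1x256x256x3 --optShapes=image:8x256x256x3 --maxShapes=image:32x256x256x3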

As a next step, I will try your C++ version. Maybe that works for me.

Hey @Basti110 ,

unfortunately I cannot tell what the error might be here. Could you try running the Nvidia inference engine with their models, to pinpoint whether the problem is in the setup or in the compiled model?

Cheers!

Hey,

I think it is only a problem in my Python environment. The C++ version with tiny-tensorrt works! Thank you.

Thanks for testing my implementation :)