aipixel / GPS-Gaussian

[CVPR 2024 Highlight] The official repo for “GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis”

Home Page: https://shunyuanzheng.github.io/GPS-Gaussian

How to achieve 25fps inference speed

initialneil opened this issue

Great work!

Here's some profiling of the inference speed on an RTX 3090.

[CUDA Timer] raft_stereo takes 26.7794 ms
[CUDA Timer] flow2gsparms takes 80.8899 ms
[CUDA Timer] .... flow2gsparms/gs_parm_regresser takes 77.3901 ms
[CUDA Timer] render takes 4.7777 ms

With the provided real test images, the gs_parm_regresser alone takes 77 ms, not to mention other parts like raft_stereo. Could you please give some suggestions on speeding it up?
How was the 25 fps claimed in the paper achieved?
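For reference, here is a minimal sketch of how such per-module timings can be collected with CUDA events in PyTorch; the CUDATimer helper below is a hypothetical stand-in for whatever produced the log above, and the module names and call signatures are only illustrative:

import torch

class CUDATimer:
    # Hypothetical CUDA-event timer, not part of the GPS-Gaussian repo.
    def __init__(self, name):
        self.name = name
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        self.start.record()
        return self

    def __exit__(self, *exc):
        self.end.record()
        torch.cuda.synchronize()  # make sure the timed kernels have finished
        print(f"[CUDA Timer] {self.name} takes {self.start.elapsed_time(self.end):.4f} ms")

# usage (illustrative):
# with CUDATimer("raft_stereo"):
#     flow = raft_stereo(left_img, right_img)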

The depth estimator and gs parameter regresser are implemented with TensorRT in fp16. Robust Video Matting in TensorRT is also needed for real-world applications. However, the accelerated C++ implementation will not be included in this project.
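A common route to such TensorRT fp16 engines is to export each module to ONNX and build an engine with trtexec; a minimal sketch, assuming the regresser can be traced (the module handle, input shape, and tensor names here are placeholders, not the actual export script used):

import torch

# Hypothetical export of the gs parameter regresser to ONNX; the module handle,
# input shape, and tensor names are assumptions for illustration only.
model = gs_parm_regresser.eval().cuda()
dummy_feat = torch.randn(1, 32, 512, 512, device="cuda")
torch.onnx.export(
    model, (dummy_feat,), "gs_parm_regresser.onnx",
    input_names=["feat"], output_names=["gs_parms"], opset_version=17,
)

# Then build an fp16 engine with the trtexec tool shipped with TensorRT:
#   trtexec --onnx=gs_parm_regresser.onnx --fp16 --saveEngine=gs_parm_regresser.plan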

Thanks for your great work! Does the gs parameter regresser model need to be retrained in Python in fp16 mode? When using the official pretrained models directly, there seem to be numerical overflow issues during inference in C++, which lead to incorrect predictions of attributes such as opacity.

If you find numerical overflow in fp16, you can modify these lines as

scale_out = torch.clamp_max(self.scale_head(out), 100.) / 10000.
opacity_out = torch.sigmoid(self.opacity_head(out) / 100.)

so that the heads predict values at a numerically larger scale before the final rescaling. Alternatively, you can run the gs parameter regresser in fp32. Either way, the training process itself still runs in fp32 or mixed precision, so no fp16 retraining is needed.
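If the rest of the pipeline stays in fp16, another option is to disable autocast only around the regresser so its heads run in fp32; a minimal sketch (the module and tensor names are placeholders, not the repo's actual inference code):

import torch

# Run the pipeline under fp16 autocast, but fall back to fp32 for the gs parameter
# regresser to avoid the overflow discussed above (names are placeholders).
with torch.autocast(device_type="cuda", dtype=torch.float16):
    flow = raft_stereo(left_img, right_img)
    with torch.autocast(device_type="cuda", enabled=False):
        gs_parms = gs_parm_regresser(flow.float(), img_feat.float())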