This repo describes the network architecture in a single pure C++ file and uses cuDNN to accelerate execution. I have tried as many acceleration methods as I could, but the code is still not as fast as TensorRT, which also implements memory reuse to reduce the total memory footprint.
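To make the memory reuse idea concrete, here is a minimal sketch of one common approach: assign intermediate tensors to shared buffers based on the range of layers during which each tensor is live. It is not the code in this repo and not how TensorRT does it internally. The names `TensorLife`, `ReusePlan`, and `plan_reuse` are hypothetical, and the sketch assumes each tensor's size and first/last use are known before inference starts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one intermediate tensor: its size in bytes and
// the first/last layer index (inclusive) at which it is live.
struct TensorLife {
    std::size_t bytes;
    int first_use;
    int last_use;
};

// Result of planning: which shared buffer each tensor uses, and how large
// each shared buffer has to be.
struct ReusePlan {
    std::vector<int> buffer_of;         // buffer index assigned to each tensor
    std::vector<std::size_t> capacity;  // required size of each shared buffer

    std::size_t total_bytes() const {
        std::size_t sum = 0;
        for (std::size_t c : capacity) sum += c;
        return sum;
    }
};

// Greedy first-fit planner. Assumption: `tensors` is sorted by first_use.
// A buffer can be reused once the tensor that occupied it is no longer live.
ReusePlan plan_reuse(const std::vector<TensorLife>& tensors) {
    ReusePlan plan;
    plan.buffer_of.assign(tensors.size(), -1);
    std::vector<int> busy_until;  // last layer at which each buffer is occupied

    for (std::size_t i = 0; i < tensors.size(); ++i) {
        const TensorLife& t = tensors[i];
        assert(t.first_use <= t.last_use);
        assert(i == 0 || tensors[i - 1].first_use <= t.first_use);

        // Look for the first buffer whose previous occupant is already dead.
        int chosen = -1;
        for (std::size_t b = 0; b < busy_until.size(); ++b) {
            if (busy_until[b] < t.first_use) {
                chosen = static_cast<int>(b);
                break;
            }
        }

        if (chosen < 0) {
            // No free buffer: open a new one sized for this tensor.
            chosen = static_cast<int>(busy_until.size());
            busy_until.push_back(t.last_use);
            plan.capacity.push_back(t.bytes);
        } else {
            // Reuse the buffer, growing it if this tensor is larger.
            busy_until[chosen] = t.last_use;
            plan.capacity[chosen] = std::max(plan.capacity[chosen], t.bytes);
        }
        plan.buffer_of[i] = chosen;
    }
    return plan;
}
```

For a plain chain of layers where each output is consumed only by the next layer, this planner ends up alternating between two buffers, so the footprint is the sum of the two buffer capacities rather than the sum of all activations. In a real cuDNN program each planned buffer would be backed by one device allocation made up front; that part is left out here so the sketch runs without a GPU.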