missu123 / BEVFormer_tensorrt

BEVFormer inference on TensorRT, including INT8 Quantization and Custom TensorRT Plugins.

BEVFormer on TensorRT

This repository deploys BEVFormer on TensorRT with support for FP32, FP16 and INT8 inference. To speed up inference, it also implements several custom TensorRT ops that support nv_half and nv_half2. With accuracy almost unaffected, the inference speed of BEVFormer base can be increased by nearly four times, the engine size reduced by more than 90%, and GPU memory usage cut by about 70%. The project also supports common 2D object detection models from MMDetection, which can be quantized to INT8 and deployed on TensorRT with only a few code changes.

Benchmarks

BEVFormer

BEVFormer PyTorch

| Model | Data | Batch Size | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFormer tiny (download) | NuScenes | 1 | NDS: 0.354, mAP: 0.252 | 15.9 | 383 | 2167 | RTX 3090 |
| BEVFormer small (download) | NuScenes | 1 | NDS: 0.478, mAP: 0.370 | 5.1 | 680 | 3147 | RTX 3090 |
| BEVFormer base (download) | NuScenes | 1 | NDS: 0.517, mAP: 0.416 | 2.4 | 265 | 5435 | RTX 3090 |

BEVFormer TensorRT with MMDeploy Plugins (Only Support FP32)

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354, mAP: 0.252 | 37.9 | 136 | 2159 | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354, mAP: 0.252 | 69.2 $\uparrow$ 83% | 74 $\downarrow$ 46% | 1729 $\downarrow$ 20% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.305, mAP: 0.219 | 72.0 $\uparrow$ 90% | 50 $\downarrow$ 63% | 1745 $\downarrow$ 19% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.305, mAP: 0.218 | 75.7 $\uparrow$ 100% | 50 $\downarrow$ 63% | 1727 $\downarrow$ 20% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478, mAP: 0.370 | 6.6 | 245 | 4663 | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.478, mAP: 0.370 | 12.8 $\uparrow$ 94% | 126 $\downarrow$ 49% | 3719 $\downarrow$ 20% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.471, mAP: 0.364 | 8.7 $\uparrow$ 32% | 150 $\downarrow$ 39% | 4195 $\downarrow$ 10% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.470, mAP: 0.364 | 13.2 $\uparrow$ 100% | 111 $\downarrow$ 55% | 3661 $\downarrow$ 21% | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32 | - | NDS: 0.517, mAP: 0.416 | 1.5 | 1689 | 13893 | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.517, mAP: 0.416 | 1.8 $\uparrow$ 20% | 849 $\downarrow$ 50% | 11865 $\downarrow$ 15% | RTX 3090 |
| BEVFormer base * | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.512, mAP: 0.410 | 1.7 $\uparrow$ 13% | 1579 $\downarrow$ 7% | 14019 $\uparrow$ 1% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | ERR | - | - | - | RTX 3090 |

* These models run out of memory during the onnx2trt conversion with TensorRT-8.5.1.7 but convert successfully with TensorRT-8.4.3.1, so these engines were built with TensorRT-8.4.3.1.
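
The tables list "PTQ max/per-tensor" as the quantization method. As a rough illustration of what max calibration with a single per-tensor scale usually means, here is a generic sketch in plain Python. It is not this project's calibration code, and the function names are hypothetical.

```python
def max_calibrate(values, num_bits=8):
    """Return one symmetric scale for the whole tensor (per-tensor),
    chosen so that the largest absolute value maps to the INT8 limit."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integer codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    activations = [-0.8, 0.1, 0.35, 2.0]  # made-up sample values
    scale = max_calibrate(activations)
    codes = quantize(activations, scale)
    print(scale, codes, dequantize(codes, scale))
```

Because one scale covers the entire tensor, a single outlier stretches the range for every value, which is one common reason max calibration can cost accuracy on some models.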

BEVFormer TensorRT with Custom Plugins (Support nv_half and nv_half2)

FP16 Plugins with nv_half

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS/Improve | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFormer tiny | NuScenes | 1 | FP32 | - | NDS: 0.354, mAP: 0.252 | 41.4 $\uparrow$ 9% | 135 $\downarrow$ 1% | 1699 $\downarrow$ 21% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354, mAP: 0.252 | 76.8 $\uparrow$ 103% | 73 $\downarrow$ 46% | 1203 $\downarrow$ 44% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.305, mAP: 0.219 | 78.9 $\uparrow$ 108% | 48 $\downarrow$ 65% | 1323 $\downarrow$ 39% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.305, mAP: 0.219 | 89.0 $\uparrow$ 135% | 48 $\downarrow$ 65% | 1253 $\downarrow$ 42% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32 | - | NDS: 0.478, mAP: 0.370 | 7.0 $\uparrow$ 6% | 246 $\downarrow$ 0% | 2645 $\downarrow$ 43% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.479, mAP: 0.370 | 16.3 $\uparrow$ 147% | 124 $\downarrow$ 49% | 1789 $\downarrow$ 62% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.471, mAP: 0.364 | 10.3 $\uparrow$ 56% | 149 $\downarrow$ 39% | 2283 $\downarrow$ 51% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.471, mAP: 0.364 | 16.5 $\uparrow$ 150% | 110 $\downarrow$ 55% | 2123 $\downarrow$ 54% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32 | - | NDS: 0.516, mAP: 0.416 | 3.2 $\uparrow$ 113% | 283 $\downarrow$ 83% | 5175 $\downarrow$ 63% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.515, mAP: 0.415 | 6.5 $\uparrow$ 333% | 144 $\downarrow$ 91% | 3323 $\downarrow$ 76% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP32/INT8 | PTQ max/per-tensor | NDS: 0.512, mAP: 0.410 | 4.2 $\uparrow$ 180% | 173 $\downarrow$ 90% | 5077 $\downarrow$ 63% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.511, mAP: 0.409 | 5.7 $\uparrow$ 280% | 135 $\downarrow$ 92% | 4557 $\downarrow$ 67% | RTX 3090 |

FP16 Plugins with nv_half2

| Model | Data | Batch Size | Float/Int | Quantization Method | NDS/mAP | FPS | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BEVFormer tiny | NuScenes | 1 | FP16 | - | NDS: 0.354, mAP: 0.251 | 90.7 $\uparrow$ 139% | 73 $\downarrow$ 46% | 1211 $\downarrow$ 44% | RTX 3090 |
| BEVFormer tiny | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.305, mAP: 0.218 | 88.7 $\uparrow$ 134% | 48 $\downarrow$ 65% | 1253 $\downarrow$ 42% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16 | - | NDS: 0.478, mAP: 0.370 | 18.2 $\uparrow$ 176% | 124 $\downarrow$ 49% | 1843 $\downarrow$ 60% | RTX 3090 |
| BEVFormer small | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.471, mAP: 0.364 | 17.5 $\uparrow$ 165% | 110 $\downarrow$ 55% | 2013 $\downarrow$ 57% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16 | - | NDS: 0.515, mAP: 0.415 | 7.3 $\uparrow$ 387% | 144 $\downarrow$ 91% | 3323 $\downarrow$ 76% | RTX 3090 |
| BEVFormer base | NuScenes | 1 | FP16/INT8 | PTQ max/per-tensor | NDS: 0.512, mAP: 0.410 | 6.3 $\uparrow$ 320% | 116 $\downarrow$ 93% | 4543 $\downarrow$ 67% | RTX 3090 |
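
The source does not state the baseline for the $\uparrow$/$\downarrow$ percentages. The numbers are consistent with a relative change against the FP32 TensorRT engine built with MMDeploy plugins for the same model size; that reading is an inference from the tables. A minimal sketch of the arithmetic under that assumption, using the BEVFormer base FP16 (nv_half2) row (the helper name is hypothetical):

```python
def relative_change_percent(value, baseline):
    """Signed change of `value` relative to `baseline`, rounded to a whole percent."""
    return round((value - baseline) / baseline * 100)


# Assumed baseline: BEVFormer base, FP32, TensorRT with MMDeploy plugins.
BASELINE = {"fps": 1.5, "size_mb": 1689, "memory_mb": 13893}
# Compared row: BEVFormer base, FP16, custom plugins with nv_half2.
HALF2_FP16 = {"fps": 7.3, "size_mb": 144, "memory_mb": 3323}

if __name__ == "__main__":
    for key in BASELINE:
        change = relative_change_percent(HALF2_FP16[key], BASELINE[key])
        print(f"{key}: {change:+d}%")
```

Under this reading, the row reproduces the reported 387% FPS gain, 91% smaller engine and 76% lower memory use.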

2D Detection Models

This project also supports common 2D object detection models in MMDetection with little modification. The following are deployment examples of YOLOx and CenterNet.

YOLOx

| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOx (download) | COCO | PyTorch | 32 | FP32 | - | mAP: 0.506, mAP_50: 0.685, mAP_75: 0.55, mAP_s: 0.32, mAP_m: 0.557, mAP_l: 0.667 | 1158 | 379 | 7617 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32 | - | mAP: 0.506, mAP_50: 0.685, mAP_75: 0.55, mAP_s: 0.32, mAP_m: 0.556, mAP_l: 0.667 | 11307 | 546 | 9943 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16 | - | mAP: 0.506, mAP_50: 0.685, mAP_75: 0.55, mAP_s: 0.32, mAP_m: 0.556, mAP_l: 0.668 | 29907 | 192 | 4567 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP32/INT8 | PTQ max/per-tensor | mAP: 0.48, mAP_50: 0.673, mAP_75: 0.524, mAP_s: 0.293, mAP_m: 0.524, mAP_l: 0.644 | 24806 | 98 | 3999 | RTX 3090 |
| YOLOx | COCO | TensorRT | 32 | FP16/INT8 | PTQ max/per-tensor | mAP: 0.48, mAP_50: 0.673, mAP_75: 0.528, mAP_s: 0.295, mAP_m: 0.523, mAP_l: 0.642 | 25397 | 98 | 3719 | RTX 3090 |

CenterNet

| Model | Data | Framework | Batch Size | Float/Int | Quantization Method | mAP | FPS | Size (MB) | Memory (MB) | Device |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CenterNet (download) | COCO | PyTorch | 32 | FP32 | - | mAP: 0.295, mAP_50: 0.462, mAP_75: 0.314, mAP_s: 0.102, mAP_m: 0.33, mAP_l: 0.466 | 3271 | | 5171 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32 | - | mAP: 0.295, mAP_50: 0.461, mAP_75: 0.314, mAP_s: 0.102, mAP_m: 0.33, mAP_l: 0.466 | 15842 | 58 | 8241 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16 | - | mAP: 0.294, mAP_50: 0.46, mAP_75: 0.313, mAP_s: 0.102, mAP_m: 0.329, mAP_l: 0.463 | 16162 | 29 | 5183 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP32/INT8 | PTQ max/per-tensor | mAP: 0.29, mAP_50: 0.456, mAP_75: 0.306, mAP_s: 0.101, mAP_m: 0.324, mAP_l: 0.457 | 14814 | 25 | 4673 | RTX 3090 |
| CenterNet | COCO | TensorRT | 32 | FP16/INT8 | PTQ max/per-tensor | mAP: 0.29, mAP_50: 0.456, mAP_75: 0.307, mAP_s: 0.101, mAP_m: 0.325, mAP_l: 0.456 | 16754 | 19 | 4117 | RTX 3090 |

Install

Clone

git clone git@github.com:DerryHub/BEVFormer_tensorrt.git
cd BEVFormer_tensorrt
PROJECT_DIR=$(pwd)

Data Preparation

MS COCO (For 2D Detection)

Download the COCO 2017 datasets to /path/to/coco and unzip them.

cd ${PROJECT_DIR}/data
ln -s /path/to/coco coco

NuScenes and CAN bus (For BEVFormer)

Download the nuScenes V1.0 full dataset and the CAN bus expansion data HERE, and place them at /path/to/nuscenes and /path/to/can_bus.

Prepare the nuScenes data in the same way as BEVFormer.

cd ${PROJECT_DIR}/data
ln -s /path/to/nuscenes nuscenes
ln -s /path/to/can_bus can_bus

cd ${PROJECT_DIR}
sh samples/bevformer/create_data.sh

Tree

${PROJECT_DIR}/data/.
├── can_bus
│   ├── scene-0001_meta.json
│   ├── scene-0001_ms_imu.json
│   ├── scene-0001_pose.json
│   └── ...
├── coco
│   ├── annotations
│   ├── test2017
│   ├── train2017
│   └── val2017
└── nuscenes
    ├── maps
    ├── samples
    ├── sweeps
    └── v1.0-trainval
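
Before running the sample scripts, it can help to confirm that the links resolve to the layout shown in the tree. The sketch below checks only the directories listed above; the helper name and the way it is invoked are assumptions, not part of this project.

```python
import sys
from pathlib import Path

# Directories taken from the tree above, relative to ${PROJECT_DIR}/data.
EXPECTED_DIRS = [
    "can_bus",
    "coco/annotations",
    "coco/test2017",
    "coco/train2017",
    "coco/val2017",
    "nuscenes/maps",
    "nuscenes/samples",
    "nuscenes/sweeps",
    "nuscenes/v1.0-trainval",
]


def missing_data_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root.

    Path.is_dir() follows symbolic links, so directories created with
    `ln -s` count as present when their targets exist.
    """
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    data_root = sys.argv[1] if len(sys.argv) > 1 else "data"
    missing = missing_data_dirs(data_root)
    print("all expected directories found" if not missing else f"missing: {missing}")
```

If only BEVFormer is needed, the coco entries can be dropped from the list, since the source marks COCO as required for 2D detection only.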

Install Packages

CUDA/cuDNN/TensorRT

Download and install CUDA-11.6, cuDNN-8.6.0 and TensorRT-8.5.1.7 following NVIDIA's official instructions.

PyTorch

Install PyTorch and TorchVision following the official instructions.

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

MMCV-full

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.5.0
pip install -r requirements/optional.txt
MMCV_WITH_OPS=1 pip install -e .

MMDetection

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout v2.25.1
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.

MMDeploy

git clone git@github.com:open-mmlab/mmdeploy.git
cd mmdeploy
git checkout v0.10.0

git clone git@github.com:NVIDIA/cub.git third_party/cub
cd third_party/cub
git checkout c3cceac115

# go back to third_party directory and git clone pybind11
cd ..
git clone git@github.com:pybind/pybind11.git pybind11
cd pybind11
git checkout 70a58c5

Build TensorRT Plugins of MMDeploy

Make sure cmake version >= 3.14.0 and gcc version >= 7.

export MMDEPLOY_DIR=/the/root/path/of/MMDeploy
export TENSORRT_DIR=/the/path/of/tensorrt
export CUDNN_DIR=/the/path/of/cuda

export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_DIR/lib64:$LD_LIBRARY_PATH

cd ${MMDEPLOY_DIR}
mkdir -p build
cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) 
make install

Install MMDeploy

cd ${MMDEPLOY_DIR}
pip install -v -e .
# "-v" means verbose, or more output
# "-e" means installing a project in editable mode,
# thus any local modifications made to the code will take effect without reinstallation.
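
Since the steps above pin specific versions, a quick comparison of what is actually installed can catch mismatches early. This is a minimal sketch using the standard library's importlib.metadata. The distribution names (for example mmcv-full for an MMCV build with MMCV_WITH_OPS=1) and the exact version strings are assumptions about how these packages register themselves, so adjust them if your environment reports something different.

```python
from importlib import metadata

# Versions pinned in the install steps above (distribution names are assumed).
EXPECTED_VERSIONS = {
    "torch": "1.12.1+cu116",
    "torchvision": "0.13.1+cu116",
    "mmcv-full": "1.5.0",
    "mmdet": "2.25.1",
    "mmdeploy": "0.10.0",
}


def find_mismatches(expected, lookup=metadata.version):
    """Return {name: (wanted, found)} for packages whose version differs.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED_VERSIONS).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every listed package matches its pin; it does not verify CUDA, cuDNN or TensorRT, which are installed outside pip.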

Install this Project

Build and Install Custom TensorRT Plugins

cd ${PROJECT_DIR}/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/path/to/TensorRT
make -j$(nproc)
make install

Run Unit Test of Custom TensorRT Plugins

cd ${PROJECT_DIR}
sh samples/test_trt_ops.sh

Build and Install Part of Ops in MMDetection3D

cd ${PROJECT_DIR}/third_party/bevformer
python setup.py build develop

Prepare the Checkpoints

Download the PyTorch checkpoints listed above to ${PROJECT_DIR}/checkpoints/pytorch/. The ONNX files and TensorRT engines will be saved in ${PROJECT_DIR}/checkpoints/onnx/ and ${PROJECT_DIR}/checkpoints/tensorrt/.

Custom TensorRT Plugins

The custom plugins cover the common TensorRT ops used in BEVFormer: Grid Sampler, Multi-scale Deformable Attention, Modulated Deformable Conv2d and Rotate.

Each op is implemented in two versions: FP32/FP16 (nv_half) and FP32/FP16 (nv_half2).

For specific speed comparison, see Custom TensorRT Plugins.

Run

The following tutorial uses BEVFormer base as an example.

  • Evaluate with PyTorch

cd ${PROJECT_DIR}
# default gpu_id is 0
sh samples/bevformer/base/pth_evaluate.sh -d ${gpu_id}

  • Evaluate with TensorRT and MMDeploy Plugins

# convert .pth to .onnx
sh samples/bevformer/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32)
sh samples/bevformer/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16)
sh samples/bevformer/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32)
sh samples/bevformer/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16)
sh samples/bevformer/base/trt_evaluate_fp16.sh -d ${gpu_id}

# Quantization
# calibration of post training quantization
sh samples/bevformer/base/quant_max_ptq.sh -d ${gpu_id}
# convert .pth to .onnx
sh samples/bevformer/base/pth2onnx_q.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32/INT8)
sh samples/bevformer/base/onnx2trt_int8.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16/INT8)
sh samples/bevformer/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32/INT8)
sh samples/bevformer/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16/INT8)
sh samples/bevformer/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# quantization-aware training
# default gpu_ids is 0,1,2,3,4,5,6,7
sh samples/bevformer/base/quant_aware_train.sh -d ${gpu_ids}
# then follow the post-training quantization process

  • Evaluate with TensorRT and Custom Plugins

# nv_half
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32)
sh samples/bevformer/plugin/base/onnx2trt.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/onnx2trt_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32)
sh samples/bevformer/plugin/base/trt_evaluate.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/trt_evaluate_fp16.sh -d ${gpu_id}

# nv_half2
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_2.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/onnx2trt_fp16_2.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/trt_evaluate_fp16_2.sh -d ${gpu_id}

# Quantization
# calibration of post training quantization
sh samples/bevformer/base/quant_max_ptq.sh -d ${gpu_id}

# nv_half
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_q.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# nv_half2
# convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_q_2.sh -d ${gpu_id}
# convert .onnx to TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16_2.sh -d ${gpu_id}
# evaluate with TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16_2.sh -d ${gpu_id}
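
The custom-plugin scripts above follow a regular naming pattern: a precision suffix, a _q marker on the export script for quantized models, and a trailing _2 for nv_half2. As an illustration, the hypothetical helper below assembles the command sequence for one engine variant. It only builds strings from the script names listed above and does not execute anything.

```python
PLUGIN_DIR = "samples/bevformer/plugin/base"
# The calibration script lives under the non-plugin directory in the steps above.
CALIBRATION_SCRIPT = "samples/bevformer/base/quant_max_ptq.sh"

# precision -> suffix used by the onnx2trt* and trt_evaluate* scripts
SUFFIX = {"fp32": "", "fp16": "_fp16", "int8": "_int8", "int8_fp16": "_int8_fp16"}


def plugin_commands(precision, half2=False, gpu_id=0):
    """List the shell commands, in order, for one custom-plugin engine variant."""
    if precision not in SUFFIX:
        raise ValueError(f"unknown precision: {precision!r}")
    if half2 and "fp16" not in precision:
        raise ValueError("nv_half2 scripts are listed only for FP16 engines")

    quantized = precision.startswith("int8")
    tail = "_2" if half2 else ""

    scripts = []
    if quantized:
        scripts.append(CALIBRATION_SCRIPT)  # PTQ calibration comes first
    export = "pth2onnx_q" if quantized else "pth2onnx"
    scripts.append(f"{PLUGIN_DIR}/{export}{tail}.sh")
    scripts.append(f"{PLUGIN_DIR}/onnx2trt{SUFFIX[precision]}{tail}.sh")
    scripts.append(f"{PLUGIN_DIR}/trt_evaluate{SUFFIX[precision]}{tail}.sh")
    return [f"sh {script} -d {gpu_id}" for script in scripts]


if __name__ == "__main__":
    for command in plugin_commands("int8_fp16", half2=True):
        print(command)
```

For example, plugin_commands("fp16", half2=True) yields the three nv_half2 commands in the order they appear above: export to ONNX, build the engine, then evaluate.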

Acknowledgement

This project is mainly based on these excellent open source projects:

License: Apache License 2.0

Languages

Python 65.9%, Cuda 20.3%, C++ 13.6%, CMake 0.2%, Shell 0.0%