open-mmlab / mmdeploy

OpenMMLab Model Deployment Framework

Home Page:https://mmdeploy.readthedocs.io/en/latest/

Inference error

wulouzhu opened this issue · comments

Hi:
I successfully converted a Faster R-CNN model downloaded from the mmdetection model zoo to a TensorRT engine, but when I run inference_model I get the following error:
[2022-04-22 07:27:52.715] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
2022-04-22 07:27:57,889 - mmdeploy - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so
2022-04-22 07:27:57,889 - mmdeploy - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so
/opt/conda/lib/python3.8/site-packages/mmdet-2.22.0-py3.8.egg/mmdet/datasets/utils.py:66: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
warnings.warn(
#assertion/root/workspace/mmdeploy/csrc/backend_ops/tensorrt/batched_nms/trt_batched_nms.cpp,98
Aborted (core dumped)

Could you please tell me why this happens and how to deal with it? Thank you.

commented

Hi, sorry for the late reply. Could you provide more detail about your environment? You can use https://github.com/open-mmlab/mmdeploy/blob/master/tools/check_env.py to check the environment.

@grimoire
The problem above happened because the mmdetection config was wrong, and I have now solved it. But when I moved on to running Mask R-CNN inference with Python, I got this error:
[2022-04-25 07:27:01.643] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
2022-04-25 07:27:06,617 - mmdeploy - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so
2022-04-25 07:27:06,617 - mmdeploy - INFO - Successfully loaded tensorrt plugins from /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so
/opt/conda/lib/python3.8/site-packages/mmdet-2.23.0-py3.8.egg/mmdet/datasets/utils.py:66: UserWarning: "ImageToTensor" pipeline is replaced by "DefaultFormatBundle" for batch inference. It is recommended to manually replace it in the test data pipeline in your config file.
warnings.warn(
Traceback (most recent call last):
File "mmdetection/demo/deploy.py", line 16, in <module>
result = inference_model(model_cfg, deploy_cfg, backend_files, img=img, device=device)
File "/root/workspace/mmdeploy/mmdeploy/apis/inference.py", line 51, in inference_model
result = task_processor.run_inference(model, model_inputs)
File "/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/deploy/object_detection.py", line 199, in run_inference
return model(**model_inputs, return_loss=False, rescale=True)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/deploy/object_detection_model.py", line 202, in forward
outputs = End2EndModel.__clear_outputs(outputs)
File "/root/workspace/mmdeploy/mmdeploy/codebase/mmdet/deploy/object_detection_model.py", line 110, in __clear_outputs
outputs[output_id][i] = test_outputs[output_id][i, inds, ...]
RuntimeError: CUDA error: misaligned address
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1614378062065/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4167cbb2f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f4167cb867b in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f4167f14219 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f4167ca33a4 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6e0d9a (0x7f41b4ae8d9a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x6e0e31 (0x7f41b4ae8e31 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #17: __libc_start_main + 0xf3 (0x7f41daa0f0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

My environment, which is built from the Dockerfile, is as follows (output of tools/check_env.py):
2022-04-25 07:37:39,870 - mmdeploy - INFO -

2022-04-25 07:37:39,870 - mmdeploy - INFO - Environmental information
fatal: not a git repository (or any of the parent directories): .git
2022-04-25 07:37:41,283 - mmdeploy - INFO - sys.platform: linux
2022-04-25 07:37:41,283 - mmdeploy - INFO - Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
2022-04-25 07:37:41,283 - mmdeploy - INFO - CUDA available: True
2022-04-25 07:37:41,283 - mmdeploy - INFO - GPU 0: Quadro RTX 6000
2022-04-25 07:37:41,283 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-04-25 07:37:41,283 - mmdeploy - INFO - NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
2022-04-25 07:37:41,283 - mmdeploy - INFO - GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
2022-04-25 07:37:41,283 - mmdeploy - INFO - PyTorch: 1.8.0
2022-04-25 07:37:41,284 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

2022-04-25 07:37:41,284 - mmdeploy - INFO - TorchVision: 0.9.0
2022-04-25 07:37:41,284 - mmdeploy - INFO - OpenCV: 4.5.5
2022-04-25 07:37:41,284 - mmdeploy - INFO - MMCV: 1.4.0
2022-04-25 07:37:41,284 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-04-25 07:37:41,284 - mmdeploy - INFO - MMCV CUDA Compiler: 10.2
2022-04-25 07:37:41,284 - mmdeploy - INFO - MMDeploy: 0.4.0+
2022-04-25 07:37:41,284 - mmdeploy - INFO -

2022-04-25 07:37:41,284 - mmdeploy - INFO - Backend information
[2022-04-25 07:37:41.475] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
2022-04-25 07:37:41,542 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-04-25 07:37:41,543 - mmdeploy - INFO - tensorrt: 7.2.3.4 ops_is_avaliable : True
2022-04-25 07:37:41,543 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-04-25 07:37:41,544 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-04-25 07:37:41,544 - mmdeploy - INFO - openvino_is_avaliable: False
2022-04-25 07:37:41,544 - mmdeploy - INFO -

2022-04-25 07:37:41,544 - mmdeploy - INFO - Codebase information
2022-04-25 07:37:41,545 - mmdeploy - INFO - mmdet: 2.23.0
2022-04-25 07:37:41,545 - mmdeploy - INFO - mmseg: None
2022-04-25 07:37:41,546 - mmdeploy - INFO - mmcls: None
2022-04-25 07:37:41,546 - mmdeploy - INFO - mmocr: None
2022-04-25 07:37:41,546 - mmdeploy - INFO - mmedit: None
2022-04-25 07:37:41,546 - mmdeploy - INFO - mmdet3d: None
2022-04-25 07:37:41,546 - mmdeploy - INFO - mmpose: None

Looking forward to your reply!

commented

What is your host CUDA driver? The MMDeploy in the docker image is built with nvcc==11.3, but your PyTorch and MMCV are built with CUDA 10.2.
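The mismatch pointed out here can be read straight off the check_env.py output. Below is a minimal sketch that pulls the CUDA versions out of such a log and compares them. The helper names `cuda_versions` and `is_consistent` are hypothetical, and the regular expressions assume the line formats shown in the log above. A mismatch does not by itself prove a failure (it turned out not to be the root cause in this thread), but it is quick to rule out.

```python
import re


def cuda_versions(check_env_output: str) -> dict:
    """Collect the CUDA versions reported by different components.

    The patterns assume the log line formats seen in this thread.
    """
    patterns = {
        "nvcc": r"NVCC: Build cuda_(\d+\.\d+)",
        "pytorch_runtime": r"CUDA Runtime (\d+\.\d+)",
        "mmcv_cuda_compiler": r"MMCV CUDA Compiler: (\d+\.\d+)",
    }
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, check_env_output)
        if match:
            found[name] = match.group(1)
    return found


def is_consistent(versions: dict) -> bool:
    """True when every component reports the same CUDA version."""
    return len(set(versions.values())) <= 1


# Lines copied from the check_env.py output posted above.
sample = """\
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
  • CUDA Runtime 10.2
MMCV CUDA Compiler: 10.2
"""

versions = cuda_versions(sample)
print(versions)
print("consistent:", is_consistent(versions))
```

To use it on a real environment, the output of tools/check_env.py could be redirected to a file and its contents passed to `cuda_versions`.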

@grimoire
My host CUDA driver is 510.54. My docker image is built from https://github.com/open-mmlab/mmdeploy/blob/master/docker/GPU/Dockerfile.
In that Dockerfile, PyTorch and MMCV are built with CUDA 10.2:
FROM nvcr.io/nvidia/tensorrt:21.04-py3

ARG CUDA=10.2
ARG PYTHON_VERSION=3.8
ARG TORCH_VERSION=1.8.0
ARG TORCHVISION_VERSION=0.9.0
ARG ONNXRUNTIME_VERSION=1.8.1
ARG MMCV_VERSION=1.4.0
ARG PPLCV_VERSION=0.6.2
ENV FORCE_CUDA="1"

Is the Dockerfile wrong?

At first, I built a docker image with PyTorch and MMCV built against CUDA 11.3:
root@07231d93287f:~/workspace# python mmdeploy/tools/check_env.py
2022-04-25 09:11:44,986 - mmdeploy - INFO -

2022-04-25 09:11:44,986 - mmdeploy - INFO - Environmental information
fatal: not a git repository (or any of the parent directories): .git
2022-04-25 09:11:46,632 - mmdeploy - INFO - sys.platform: linux
2022-04-25 09:11:46,633 - mmdeploy - INFO - Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
2022-04-25 09:11:46,633 - mmdeploy - INFO - CUDA available: True
2022-04-25 09:11:46,633 - mmdeploy - INFO - GPU 0: Quadro RTX 6000
2022-04-25 09:11:46,633 - mmdeploy - INFO - CUDA_HOME: /usr/local/cuda
2022-04-25 09:11:46,633 - mmdeploy - INFO - NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
2022-04-25 09:11:46,633 - mmdeploy - INFO - GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
2022-04-25 09:11:46,633 - mmdeploy - INFO - PyTorch: 1.10.0
2022-04-25 09:11:46,633 - mmdeploy - INFO - PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

2022-04-25 09:11:46,634 - mmdeploy - INFO - TorchVision: 0.11.0
2022-04-25 09:11:46,634 - mmdeploy - INFO - OpenCV: 4.5.5
2022-04-25 09:11:46,634 - mmdeploy - INFO - MMCV: 1.4.0
2022-04-25 09:11:46,634 - mmdeploy - INFO - MMCV Compiler: GCC 7.3
2022-04-25 09:11:46,634 - mmdeploy - INFO - MMCV CUDA Compiler: 11.3
2022-04-25 09:11:46,634 - mmdeploy - INFO - MMDeploy: 0.4.0+
2022-04-25 09:11:46,634 - mmdeploy - INFO -

2022-04-25 09:11:46,635 - mmdeploy - INFO - Backend information
[2022-04-25 09:11:46.841] [mmdeploy] [info] [model.cpp:95] Register 'DirectoryModel'
2022-04-25 09:11:46,915 - mmdeploy - INFO - onnxruntime: 1.8.1 ops_is_avaliable : True
2022-04-25 09:11:46,916 - mmdeploy - INFO - tensorrt: 7.2.3.4 ops_is_avaliable : True
2022-04-25 09:11:46,916 - mmdeploy - INFO - ncnn: None ops_is_avaliable : False
2022-04-25 09:11:46,916 - mmdeploy - INFO - pplnn_is_avaliable: False
2022-04-25 09:11:46,917 - mmdeploy - INFO - openvino_is_avaliable: False
2022-04-25 09:11:46,917 - mmdeploy - INFO -

2022-04-25 09:11:46,917 - mmdeploy - INFO - Codebase information
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmdet: 2.23.0
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmseg: None
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmcls: None
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmocr: None
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmedit: None
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmdet3d: None
2022-04-25 09:11:46,918 - mmdeploy - INFO - mmpose: None

But when I converted mask-rcnn model to trt engine, I got the error:
[TensorRT] WARNING: Output type must be INT32 for shape outputs
[TensorRT] WARNING: Output type must be INT32 for shape outputs
[TensorRT] WARNING: Output type must be INT32 for shape outputs
[TensorRT] WARNING: Output type must be INT32 for shape outputs
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] ERROR: ../builder/cudnnBuilderUtils.cpp (408) - Cuda Error in findFastestTactic: 700 (an illegal memory access was encountered)
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'
what(): std::exception
2022-04-25 09:10:45,266 - mmdeploy - ERROR - onnx2tensorrt of mmdetection/checkpoints/mask_rcnn/end2end.onnx failed.

So I switched back to the CUDA 10.2 setup provided by the project, without any changes.

commented

Errr, I want to know the CUDA driver of your host (outside the docker). If the CUDA version in docker is higher than what your host driver supports, you might get unexpected results.

I have answered the question above

Hi, @wulouzhu. Could you please provide the conversion scripts?

@AllentDan
python ${MMDEPLOY_DIR}/tools/deploy.py ${MMDEPLOY_DIR}/configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py ${MMDET_DIR}/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ${CHECKPOINT_DIR}/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth ${INPUT_IMG} --work-dir ${WORK_DIR} --device cuda:0 --dump-info

When I use CUDA 10.2 for PyTorch and MMCV, the conversion succeeds but inference fails. When I use CUDA 11.3 for PyTorch and MMCV, the conversion itself fails. You can find the error details above.

That's weird. I tested it successfully a minute ago with mmdeploy Dockerfile with the following commands:

docker run --gpus all -it -p 8081:8082 gpu_test
pip install mmdet
cd ~/workspace && git clone https://github.com/open-mmlab/mmdetection.git && cd mmdeploy
python tools/deploy.py configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py  ../mmdetection/configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py https://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth ../mmdetection/demo/demo.jpg --work-dir ../work-dir/mmdet --device cuda:0 --dump-info

I tried again with the latest mmdeploy and also hit the error. It might be a bug introduced by the latest features. We will fix it ASAP. If this is urgent for you, a previous version should work.

@AllentDan
Which CUDA version did you use for PyTorch and MMCV?

Just the default version in the Dockerfile. It is fine to use a different CUDA version inside the conda env.

@AllentDan
Could you please try CUDA 11.3 with PyTorch 1.10 and MMCV in the Dockerfile? I want to know whether you hit the same error as I did when converting the Mask R-CNN model. Thanks!

ERROR: ../builder/cudnnBuilderUtils.cpp (408) - C

Sorry for the late reply. I am working on it and will inform you as soon as it gets fixed.

Hi there. I tried to reproduce your error today, and it turned out that we were both using the wrong deploy configuration file. Mask R-CNN needs configs/mmdet/instance-seg/instance-seg_tensorrt_dynamic-320x320-1344x1344.py, whereas configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py is for detection only and does not cover the instance segmentation task.
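To make the distinction concrete, here is a minimal sketch of choosing the deploy config by task. The two paths are the ones named in this thread. The helper `pick_deploy_config` and its `has_mask_head` parameter are hypothetical and not part of mmdeploy.

```python
# Deploy configs mentioned in this thread, keyed by task.
DEPLOY_CONFIGS = {
    "detection": "configs/mmdet/detection/"
                 "detection_tensorrt_dynamic-320x320-1344x1344.py",
    "instance-seg": "configs/mmdet/instance-seg/"
                    "instance-seg_tensorrt_dynamic-320x320-1344x1344.py",
}


def pick_deploy_config(has_mask_head: bool) -> str:
    """Return the deploy config path for a detector.

    Hypothetical helper: models that predict masks (e.g. Mask R-CNN)
    get the instance-seg config, box-only detectors (e.g. Faster R-CNN)
    get the detection config.
    """
    task = "instance-seg" if has_mask_head else "detection"
    return DEPLOY_CONFIGS[task]


print(pick_deploy_config(has_mask_head=True))   # Mask R-CNN
print(pick_deploy_config(has_mask_head=False))  # Faster R-CNN
```

The returned path would then take the place of the first positional argument in the tools/deploy.py command shown earlier.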

That's very kind of you. I have solved the problem. But I have another question about the meaning of opt_shape in instance-seg_tensorrt_dynamic-320x320-1344x1344.py.

Well, max_shape and min_shape set the range of input resolutions allowed during inference.

Yes, I know that. What about opt_shape?

It is just the shape that TensorRT optimizes for.
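As a rough illustration of how the three shapes relate: a TensorRT engine built with a dynamic-shape profile accepts any input between min_shape and max_shape, and its kernels are tuned for opt_shape, so inputs close to opt_shape usually run fastest. The sketch below checks an input shape against such a range. The min and max values are inferred from the config file name, the opt_shape value is an assumption, and `fits_profile` is a hypothetical helper rather than an mmdeploy or TensorRT function.

```python
# Shapes are NCHW. min/max follow the "320x320-1344x1344" file name;
# opt_shape is an assumed value for illustration only.
profile = {
    "min_shape": [1, 3, 320, 320],
    "opt_shape": [1, 3, 800, 1344],
    "max_shape": [1, 3, 1344, 1344],
}


def fits_profile(shape, profile) -> bool:
    """True if every dimension lies within [min_shape, max_shape]."""
    lo, hi = profile["min_shape"], profile["max_shape"]
    if not (len(shape) == len(lo) == len(hi)):
        return False
    return all(l <= dim <= h for l, dim, h in zip(lo, shape, hi))


print(fits_profile([1, 3, 800, 1344], profile))   # within the range
print(fits_profile([1, 3, 1600, 1344], profile))  # height above max_shape
```

An input outside the range would be rejected by the engine at run time, so widening min_shape or max_shape requires rebuilding the engine.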