RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR

Question

lyq998 opened this issue 2 years ago

Hi, I hit the error below when running the script IRR-PWC_flyingChairs.sh.
My environment: PyTorch 0.4.1, CUDA 8.0, cuDNN 7.0.1.

2022-03-04 17:26:08 ==> Commandline Arguments
2022-03-04 17:26:08 batch_size: 4
2022-03-04 17:26:08 batch_size_val: 4
2022-03-04 17:26:08 checkpoint: saved_check_point/pwcnet/IRR-PWC_flyingchairsOcc/checkpoint_best.ckpt
2022-03-04 17:26:08 checkpoint_exclude_params: ['']
2022-03-04 17:26:08 checkpoint_include_params: ['*']
2022-03-04 17:26:08 checkpoint_mode: resume_from_latest
2022-03-04 17:26:08 cuda: True
2022-03-04 17:26:08 evaluation: True
2022-03-04 17:26:08 lr_scheduler: None
2022-03-04 17:26:08 model: IRR_PWC
2022-03-04 17:26:08 model_div_flow: 0.05
2022-03-04 17:26:08 name: run
2022-03-04 17:26:08 num_iters: 1
2022-03-04 17:26:08 num_workers: 4
2022-03-04 17:26:08 optimizer: Adam
2022-03-04 17:26:08 optimizer_amsgrad: False
2022-03-04 17:26:08 optimizer_betas: (0.9, 0.999)
2022-03-04 17:26:08 optimizer_eps: 1e-08
2022-03-04 17:26:08 optimizer_group: None
2022-03-04 17:26:08 optimizer_lr: 0.001
2022-03-04 17:26:08 optimizer_weight_decay: 0
2022-03-04 17:26:08 save: saved_check_point/pwcnet/eval_temp/IRR_PWC
2022-03-04 17:26:08 save_result_bidirection: False
2022-03-04 17:26:08 save_result_flo: False
2022-03-04 17:26:08 save_result_img: False
2022-03-04 17:26:08 save_result_occ: False
2022-03-04 17:26:08 save_result_path_name:
2022-03-04 17:26:08 save_result_png: False
2022-03-04 17:26:08 seed: 1
2022-03-04 17:26:08 start_epoch: 1
2022-03-04 17:26:08 total_epochs: 10
2022-03-04 17:26:08 training_augmentation: None
2022-03-04 17:26:08 training_dataset: None
2022-03-04 17:26:08 training_loss: None
2022-03-04 17:26:08 validation_augmentation: None
2022-03-04 17:26:08 validation_dataset: SintelTrainingCleanFull
2022-03-04 17:26:08 validation_dataset_photometric_augmentations: False
2022-03-04 17:26:08 validation_dataset_root: /home/liuyuqiao/MPI-Sintel-complete/
2022-03-04 17:26:08 validation_key: epe
2022-03-04 17:26:08 validation_key_minimize: True
2022-03-04 17:26:08 validation_loss: MultiScaleEPE_PWC_Bi_Occ_upsample
2022-03-04 17:26:08 ==> Random Seeds
2022-03-04 17:26:08 Python seed: 1
2022-03-04 17:26:08 Numpy seed: 2
2022-03-04 17:26:08 Torch CPU seed: 3
2022-03-04 17:26:08 Torch CUDA seed: 4
2022-03-04 17:26:08 ==> Datasets
2022-03-04 17:26:08 Validation Dataset: SintelTrainingCleanFull
2022-03-04 17:26:08 basedir: training/clean/alley_1
2022-03-04 17:26:08 input1: [3, 436, 1024]
2022-03-04 17:26:08 input2: [3, 436, 1024]
2022-03-04 17:26:08 target1: [2, 436, 1024]
2022-03-04 17:26:08 target_occ1: [1, 436, 1024]
2022-03-04 17:26:08 num_examples: 1041
2022-03-04 17:26:08 ==> Runtime Augmentations
2022-03-04 17:26:08 training_augmentation: None
2022-03-04 17:26:08 validation_augmentation: None
2022-03-04 17:26:08 ==> Model and Loss
2022-03-04 17:26:08 Initializing MSRA
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:114: UserWarning:
Found GPU0 GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org

warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
2022-03-04 17:33:19 Batch Size: 4
2022-03-04 17:33:19 GPGPU: Cuda
2022-03-04 17:33:19 Network: IRR_PWC
2022-03-04 17:33:19 Number of parameters: 6362092
2022-03-04 17:33:19 Validation Key: epe
2022-03-04 17:33:19 Validation Loss: MultiScaleEPE_PWC_Bi_Occ_upsample
2022-03-04 17:33:19 ==> Checkpoint
2022-03-04 17:33:19 ==> Save Directory
2022-03-04 17:33:19 Save directory: saved_check_point/pwcnet/eval_temp/IRR_PWC
2022-03-04 17:33:19 ==> Optimizer
2022-03-04 17:33:19 Adam
2022-03-04 17:33:19 amsgrad: False
2022-03-04 17:33:19 betas: (0.9, 0.999)
2022-03-04 17:33:19 eps: 1e-08
2022-03-04 17:33:19 lr: 0.001
2022-03-04 17:33:19 weight_decay: 0
2022-03-04 17:33:19 ==> Learning Rate Scheduler
2022-03-04 17:33:19 class: None
2022-03-04 17:33:19 ==> Runtime
2022-03-04 17:33:19 start_epoch: 1
2022-03-04 17:33:19 total_epochs: 1

==> Progress: 0%| | 0/1 00:00<? ?s/ep

2022-03-04 17:33:19 ==> Epoch 1/1
==> Validate: 0%| | 0/261 00:00<? ?it/s
Traceback (most recent call last):
  File "../../main.py", line 89, in <module>
    main()
  File "../../main.py", line 86, in main
    validation_augmentation=validation_augmentation)
  File "/home/liuyuqiao/irr/runtime.py", line 555, in exec_runtime
    augmentation=validation_augmentation).run()
  File "/home/liuyuqiao/irr/runtime.py", line 427, in run
    loss_dict_per_step, output_dict, batch_size = self._step(example_dict)
  File "/home/liuyuqiao/irr/runtime.py", line 384, in _step
    loss_dict, output_dict = self._model_and_loss(example_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/configuration.py", line 49, in forward
    output_dict = self._model(example_dict)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/models/IRR_PWC.py", line 59, in forward
    x1_pyramid = self.feature_pyramid_extractor(x1_raw) + [x1_raw]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuyuqiao/irr/models/pwc_modules.py", line 101, in forward
    x = conv(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR

Looking forward to your reply.

Junhwa Hur · Answer 1 · Sat Mar 05 2022 17:56:56 GMT+0800 (China Standard Time)

Hi,

Can it be due to this warning in your log?

/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:114: UserWarning: Found GPU0 GeForce RTX 2080 Ti which requires CUDA_VERSION >= 9000 for optimal performance and fast startup time, but your PyTorch was compiled with CUDA_VERSION 8000. Please install the correct PyTorch binary using instructions from http://pytorch.org/
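
To check whether the installed PyTorch build matches the GPU, something like the following may help. It is only a sketch: `MIN_CUDA_FOR_CAPABILITY`, `parse_version`, `build_supports_gpu`, and `report` are hypothetical names that are not part of PyTorch or this repository, and the table lists just a few compute capabilities with the first CUDA toolkit release that supports them natively.

```python
# Hypothetical lookup table (illustration only): first CUDA toolkit release
# with native support for a few GPU compute capabilities.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. GeForce RTX 2080 Ti
    (8, 0): (11, 0),  # Ampere
}


def parse_version(text):
    """Turn a version string such as '8.0' or '8.0.61' into (major, minor)."""
    return tuple(int(part) for part in text.split(".")[:2])


def build_supports_gpu(build_cuda, capability):
    """Return True/False if the capability is in the table, else None."""
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        return None
    return parse_version(build_cuda) >= required


def report():
    """Print the versions PyTorch was built with, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("torch:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)  # None for CPU-only builds
    print("cuDNN:", torch.backends.cudnn.version())
    if torch.cuda.is_available() and torch.version.cuda is not None:
        capability = torch.cuda.get_device_capability(0)
        print("GPU 0 compute capability:", capability)
        print("build covers this GPU:",
              build_supports_gpu(torch.version.cuda, capability))


if __name__ == "__main__":
    report()
```

With the versions from the log, `build_supports_gpu("8.0", (7, 5))` returns `False`, which is consistent with the warning: a CUDA 8.0 build predates the RTX 2080 Ti.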

lyq998 · Answer 2 · Mon Mar 07 2022 14:34:00 GMT+0800 (China Standard Time)

This warning is caused by the torch 0.4.1 build I used.
To get rid of it, I tried switching to torch 1.1.0 and 1.5.0, but then a new error occurred:
Traceback (most recent call last):
  File "../../main.py", line 5, in <module>
    import commandline
  File "/home/liuyuqiao/irr_37/irr/commandline.py", line 14, in <module>
    import models
  File "/home/liuyuqiao/irr_37/irr/models/__init__.py", line 8, in <module>
    from . import pwcnet
  File "/home/liuyuqiao/irr_37/irr/models/pwcnet.py", line 8, in <module>
    from .correlation_package.correlation import Correlation
  File "/home/liuyuqiao/irr_37/irr/models/correlation_package/correlation.py", line 4, in <module>
    import correlation_cuda
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory
I used CUDA 10.0 this time, but the ImportError refers to CUDA 8.0.
Do you have any idea?
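
One plausible reading of this ImportError is that the compiled `correlation_cuda` extension was built against the CUDA 8.0 runtime and still tries to load it, whichever CUDA version is installed now. The sketch below only pulls the expected runtime version out of the loader message so it can be compared with the installed toolkit; `required_cudart_version` is a hypothetical helper, not part of the repository.

```python
import re


def required_cudart_version(error_message):
    """Extract the CUDA runtime version named in a 'libcudart.so.X.Y' loader error."""
    match = re.search(r"libcudart\.so\.(\d+(?:\.\d+)*)", error_message)
    return match.group(1) if match else None


# Message copied from the ImportError above.
message = "libcudart.so.8.0: cannot open shared object file: No such file or directory"
print(required_cudart_version(message))  # prints 8.0
```

If the version printed here differs from the toolkit PyTorch was built with, the extension most likely needs to be rebuilt in the new environment.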

Junhwa Hur · Answer 3 · Mon Mar 07 2022 19:47:04 GMT+0800 (China Standard Time)

Ah, I didn't update the PWC-Net baseline in pwcnet.py to use the current version of the correlation layer.
Could you edit it yourself, following this comment?
#43 (comment)

lyq998 · Answer 4 · Tue Mar 08 2022 13:51:43 GMT+0800 (China Standard Time)

I have tried both versions:
# correlation
out_corr_f = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x1, x2_warp)
out_corr_b = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x2, x1_warp)
# out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params)
# out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)
and
# correlation
# out_corr_f = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x1, x2_warp)
# out_corr_b = Correlation(pad_size=self.search_range, kernel_size=1, max_displacement=self.search_range, stride1=1, stride2=1, corr_multiply=1)(x2, x1_warp)
out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params)
out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)
but neither works. Both fail with RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR.
I'm using torch==0.4.1, because 1.5.0 and 1.1.0 lead to ImportError: libcudart.so.8.0.

Junhwa Hur · Answer 5 · Tue Mar 08 2022 17:46:16 GMT+0800 (China Standard Time)

What's the error message when you use this?

out_corr_f = compute_cost_volume(x1, x2_warp, self.corr_params) 
out_corr_b = compute_cost_volume(x2, x1_warp, self.corr_params)

I'm not sure, but the problem more likely comes from a version mismatch between PyTorch and CUDA/cuDNN.

Also, comment out this line:

from .correlation_package.correlation import Correlation

so that the source code doesn't import the correlation layer written in CUDA.
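
As an alternative to commenting the import out by hand, the import could be guarded so that a missing or mismatched CUDA extension does not stop the program at startup. This is a minimal sketch, assuming the calling code can branch on a flag; `load_optional` and `USE_CUDA_CORRELATION` are hypothetical names and are not in the repository.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it (or a shared library it
    links against) cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # A missing libcudart.so.X.Y surfaces here as an ImportError.
        print(f"{module_name} unavailable: {exc}")
        return None


# Try the compiled CUDA extension named in the traceback above.
correlation_cuda = load_optional("correlation_cuda")

# Hypothetical flag: when False, the model would call compute_cost_volume
# instead of the CUDA Correlation layer.
USE_CUDA_CORRELATION = correlation_cuda is not None
```

Note that this only avoids the ImportError. It does not address the CUDNN_STATUS_MAPPING_ERROR, which is raised in an ordinary convolution before any correlation layer is reached.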