RuntimeError: CUDNN_STATUS_BAD_PARAM in loss.backward()
manmanCover opened this issue
Thank you for your wonderful code! Have you encountered this problem before, and do you know how to solve it?
```
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-3021f006d740>", line 3, in <module>
runfile('/home/Sarah/project/main_gpu.py', args=[---], wdir='/home/Sarah/project')
File "/home/Sarah/pycharm/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/home/Sarah/project/main_gpu.py", line 244, in <module>
main()
File "/home/Sarah/project/main_gpu.py", line 212, in main
loss = train(imgL_crop, imgR_crop, disp_crop_L)
File "/home/Sarah/project/main_gpu.py", line 161, in train
loss.backward()
File "/home/Sarah/py40/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/Sarah/py40/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDNN_STATUS_BAD_PARAM
```
Since the error occurred in the backward step, the network must have finished the forward pass, so I suspect the problem is running out of memory.
The spatial attention method can consume a huge amount of memory. You can try placing the PSANet module on the last conv layer or reducing the size of the input features.
If it is not an out-of-memory problem, please share more information about your environment, such as your PyTorch version, CUDA version, cuDNN version, etc.
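If you want to check the out-of-memory hypothesis quickly, something like this can log the allocated memory around the forward pass (a rough sketch; torch.cuda.memory_allocated and torch.cuda.max_memory_allocated are available from PyTorch 0.4 on):

```python
import torch

def log_cuda_memory(tag):
    # Currently allocated and peak allocated memory on the default
    # CUDA device, in MiB.
    alloc = torch.cuda.memory_allocated() / 2.0 ** 20
    peak = torch.cuda.max_memory_allocated() / 2.0 ** 20
    print('%s: %.0f MiB allocated, %.0f MiB peak' % (tag, alloc, peak))

# For example, around the forward pass inside train():
#   log_cuda_memory('before forward')
#   ...forward pass, loss computation...
#   log_cuda_memory('after forward')  # a peak near the card's capacity
#   loss.backward()                   # makes an OOM in backward plausible
```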
@cfzd Hi, I think I found the problem.
In PSANetFunc.py, the backward methods of both PSANetCollectFunction and PSANetDistributeFunction contain the following initialization:
```python
b1_grad_n = mask_grad.shape[0]
b1_grad_c = (2 * mask_grad.shape[2] - 1) * (2 * mask_grad.shape[3] - 1)
b1_grad_h = mask_grad.shape[2]
b1_grad_w = mask_grad.shape[3]
bottom1_grad = torch.zeros(b1_grad_n, b1_grad_c, b1_grad_w, b1_grad_h).cuda()  # <-- buggy line
```
b1_grad_w and b1_grad_h should be swapped in the bottom1_grad initialization, since PyTorch tensors are laid out as (N, C, H, W).
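For reference, a corrected initialization would be (same variable names as in the snippet above):

```python
# Height before width, matching PyTorch's (N, C, H, W) layout.
bottom1_grad = torch.zeros(b1_grad_n, b1_grad_c, b1_grad_h, b1_grad_w).cuda()
```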
@manmanCover Sorry, I have fixed the code. It's really confusing that I never encountered this problem, and my experiments even achieved better performance. I think I need to release a benchmark soon.
@cfzd Maybe your samples are all square, so the swapped width and height never caused a shape mismatch.
By the way, have you noticed that the memory consumption of your implementation is unbalanced? When I ran the project on 2 GPUs, one of them was almost fully occupied:
```
sarah@Battlebox2:~$ nvidia-smi
Mon Jan 21 13:18:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 49% 80C P2 190W / 250W | 7799MiB / 12196MiB | 72% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 56% 84C P2 186W / 250W | 12044MiB / 12188MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 12251 C python 7789MiB |
| 1 12251 C python 7693MiB |
+-----------------------------------------------------------------------------+
```
@manmanCover I tested my implementation with moderate and aggressive memory settings. I found slightly unbalanced memory consumption, but neither of my GPUs was fully occupied:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 0000:05:00.0 Off | N/A |
| 52% 71C P2 101W / 250W | 4879MiB / 11170MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 0000:06:00.0 Off | N/A |
| 60% 79C P2 108W / 250W | 5485MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 17789 C python 4877MiB |
| 1 17789 C python 5483MiB |
+-----------------------------------------------------------------------------+
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 0000:05:00.0 Off | N/A |
| 53% 71C P2 70W / 250W | 8999MiB / 11170MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 0000:06:00.0 Off | N/A |
| 62% 81C P2 83W / 250W | 9785MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18510 C python 8995MiB |
| 1 18510 C python 9781MiB |
+-----------------------------------------------------------------------------+
```
@cfzd Thank you for your test. I have checked that my input feature size is [32, 64, 128]; how about yours?
By the way, must the input images from the training dataset and the test dataset be the same size?
@cfzd By the way, does your implementation also use sliding windows? It seems not...
@manmanCover
64x128 is a large feature size for a spatial attention method. In this case, the "over-completed map" would have a feature size of [32385, 64, 128]. In my implementation, I always keep the channel size of the "over-completed map" below 10000, because the spatial attention information doesn't have to be that precise.
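To make the scale concrete, you can compute the channel count and a rough memory estimate directly (plain Python, fp32 assumed, no project code involved):

```python
def overcomplete_channels(h, w):
    # One channel per possible relative offset: (2H - 1) * (2W - 1).
    return (2 * h - 1) * (2 * w - 1)

print(overcomplete_channels(64, 128))  # 32385, as quoted above
print(overcomplete_channels(32, 64))   # 8001, under the ~10000 budget

# Rough fp32 cost of one [32385, 64, 128] over-completed map:
print(overcomplete_channels(64, 128) * 64 * 128 * 4 / 2.0 ** 30)  # ~0.99 GiB per sample
```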
As for multi-scale testing, you can use an adaptive pooling module before this attention module.
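A rough sketch of that idea, assuming a generic attention module (psa_module is a stand-in, not the exact API of this repo):

```python
import torch.nn as nn
import torch.nn.functional as F

class FixedSizeAttention(nn.Module):
    # Pool features to a fixed spatial size before the attention module,
    # then upsample its output back, so inputs of different scales all
    # pay the same attention cost.
    def __init__(self, psa_module, pooled_size=(32, 64)):
        super(FixedSizeAttention, self).__init__()
        self.psa_module = psa_module  # stand-in for the collect/distribute module
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)

    def forward(self, x):
        h, w = x.size(2), x.size(3)
        out = self.psa_module(self.pool(x))
        return F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
```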
I didn't find any description of sliding windows in the paper, and I don't see any reason to use sliding windows, as it is an attention module.
@cfzd Yeah, adaptive pooling could also be a choice. The author of PSANet said they use sliding windows with different input sizes (hszhao/PSANet#11 (comment)).
Here's how they use sliding windows: https://github.com/hszhao/PSANet/blob/master/evaluation/scale_process.m
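For anyone who doesn't read MATLAB, the logic of scale_process.m is roughly the following (a simplified sketch, not a faithful port; model, num_classes, crop_size, and stride are placeholders, and the original script's padding of images smaller than the crop is omitted):

```python
import torch

def sliding_window_predict(model, image, num_classes, crop_size, stride):
    # image: [1, C, H, W] tensor with H, W >= crop_size. Average the
    # model's predictions over overlapping crop_size x crop_size windows.
    _, _, h, w = image.size()
    scores = torch.zeros(1, num_classes, h, w)
    counts = torch.zeros(1, 1, h, w)
    tops = list(range(0, h - crop_size + 1, stride))
    lefts = list(range(0, w - crop_size + 1, stride))
    # Make sure the last row/column of windows reaches the image border.
    if tops[-1] != h - crop_size:
        tops.append(h - crop_size)
    if lefts[-1] != w - crop_size:
        lefts.append(w - crop_size)
    with torch.no_grad():
        for top in tops:
            for left in lefts:
                crop = image[:, :, top:top + crop_size, left:left + crop_size]
                out = model(crop).cpu()  # [1, num_classes, crop_size, crop_size]
                scores[:, :, top:top + crop_size, left:left + crop_size] += out
                counts[:, :, top:top + crop_size, left:left + crop_size] += 1
    return scores / counts
```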