RuntimeError: CUDNN_STATUS_BAD_PARAM in loss.backward()
manmanCover opened this issue
Thank you for your wonderful code! Have you encountered this problem before, and do you know how to solve it?
```
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2883, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-3021f006d740>", line 3, in <module>
runfile('/home/Sarah/project/main_gpu.py', args=[---], wdir='/home/Sarah/project')
File "/home/Sarah/pycharm/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/home/Sarah/project/main_gpu.py", line 244, in <module>
main()
File "/home/Sarah/project/main_gpu.py", line 212, in main
loss = train(imgL_crop, imgR_crop, disp_crop_L)
File "/home/Sarah/project/main_gpu.py", line 161, in train
loss.backward()
File "/home/Sarah/py40/local/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/Sarah/py40/local/lib/python2.7/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDNN_STATUS_BAD_PARAM
```
Since the error occurred in the backward step, the network must have finished the forward pass, so I suspect the problem is running out of memory.
The spatial attention method can consume a huge amount of memory. You can try placing the PSANet module on the last conv layer or reducing the size of the input features.
If it is not an out-of-memory problem, please share more information about your environment, such as your PyTorch version, CUDA version, cuDNN version, etc.
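If you want to check the out-of-memory hypothesis quickly, something like this can log the allocated memory around the forward pass (a rough sketch; torch.cuda.memory_allocated and torch.cuda.max_memory_allocated are available from PyTorch 0.4 on):

```python
import torch

def log_cuda_memory(tag):
    # Currently allocated and peak allocated memory on the default
    # CUDA device, in MiB.
    alloc = torch.cuda.memory_allocated() / 2.0 ** 20
    peak = torch.cuda.max_memory_allocated() / 2.0 ** 20
    print('%s: %.0f MiB allocated, %.0f MiB peak' % (tag, alloc, peak))

# For example, around the forward pass inside train():
#   log_cuda_memory('before forward')
#   ...forward pass, loss computation...
#   log_cuda_memory('after forward')  # a peak near the card's capacity
#   loss.backward()                   # makes an OOM in backward plausible
```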
@cfzd Hi, I think I found the problem.
In PSANetFunc.py, the backward methods of both PSANetCollectFunction and PSANetDistributeFunction contain the following initialization:
```python
b1_grad_n = mask_grad.shape[0]
b1_grad_c = (2 * mask_grad.shape[2] - 1) * (2 * mask_grad.shape[3] - 1)
b1_grad_h = mask_grad.shape[2]
b1_grad_w = mask_grad.shape[3]
bottom1_grad = torch.zeros(b1_grad_n, b1_grad_c, b1_grad_w, b1_grad_h).cuda()  # <-- buggy line
```
b1_grad_w and b1_grad_h should be swapped in the bottom1_grad initialization, since PyTorch tensors are laid out as (N, C, H, W).
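For reference, a corrected initialization would be (same variable names as in the snippet above):

```python
# Height before width, matching PyTorch's (N, C, H, W) layout.
bottom1_grad = torch.zeros(b1_grad_n, b1_grad_c, b1_grad_h, b1_grad_w).cuda()
```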
@manmanCover Sorry, I have fixed the code. It's really confusing that I never encountered this problem, and my experiments even achieved better performance. I think I need to release a benchmark soon.
@cfzd Maybe your samples are all square, so the swapped width and height never caused a shape mismatch.
By the way, have you noticed that the memory consumption of your implementation is unbalanced? When I ran the project on 2 GPUs, one of them was almost fully occupied:
```
sarah@Battlebox2:~$ nvidia-smi
Mon Jan 21 13:18:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 49% 80C P2 190W / 250W | 7799MiB / 12196MiB | 72% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 56% 84C P2 186W / 250W | 12044MiB / 12188MiB | 89% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 12251 C python 7789MiB |
| 1 12251 C python 7693MiB |
+-----------------------------------------------------------------------------+
```
@manmanCover I tested my implementation with moderate and aggressive memory settings. I found slightly unbalanced memory consumption, but neither of my GPUs was fully occupied:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 0000:05:00.0 Off | N/A |
| 52% 71C P2 101W / 250W | 4879MiB / 11170MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 0000:06:00.0 Off | N/A |
| 60% 79C P2 108W / 250W | 5485MiB / 11172MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 17789 C python 4877MiB |
| 1 17789 C python 5483MiB |
+-----------------------------------------------------------------------------+
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... On | 0000:05:00.0 Off | N/A |
| 53% 71C P2 70W / 250W | 8999MiB / 11170MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... On | 0000:06:00.0 Off | N/A |
| 62% 81C P2 83W / 250W | 9785MiB / 11172MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18510 C python 8995MiB |
| 1 18510 C python 9781MiB |
+-----------------------------------------------------------------------------+
```
@cfzd Thank you for your test. I have checked that my input feature size is [32, 64, 128]; how about yours?
By the way, must the input images from the training dataset and the test dataset be the same size?
@cfzd By the way, does your implementation also use sliding windows? It seems not...
@manmanCover
64x128 is a large feature size for a spatial attention method. In this case, the "over-completed map" would have a feature size of [32385, 64, 128]. In my implementation, I always keep the channel size of the "over-completed map" below 10000, because the spatial attention information doesn't have to be that precise.
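To make the scale concrete, you can compute the channel count and a rough memory estimate directly (plain Python, fp32 assumed, no project code involved):

```python
def overcomplete_channels(h, w):
    # One channel per possible relative offset: (2H - 1) * (2W - 1).
    return (2 * h - 1) * (2 * w - 1)

print(overcomplete_channels(64, 128))  # 32385, as quoted above
print(overcomplete_channels(32, 64))   # 8001, under the ~10000 budget

# Rough fp32 cost of one [32385, 64, 128] over-completed map:
print(overcomplete_channels(64, 128) * 64 * 128 * 4 / 2.0 ** 30)  # ~0.99 GiB per sample
```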
As for multi-scale testing, you can use an adaptive pooling module before this attention module.
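A rough sketch of that idea, assuming a generic attention module (psa_module is a stand-in, not the exact API of this repo):

```python
import torch.nn as nn
import torch.nn.functional as F

class FixedSizeAttention(nn.Module):
    # Pool features to a fixed spatial size before the attention module,
    # then upsample its output back, so inputs of different scales all
    # pay the same attention cost.
    def __init__(self, psa_module, pooled_size=(32, 64)):
        super(FixedSizeAttention, self).__init__()
        self.psa_module = psa_module  # stand-in for the collect/distribute module
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)

    def forward(self, x):
        h, w = x.size(2), x.size(3)
        out = self.psa_module(self.pool(x))
        return F.interpolate(out, size=(h, w), mode='bilinear', align_corners=False)
```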
I didn't find any description of sliding windows in the paper, and I don't see any reason to use sliding windows, as it is an attention module.
@cfzd Yeah, adaptive pooling could also be a choice. The author of PSANet said they use sliding windows with different input sizes (hszhao/PSANet#11 (comment)).
Here's how they use sliding windows: https://github.com/hszhao/PSANet/blob/master/evaluation/scale_process.m
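For anyone who doesn't read MATLAB, the logic of scale_process.m is roughly the following (a simplified sketch, not a faithful port; model, num_classes, crop_size, and stride are placeholders, and the original script's padding of images smaller than the crop is omitted):

```python
import torch

def sliding_window_predict(model, image, num_classes, crop_size, stride):
    # image: [1, C, H, W] tensor with H, W >= crop_size. Average the
    # model's predictions over overlapping crop_size x crop_size windows.
    _, _, h, w = image.size()
    scores = torch.zeros(1, num_classes, h, w)
    counts = torch.zeros(1, 1, h, w)
    tops = list(range(0, h - crop_size + 1, stride))
    lefts = list(range(0, w - crop_size + 1, stride))
    # Make sure the last row/column of windows reaches the image border.
    if tops[-1] != h - crop_size:
        tops.append(h - crop_size)
    if lefts[-1] != w - crop_size:
        lefts.append(w - crop_size)
    with torch.no_grad():
        for top in tops:
            for left in lefts:
                crop = image[:, :, top:top + crop_size, left:left + crop_size]
                out = model(crop).cpu()  # [1, num_classes, crop_size, crop_size]
                scores[:, :, top:top + crop_size, left:left + crop_size] += out
                counts[:, :, top:top + crop_size, left:left + crop_size] += 1
    return scores / counts
```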