lliuz / ARFlow

The official PyTorch implementation of the paper "Learning by Analogy: Reliable Supervision from Transformations for Unsupervised Optical Flow Estimation".

Illegal memory access during back propagation unit test

5had3z opened this issue · comments

Hi, I am having issues running correlation_native.py during the backward phase:
RuntimeError: CUDA error: an illegal memory access was encountered
I first modified your implementation to update it to PyTorch 1.6.0 and ran into this issue.
I then tried to use your Dockerfile; however, jonathonf has removed his Python 3.6 repository for Ubuntu 16.04, so I made the following changes to the Dockerfile:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
RUN pip3 install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl

This still resulted in errors during the backpropagation stage, specifically in:
correlation_backward_input1
correlation_backward_input2

I tried printing the dims to make sure the tensor shapes were correct in some of the functions:
(Pytorch backward)
Grad Output torch.Size([4, 81, 120, 120])
Input Dims torch.Size([4, 256, 128, 128])

(correlation_backward_cuda)
Input batch: 4 ch: 256 h: 128 w: 128

(correlation_backward_cuda_kernel, after channels_first calls)
rInput batch: 4 ch: 128 h: 128 w: 256
gradInput1 batch: 4 ch: 256 h: 128 w: 128
gradOutput batch: 4 ch: 81 h: 120 w: 120

Any idea where the issue is arising from? Is there a subtle difference introduced by changing CUDA 9 -> 10 in the Docker image?

If I change the max_displacement to 1 or 2 and C=H=W=64, it works fine.
But with C=H=W=128 it doesn't work (with max_displacement 1 or 2).
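
As an aside, the printed shapes already point at an undersized pad_size, assuming the FlowNet-style correlation conventions (output channels = (2*max_displacement/stride2 + 1)^2; with kernel_size=1 and stride1=1, output size = input size + 2*pad_size - 2*max_displacement). A quick sanity check of that arithmetic:

# Shape arithmetic for a FlowNet-style correlation layer (an assumption about
# how correlation_cuda computes its output size; kernel_size=1, stride 1).
max_displacement = 4   # gives (2*4 + 1)**2 = 81 output channels
pad_size = 0           # hypothetical value that reproduces the printed 120x120
in_size = 128

out_channels = (2 * max_displacement + 1) ** 2             # 81
out_size = in_size + 2 * pad_size - 2 * max_displacement   # 120
print(out_channels, out_size)  # matches gradOutput torch.Size([4, 81, 120, 120])

With pad_size >= max_displacement the output stays 128x128 and the displaced reads stay inside the padded buffer, which lines up with the pad_size finding further down in the thread.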

This is an issue with the correlation_cuda package. I am not very familiar with CUDA programming, so I may not be able to help you solve this problem.

If you have trouble with this package during training, you can alternatively use my PyTorch implementation (it is correct, although somewhat slower).

Since the correlation_cuda package is widely used in other projects, such as ClementPinard/Pytorch-Correlation-extension and NVIDIA/flownet2-pytorch, you can refer to those repos for help.
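
For readers who want the pure-PyTorch route, here is a minimal sketch of what such a correlation layer can look like (illustrative only, not necessarily the exact code in this repo; kernel_size=1 and stride 1 assumed):

import torch
import torch.nn.functional as F

def correlation_native(x1, x2, max_displacement=4):
    # All-pairs cost volume with (2*d+1)**2 output channels and the same H/W
    # as the inputs; the per-pixel dot product is averaged over channels.
    b, c, h, w = x1.shape
    d = max_displacement
    x2_pad = F.pad(x2, (d, d, d, d))  # zero-pad so every shift stays in bounds
    out = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            x2_shift = x2_pad[:, :, dy:dy + h, dx:dx + w]
            out.append((x1 * x2_shift).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)

x1 = torch.randn(4, 256, 128, 128, requires_grad=True)
x2 = torch.randn(4, 256, 128, 128, requires_grad=True)
cost = correlation_native(x1, x2)   # [4, 81, 128, 128]
cost.sum().backward()               # backward works through plain autograd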

From some testing that I did, there are access requests for index -1 during some operations:
in the correlation forward kernel during the element-wise product sum (prod_sum += rInput1*rInput2),
and in correlation_backward_input1 when reading from rInput2.

In my own code I skip these operations with a boundary check and consequently no longer have this issue.

Thanks for sharing and I'm glad you could find a workaround in the end!

Hi, I have met the same issue as you. May I ask how you use a boundary check to skip the operations you mentioned above? @5had3z

@sun0215 I've got the checks in my re-implementation, but they're commented out because it turns out this issue is due to insufficient padding (the pad_size variable). You won't access out of bounds if it is large enough; just search for the smallest value that works for you.

The bounds checks are commented out because, as far as I know, CUDA cores have no branch prediction, so you pay the full cost of those checks; they're still in the code for future reference.
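
To make the pad_size point concrete, a hedged example of the fix on the caller side (this assumes the FlowNet-style Correlation constructor used by the correlation_package; the exact import path and argument names may differ in your copy):

# The import path below is an assumption; adjust it to wherever the
# correlation_package lives in your checkout.
from models.correlation_package.correlation import Correlation

# Displaced reads can leave the padded buffer when pad_size < max_displacement,
# which is what shows up as the illegal memory access reported above.
# Keeping pad_size at least as large as max_displacement avoids it.
corr = Correlation(pad_size=4, kernel_size=1, max_displacement=4,
                   stride1=1, stride2=1, corr_multiply=1)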