CUDA error: an illegal memory access was encountered
baoachun opened this issue · comments
I have identified that the `identifyTileRanges` function is causing the issue, but I'm not quite sure how to resolve it. Do you have any constructive suggestions?
https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/59f5f77e3ddbac3ed9db93ec2cfe99ed6c5d121d/cuda_rasterizer/rasterizer_impl.cu#L116
Oddly, I didn't encounter any issues when I directly loaded the saved input parameters for rendering.
dump_file.zip
pytorch 1.13.1
cuda 11.4
A100 SXM4 80G
After several days of debugging, I found that the concatenation of the keys might be causing the anomalies that lead to the memory errors. Storing the two values separately instead of concatenating them resolved this issue. However, another exception then occurred in the `FORWARD::render` function, and that problem couldn't be reproduced with the parameters saved in debug mode.
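For context, the rasterizer sorts Gaussians by a 64-bit key that packs the tile ID into the high 32 bits and the depth (reinterpreted as 32 float bits) into the low 32 bits; `identifyTileRanges` then recovers the tile with a 32-bit right shift. A rough Python sketch of that packing scheme (the function names here are illustrative, not from the repository):

```python
import struct

def pack_key(tile_id: int, depth: float) -> int:
    """Pack a tile ID (high 32 bits) and a float depth (low 32 bits)
    into one 64-bit sort key, mirroring the rasterizer's scheme."""
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (tile_id << 32) | depth_bits

def unpack_tile(key: int) -> int:
    """Recover the tile ID, as identifyTileRanges does with `key >> 32`."""
    return key >> 32

key = pack_key(tile_id=7, depth=1.5)
assert unpack_tile(key) == 7

# If the high bits are ever corrupted (e.g. by an earlier out-of-bounds
# write), the recovered "tile ID" becomes a huge bogus number, which
# matches the symptom reported later in this thread.
corrupted = key | (0xBF5A0000 << 32)
assert unpack_tile(corrupted) > 7
```

Storing ID and depth separately, as described above, sidesteps any corruption of the combined key, though it doesn't explain the root cause of the corruption.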
Does it happen on the first iteration or later on? Can you check whether you have nans on any tensors fed to the rasterisation function?
@PanagiotisP Yes, the issue tends to occur after hundreds of iterations, though sometimes it takes thousands of iterations to appear. I have checked the input parameters and found no NaN values. However, I noticed that the error occurs because `currtile` or `prevtile` exceeds the length of `ranges`, resulting in an out-of-bounds memory access. At that point, `currtile` or `prevtile` holds a very large, strange value, such as `prevtile=3210786815` or `currtile=1072937470`. Do you have any insights on this?
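One way to localise this kind of failure is to guard the tile index before using it, which turns a silent illegal access into a detectable error. A simplified, sequential sketch of the tile-range pass with such a guard (hypothetical host-side code; the real kernel runs one thread per sorted key):

```python
def identify_tile_ranges(sorted_keys, num_tiles):
    """Sequential sketch of the tile-range pass: for each run of equal
    tile IDs in the sorted 64-bit keys, record the [start, end) index
    range, refusing tile IDs outside [0, num_tiles)."""
    ranges = [(0, 0)] * num_tiles
    for idx, key in enumerate(sorted_keys):
        currtile = key >> 32
        if currtile >= num_tiles:
            # Guard: a corrupted key (e.g. tile 3210786815) would
            # otherwise write far outside `ranges`.
            raise ValueError(
                f"tile id {currtile} out of range (num_tiles={num_tiles})")
        prevtile = sorted_keys[idx - 1] >> 32 if idx > 0 else None
        if idx == 0 or currtile != prevtile:
            if prevtile is not None:
                ranges[prevtile] = (ranges[prevtile][0], idx)
            ranges[currtile] = (idx, ranges[currtile][1])
    if sorted_keys:
        last = sorted_keys[-1] >> 32
        ranges[last] = (ranges[last][0], len(sorted_keys))
    return ranges

# Three tiles, five keys (depth bits omitted for brevity):
keys = [0, 0, 1 << 32, 2 << 32, 2 << 32]
assert identify_tile_ranges(keys, 3) == [(0, 2), (2, 3), (3, 5)]
```

In the CUDA kernel itself, the equivalent guard would be an `if (currtile < num_tiles)` check (or a debug `printf`) before indexing `ranges`; that won't fix the corrupted keys, but it pins down which keys are bad.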
No, I'm sorry. I'm not very familiar with the tiling procedure, so NaN was my best bet. In your place, I think I would also check whether extremely big scale values appear for some reason (e.g. due to a regularisation term gone bad). But other than that, your guess is as good as mine.
@baoachun Did you solve this problem? I also get very large, strange numbers in `currtile` and `prevtile`.
@Devlee247 Yes, I changed the concatenation of ID and depth so the two values are stored separately, and that bug is fixed. However, I still encounter errors in the `FORWARD::render` function, and I suspect there are other unresolved issues there.
@baoachun Thank you for sharing, I also fixed via adding torch.cuda.empty_cache() in the forward function. (GaussianSplatting pytorch Class)
@Devlee247 Thank you for sharing; unfortunately, this method does not solve my problem.
@PanagiotisP How can I solve the problem where the Gaussian primitives' attributes become NaN after training for thousands of iterations? I checked for NaN values and set them all to 0, but in the subsequent iterations all the losses still become NaN.
I am not sure I can help with that, as NaN propagates instantly to everything. Usually, you want to ensure that you don't perform any obviously illegal operations, like dividing by zero, taking the log of a non-positive number, etc.
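As a concrete illustration of that advice: clamping arguments before a log or division avoids producing the NaN in the first place, which is far more robust than zeroing NaNs after the fact. A minimal sketch (hypothetical helper names, not code from this repository; the epsilon is an assumption to tune for your value ranges):

```python
import math

EPS = 1e-8  # assumed floor to keep arguments away from singularities

def safe_log(x: float) -> float:
    """log with its argument clamped away from zero and negatives."""
    return math.log(max(x, EPS))

def safe_div(num: float, den: float) -> float:
    """division with the denominator kept away from zero."""
    return num / (den if abs(den) > EPS else math.copysign(EPS, den or 1.0))

# Once a NaN is produced, it poisons everything downstream:
nan = float("nan")
assert math.isnan(nan + 1.0) and math.isnan(nan * 0.0)

# Clamped variants stay finite even on degenerate inputs:
assert math.isfinite(safe_log(0.0))
assert math.isfinite(safe_div(1.0, 0.0))
```

The same idea applies to tensor code: clamp scales, opacities, and loss denominators at the point where they are computed, rather than patching NaNs once they have already spread.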