graphdeco-inria / gaussian-splatting

Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"

Home Page: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/


CUDA error: an illegal memory access was encountered

baoachun opened this issue · comments

I have identified that the identifyTileRanges function is causing the issue, but I'm not quite sure how to resolve it. Do you have any constructive suggestions?
https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/59f5f77e3ddbac3ed9db93ec2cfe99ed6c5d121d/cuda_rasterizer/rasterizer_impl.cu#L116

Oddly, I didn't encounter any issues when I directly loaded the saved input parameters for rendering.
dump_file.zip

PyTorch 1.13.1
CUDA 11.4
A100 SXM4 80GB

After several days of debugging, I found that the concatenation of keys might be causing anomalies, leading to memory errors. Storing them separately instead of concatenating them resolved this issue. However, another exception then occurred in the FORWARD::render function, and the problem could not be reproduced with the parameters saved in debug mode.

Does it happen on the first iteration or later on? Can you check whether you have NaNs in any of the tensors fed to the rasterisation function?
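A minimal sketch of such a check (the accessor names assume the GaussianModel properties from this repo; adapt them to whatever you actually pass to the rasteriser):

```python
import torch

def assert_finite(**tensors):
    """Raise if any named tensor contains NaN or Inf values."""
    for name, t in tensors.items():
        if t is not None and not torch.isfinite(t).all():
            bad = int((~torch.isfinite(t)).sum())
            raise ValueError(f"{name} has {bad} non-finite entries at this iteration")

# Example call right before the rasterisation (property names assumed from
# the reference GaussianModel; adjust to your own training loop):
# assert_finite(
#     means3D=gaussians.get_xyz,
#     scales=gaussians.get_scaling,
#     rotations=gaussians.get_rotation,
#     opacity=gaussians.get_opacity,
#     shs=gaussians.get_features,
# )
```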

@PanagiotisP Yes, the issue tends to occur after hundreds of iterations, though sometimes it takes thousands of iterations to appear. I have checked the input parameters and found no NaN values. However, I noticed that the error occurs because currtile or prevtile exceeds the length of ranges, resulting in an out-of-bounds memory access. At that point, currtile or prevtile holds a very large, strange value, such as prevtile=3210786815 or currtile=1072937470. Do you have any insights on this?

No, I'm sorry. I'm not very familiar with the tiling procedure, so NaN was my best bet. In your place, I think I would also check whether extremely large scale values appear for some reason (e.g. due to a regularisation term gone bad). But other than that, your guess is as good as mine.
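Something along these lines could serve as a quick check (a sketch only; the threshold is arbitrary, and gaussians.get_scaling is assumed from the reference GaussianModel):

```python
import torch

def report_extreme_scales(scaling: torch.Tensor, max_scale: float = 100.0) -> None:
    """Print a warning if any Gaussian's activated scale grows suspiciously large.

    `scaling` is the (N, 3) activated scale tensor; `max_scale` is an
    arbitrary threshold in world units, to be tuned to the scene extent.
    """
    oversized = (scaling > max_scale).any(dim=1)
    if bool(oversized.any()):
        print(f"[warn] {int(oversized.sum())} Gaussians exceed scale {max_scale} "
              f"(max seen: {scaling.max().item():.3e})")

# Usage (accessor name assumed from the reference GaussianModel):
# report_extreme_scales(gaussians.get_scaling)
```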

@baoachun Did you solve this problem? I also got very large, strange numbers for currtile and prevtile.

@Devlee247 Yes, I changed the concatenation of ID and depth so that they are stored separately, and that bug is fixed. However, I still encounter errors in the FORWARD::render function, and I suspect there are other unresolved issues in it.

@baoachun Thank you for sharing. I also fixed it by adding torch.cuda.empty_cache() in the forward function (of the GaussianSplatting PyTorch class).
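For anyone trying the same workaround, a minimal sketch (the wrapper function is hypothetical; note that empty_cache() only returns unused cached allocator blocks to the driver and cannot repair a genuine out-of-bounds access inside a kernel):

```python
import torch

def render_with_cache_release(render_fn, *args, **kwargs):
    """Release cached allocator blocks before the rasterisation call.

    `render_fn` stands in for whatever render/forward function you use.
    """
    torch.cuda.empty_cache()
    return render_fn(*args, **kwargs)

# Usage (argument names assumed from this repo's training script; adapt as needed):
# out = render_with_cache_release(render, viewpoint_cam, gaussians, pipe, background)
```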

@Devlee247 Thank you for sharing. Unfortunately, this method does not solve my problem.

@PanagiotisP How can I solve the problem where the Gaussian primitives' attributes become NaN after training for thousands of iterations? I checked for NaN values and replaced them all with 0, but in the remaining iterations the loss still becomes NaN.

I am not sure I can help with that, as NaN propagates instantly to everything. Usually, you want to ensure that you don't perform any obviously illegal operations, such as dividing by zero or taking the log of a non-positive number.
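A common guarding pattern for such operations, as a generic sketch (not code from this repo; the epsilon value is arbitrary):

```python
import torch

EPS = 1e-8  # arbitrary small constant; choose one suited to your value ranges

def safe_log(x: torch.Tensor) -> torch.Tensor:
    """log with the input clamped away from non-positive values."""
    return torch.log(x.clamp_min(EPS))

def safe_div(num: torch.Tensor, den: torch.Tensor) -> torch.Tensor:
    """Division with the denominator pushed away from zero (sign preserved)."""
    den = torch.where(den >= 0, den.clamp_min(EPS), den.clamp_max(-EPS))
    return num / den

def safe_normalize(v: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Vector normalisation that avoids dividing by a zero norm."""
    return v / v.norm(dim=dim, keepdim=True).clamp_min(EPS)
```

This is also likely why zeroing NaNs after they appear does not help: once the gradients (and the optimiser's moment buffers) contain NaNs, every subsequent update is poisoned, so the offending operation has to be guarded before it runs.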