CUDA error: an illegal memory access was encountered
baoachun opened this issue · comments
I have identified that the `identifyTileRanges` function is causing the issue, but I'm not quite sure how to resolve it. Do you have any constructive suggestions?
https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/59f5f77e3ddbac3ed9db93ec2cfe99ed6c5d121d/cuda_rasterizer/rasterizer_impl.cu#L116
Oddly, I didn't encounter any issues when I directly loaded the saved input parameters for rendering.
dump_file.zip
pytorch 1.13.1
cuda 11.4
A100 SXM4 80G
After several days of debugging, I found that the concatenation of the keys might be causing the anomalies that lead to the memory errors. Storing the two values separately instead of concatenating them resolved this issue. However, another exception then occurred in the `FORWARD::render` function, and that problem couldn't be reproduced with the parameters saved in debug mode.
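For context, the rasterizer sorts Gaussians by a 64-bit key that packs the tile ID into the high 32 bits and the depth (reinterpreted as 32 float bits) into the low 32 bits; `identifyTileRanges` then recovers the tile with a 32-bit right shift. A rough Python sketch of that packing scheme (the function names here are illustrative, not from the repository):

```python
import struct

def pack_key(tile_id: int, depth: float) -> int:
    """Pack a tile ID (high 32 bits) and a float depth (low 32 bits)
    into one 64-bit sort key, mirroring the rasterizer's scheme."""
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (tile_id << 32) | depth_bits

def unpack_tile(key: int) -> int:
    """Recover the tile ID, as identifyTileRanges does with `key >> 32`."""
    return key >> 32

key = pack_key(tile_id=7, depth=1.5)
assert unpack_tile(key) == 7

# If the high bits are ever corrupted (e.g. by an earlier out-of-bounds
# write), the recovered "tile ID" becomes a huge bogus number, which
# matches the symptom reported later in this thread.
corrupted = key | (0xBF5A0000 << 32)
assert unpack_tile(corrupted) > 7
```

Storing ID and depth separately, as described above, sidesteps any corruption of the combined key, though it doesn't explain the root cause of the corruption.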
Does it happen on the first iteration or later on? Can you check whether you have nans on any tensors fed to the rasterisation function?
@PanagiotisP Yes, the issue tends to occur after hundreds of iterations, though sometimes it takes thousands of iterations to appear. I have checked the input parameters and found no NaN values. However, I noticed that the error occurs because `currtile` or `prevtile` exceeds the length of `ranges`, resulting in an out-of-bounds memory access. At that point, `currtile` or `prevtile` holds a very large, strange value, such as `prevtile=3210786815` or `currtile=1072937470`. Do you have any insights on this?
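One way to localise this kind of failure is to guard the tile index before using it, which turns a silent illegal access into a detectable error. A simplified, sequential sketch of the tile-range pass with such a guard (hypothetical host-side code; the real kernel runs one thread per sorted key):

```python
def identify_tile_ranges(sorted_keys, num_tiles):
    """Sequential sketch of the tile-range pass: for each run of equal
    tile IDs in the sorted 64-bit keys, record the [start, end) index
    range, refusing tile IDs outside [0, num_tiles)."""
    ranges = [(0, 0)] * num_tiles
    for idx, key in enumerate(sorted_keys):
        currtile = key >> 32
        if currtile >= num_tiles:
            # Guard: a corrupted key (e.g. tile 3210786815) would
            # otherwise write far outside `ranges`.
            raise ValueError(
                f"tile id {currtile} out of range (num_tiles={num_tiles})")
        prevtile = sorted_keys[idx - 1] >> 32 if idx > 0 else None
        if idx == 0 or currtile != prevtile:
            if prevtile is not None:
                ranges[prevtile] = (ranges[prevtile][0], idx)
            ranges[currtile] = (idx, ranges[currtile][1])
    if sorted_keys:
        last = sorted_keys[-1] >> 32
        ranges[last] = (ranges[last][0], len(sorted_keys))
    return ranges

# Three tiles, five keys (depth bits omitted for brevity):
keys = [0, 0, 1 << 32, 2 << 32, 2 << 32]
assert identify_tile_ranges(keys, 3) == [(0, 2), (2, 3), (3, 5)]
```

In the CUDA kernel itself, the equivalent guard would be an `if (currtile < num_tiles)` check (or a debug `printf`) before indexing `ranges`; that won't fix the corrupted keys, but it pins down which keys are bad.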
No, I'm sorry. I'm not very familiar with the tiling procedure, so NaN was my best bet. In your place, I think I would also check whether extremely big scale values appear for some reason (e.g. due to a regularisation term gone bad). But other than that, your guess is as good as mine.
@baoachun Did you solve this problem? I also get very large, strange numbers in `currtile` and `prevtile`.
@Devlee247 Yes, I changed the concatenation of ID and depth so the two values are stored separately, and that bug is fixed. However, I still encounter errors in the `FORWARD::render` function, and I suspect there are other unresolved issues there.
@baoachun Thank you for sharing, I also fixed via adding torch.cuda.empty_cache() in the forward function. (GaussianSplatting pytorch Class)
@Devlee247 Thank you for sharing; unfortunately, this method does not solve my problem.
@PanagiotisP How can I solve the problem where the Gaussian primitives' attributes become NaN after training for thousands of iterations? I checked for NaN values and set them all to 0, but in the subsequent iterations all the losses still become NaN.
I am not sure I can help with that, as NaN propagates instantly to everything. Usually, you want to ensure that you don't perform any obviously illegal operations, like dividing by zero, taking the log of a non-positive number, etc.
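As a concrete illustration of that advice: clamping arguments before a log or division avoids producing the NaN in the first place, which is far more robust than zeroing NaNs after the fact. A minimal sketch (hypothetical helper names, not code from this repository; the epsilon is an assumption to tune for your value ranges):

```python
import math

EPS = 1e-8  # assumed floor to keep arguments away from singularities

def safe_log(x: float) -> float:
    """log with its argument clamped away from zero and negatives."""
    return math.log(max(x, EPS))

def safe_div(num: float, den: float) -> float:
    """division with the denominator kept away from zero."""
    return num / (den if abs(den) > EPS else math.copysign(EPS, den or 1.0))

# Once a NaN is produced, it poisons everything downstream:
nan = float("nan")
assert math.isnan(nan + 1.0) and math.isnan(nan * 0.0)

# Clamped variants stay finite even on degenerate inputs:
assert math.isfinite(safe_log(0.0))
assert math.isfinite(safe_div(1.0, 0.0))
```

The same idea applies to tensor code: clamp scales, opacities, and loss denominators at the point where they are computed, rather than patching NaNs once they have already spread.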