RuntimeError: merge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

Question


hlyyyyy opened this issue 8 months ago · comments

Hi, first of all, thanks for the great work on spconv. When I train CenterPoint with spconv, I get the error below:
`
Traceback (most recent call last):
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./tools/train.py", line 274, in <module>
main()
File "./tools/train.py", line 263, in main
train_model(
File "/work/mmdetection3d/mmdet3d/apis/train.py", line 346, in train_model
train_detector(
File "/work/mmdetection3d/mmdet3d/apis/train.py", line 321, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/work/mmdetection3d/mmdet3d/models/detectors/base.py", line 60, in forward
return self.forward_train(**kwargs)
File "/work/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 273, in forward_train
img_feats, pts_feats = self.extract_feat(
File "/work/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
File "/work/mmdetection3d/mmdet3d/models/detectors/centerpoint.py", line 50, in extract_pts_feat
x = self.pts_middle_encoder(voxel_features, coors, batch_size)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/work/mmdetection3d/mmdet3d/models/middle_encoders/sparse_encoder.py", line 131, in forward
x = encoder_layer(x)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/spconv/pytorch/modules.py", line 138, in forward
input = module(input)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/work/mmdetection3d/mmdet3d/ops/sparse_block.py", line 125, in forward
out = self.conv2(out)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 755, in forward
return self._conv_forward(self.training,
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 402, in _conv_forward
msg += f"indices={indices.shape},bs={batch_size},ss={spatial_shape},"
File "/work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/_tensor.py", line 659, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fea6fd001ee in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x26e61 (0x7fea977a5e61 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7fea977aadb7 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x466858 (0x7feabeffc858 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fea6fce77a5 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x362735 (0x7feabeef8735 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x67c6c8 (0x7feabf2126c8 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7feabf212a95 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x10be3b (0x55961048be3b in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #9: + 0x116888 (0x559610496888 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #10: + 0x1293aa (0x5596104a93aa in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #11: + 0x1293aa (0x5596104a93aa in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #12: + 0x123fb8 (0x5596104a3fb8 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #13: + 0x136a8f (0x5596104b6a8f in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #14: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #15: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #16: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #17: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #18: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #19: + 0x13695c (0x5596104b695c in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #20: + 0x10aa49 (0x55961048aa49 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #21: PyDict_SetItemString + 0x4a (0x55961055581a in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #22: PyImport_Cleanup + 0xa4 (0x55961057ba54 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #23: Py_FinalizeEx + 0x7a (0x55961057ab0a in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #24: Py_RunMain + 0x112 (0x559610576632 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #25: Py_BytesMain + 0x39 (0x55961054e219 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
frame #26: __libc_start_main + 0xf3 (0x7feae4a5f0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1ce125 (0x55961054e125 in /work/fusion_3dod-dev-dino_deformable_transfusion/venv/bin/python)
`

my environment:
CUDA: 11.6
python: 3.8.13
mmdetection3d: 1.0.0rc4
spconv: spconv-cu116 with pip install
cumm: cumm-cu116 with pip install
GPU: A800

my config:
point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
voxel_size = [0.075, 0.075, 0.2]
pts_middle_encoder=dict(
    type='SparseEncoder',
    in_channels=4,
    sparse_shape=[41, 1440, 1440],
    output_channels=128,
    order=('conv', 'norm', 'act'),
    encoder_channels=((16, 16, 32), (32, 32, 64), (64, 64, 128), (128, 128)),
    encoder_paddings=((0, 0, 1), (0, 0, 1), (0, 0, [0, 1, 1]), (0, 0)),
    block_type='basicblock'),

I'm sure that I have assigned the right sparse shape, but I still encounter the error. Looking forward to your reply, thanks!
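
For anyone who wants to double-check the sparse shape, here is a minimal sketch in plain Python. It assumes the common mmdetection3d convention that `sparse_shape` is `[z + 1, y, x]`, with each axis computed as range / voxel size, and that `coors` rows are laid out as `[batch, z, y, x]`. The helper names `expected_sparse_shape` and `out_of_range_rows` are hypothetical and are not part of spconv or mmdetection3d. Voxel coordinates outside the declared shape are one possible cause of an illegal memory access, though not necessarily the cause here.

```python
def expected_sparse_shape(point_cloud_range, voxel_size):
    """Return [z + 1, y, x] grid size (assumed mmdet3d-style convention)."""
    x_min, y_min, z_min, x_max, y_max, z_max = point_cloud_range
    vx, vy, vz = voxel_size
    # round() absorbs floating-point noise such as 1440.0000000000002
    nx = round((x_max - x_min) / vx)
    ny = round((y_max - y_min) / vy)
    nz = round((z_max - z_min) / vz)
    return [nz + 1, ny, nx]


def out_of_range_rows(coors, sparse_shape, batch_size):
    """Return the row numbers of coors ([batch, z, y, x]) that fall outside the grid."""
    bad = []
    for i, (b, z, y, x) in enumerate(coors):
        in_batch = 0 <= b < batch_size
        in_grid = all(0 <= c < s for c, s in zip((z, y, x), sparse_shape))
        if not (in_batch and in_grid):
            bad.append(i)
    return bad


# Values taken from the config above.
shape = expected_sparse_shape(
    [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0], [0.075, 0.075, 0.2]
)
print(shape)  # [41, 1440, 1440]
```

With the config above, the computed shape matches `sparse_shape=[41, 1440, 1440]`, so the shape itself looks consistent. To check the actual data, `coors.tolist()` could be passed to `out_of_range_rows` just before the `pts_middle_encoder` call; an empty result would point away from bad voxel indices.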

XinhaoT · Answer 1 · Thu Oct 19 2023 14:31:58 GMT+0800 (China Standard Time)

same problem

Donglin Yang · Answer 2 · Mon Jan 29 2024 19:08:10 GMT+0800 (China Standard Time)

I got the same problem
[Exception|implicit_gemm_pair]indices=torch.Size([7204, 3]),bs=10,ss=[32, 32],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3],stride=[1, 1],padding=[0, 0],dilation=[1, 1],subm=True,transpose=False
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Traceback (most recent call last):
File "/fs01/home/ydlin718/QuadtreeDiffusion/test_sparse.py", line 78, in <module>
output = sparsemodel(input, tt, masks, lq=lq)
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/fs01/home/ydlin718/QuadtreeDiffusion/model/unet.py", line 1332, in forward
h = module(h, emb, masks[h.shape[-1]], hin)
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/fs01/home/ydlin718/QuadtreeDiffusion/model/unet_util.py", line 69, in forward
x = layer(x, emb, mask, h)
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/fs01/home/ydlin718/QuadtreeDiffusion/model/unet_util.py", line 673, in forward
output = self.res_block(output, emb)
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/fs01/home/ydlin718/QuadtreeDiffusion/model/unet_util.py", line 449, in forward
h = self.in_conv(self.in_rest(x))
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 755, in forward
return self._conv_forward(self.training,
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 408, in _conv_forward
raise e
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/h/ydlin718/miniconda3/envs/ResShift/lib/python3.9/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed on 2nd step: cudaErrorECCUncorrectable: uncorrectable ECC error encountered
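
The log above mentions `SPCONV_DEBUG_SAVE_PATH`. A minimal sketch of setting it from Python follows, together with `CUDA_LAUNCH_BLOCKING=1`, which makes CUDA kernel launches synchronous so the Python traceback points closer to the failing call. The helper name `enable_spconv_debug` is hypothetical. Two things here are assumptions: that the variable holds a file path rather than a directory (the message does not say), and that it has to be set before `spconv` is imported.

```python
import os
import tempfile


def enable_spconv_debug(save_path):
    """Set debug-related environment variables; call before importing torch/spconv."""
    # Assumption: SPCONV_DEBUG_SAVE_PATH is a file path, so create its parent directory.
    parent = os.path.dirname(save_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    os.environ["SPCONV_DEBUG_SAVE_PATH"] = save_path
    # Synchronous kernel launches: slower, but errors surface at the offending call.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


# Example: write the debug dump under the system temp directory.
enable_spconv_debug(os.path.join(tempfile.gettempdir(), "spconv_debug", "indices.pkl"))
```

The same variables can also be exported in the shell before launching the training script. Note that the two tracebacks in this thread report different CUDA errors (`cudaErrorIllegalAddress` and `cudaErrorECCUncorrectable`), so the saved debug data may not point to the same cause in both cases.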