MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.

ERROR while preprocessing

karon999 opened this issue

I have trained the toy dataset successfully, but when I want to train my own dataset, a weird error occurs. I have already run full_check on the dataset and it is okay. Thanks a lot in advance for any help.

Here is the error that occurs during preprocessing:
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
2024-02-23 12:47:28.962 | INFO | nndet.planning.estimator:estimate:123 - Found available gpu memory: 16919691264 bytes / 16135.875 mb and estimating for 11511726080 bytes / 10978.4375
2024-02-23 12:47:29.058 | INFO | nndet.planning.estimator:_estimate_mem_available:154 - Estimating in memory.
2024-02-23 12:47:29.058 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 224, 192] and batch size 4 and num_instances 5
2024-02-23 12:47:36.843 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): CUDA out of memory. Tried to allocate 488.00 MiB (GPU 0; 15.90 GiB total capacity; 13.86 GiB already allocated; 231.75 MiB free; 14.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-02-23 12:47:37.252 | INFO | nndet.planning.estimator:measure:256 - Measured: 0.0 mb empty, inf mb fixed, inf mb dynamic
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:362 - Architecture overwrites: {} Anchor overwrites: {}
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:364 - Building architecture according to plan of not_found
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:367 - Start channels: 32; head channels: 128; fpn channels: 128
2024-02-23 12:47:37.395 | INFO | nndet.core.boxes.anchors:init:288 - Discarding anchor generator kwargs {'stride': 1}
2024-02-23 12:47:37.395 | INFO | nndet.ptmodule.retinaunet.base:_build_encoder:464 - Building:: encoder Encoder: {}
2024-02-23 12:47:37.548 | INFO | nndet.ptmodule.retinaunet.base:_build_decoder:496 - Building:: decoder UFPNModular: {'min_out_channels': 8, 'upsampling_mode': 'transpose', 'num_lateral': 1, 'norm_lateral': False, 'activation_lateral': False, 'num_out': 1, 'norm_out': False, 'activation_out': False}
2024-02-23 12:47:37.575 | INFO | nndet.core.boxes.matcher.atss:init:45 - Running ATSS Matching with num_candidates=4 and center_in_gt False.
2024-02-23 12:47:37.575 | INFO | nndet.ptmodule.retinaunet.base:_build_head_classifier:530 - Building:: classifier BCECLassifier: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'mean', 'loss_weight': 1.0, 'prior_prob': 0.01}
2024-02-23 12:47:37.585 | INFO | nndet.arch.heads.classifier:init_weights:215 - Init classifier weights: prior prob 0.01
2024-02-23 12:47:37.593 | INFO | nndet.ptmodule.retinaunet.base:_build_head_regressor:564 - Building:: regressor GIoURegressor: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'sum', 'loss_weight': 1.0, 'learn_scale': True}
2024-02-23 12:47:37.604 | INFO | nndet.arch.heads.regressor:build_scales:150 - Learning level specific scalar in regressor
2024-02-23 12:47:37.605 | INFO | nndet.arch.heads.regressor:init_weights:196 - Overwriting regressor conv weight init
2024-02-23 12:47:37.615 | INFO | nndet.ptmodule.retinaunet.base:_build_head:602 - Building:: head DetectionHeadHNMNative: {} sampler HardNegativeSamplerBatched: {'batch_size_per_image': 32, 'positive_fraction': 0.33, 'pool_size': 20, 'min_neg': 1}
2024-02-23 12:47:37.615 | INFO | nndet.core.boxes.sampler:init:235 - Sampling hard negatives on a per batch basis
2024-02-23 12:47:37.615 | INFO | nndet.ptmodule.retinaunet.base:_build_segmenter:638 - Building:: segmenter DiCESegmenterFgBg {'dice_kwargs': {'batch_dice': True, 'smooth_nom': 1e-05, 'smooth_denom': 1e-05, 'do_bg': False}}
2024-02-23 12:47:37.616 | INFO | nndet.losses.segmentation:init:108 - Running batch dice True and do bg False in dice loss.
2024-02-23 12:47:37.616 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:421 - Model Inference Summary:
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
2024-02-23 12:47:37.724 | INFO | nndet.planning.estimator:estimate:123 - Found available gpu memory: 16919691264 bytes / 16135.875 mb and estimating for 11511726080 bytes / 10978.4375
2024-02-23 12:47:37.805 | INFO | nndet.planning.estimator:_estimate_mem_available:154 - Estimating in memory.
2024-02-23 12:47:37.805 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 192, 192] and batch size 4 and num_instances 5
2024-02-23 12:48:07.135 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 64, 192, 192], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=[1, 3, 3], padding=[0, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = true
allow_tf32 = true
input: TensorDescriptor 0x7f1bf00f32f0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 64, 192, 192,
strideA = 75497472, 2359296, 36864, 192, 1,
output: TensorDescriptor 0x7f1bf00f53e0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 32, 64, 192, 192,
strideA = 75497472, 2359296, 36864, 192, 1,
weight: FilterDescriptor 0x7f1bf00f5c10
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 32, 32, 1, 3, 3,
Pointer addresses:
input: 0x7f1998e00000
output: 0x7f17f8000000
weight: 0x7f18d47fa000
Additional pointer addresses:
grad_output: 0x7f17f8000000
grad_input: 0x7f1998e00000
Backward data algorithm: 3

Here is the traceback information:
Traceback (most recent call last):
File "/home/wangpeiyu/anaconda3/envs/nndetection/bin/nndet_prep", line 33, in
sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_prep')())
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/utils/check.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 418, in main
run(OmegaConf.to_container(cfg, resolve=True),
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 347, in run
run_planning_and_process(
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 174, in run_planning_and_process
plan_identifiers = planner.plan_experiment(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/v001.py", line 43, in plan_experiment
plan_3d = self.plan_base_stage(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/base.py", line 234, in plan_base_stage
architecture_plan = architecture_planner.plan(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 127, in plan
res = super().plan(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/base.py", line 346, in plan
patch_size = self._plan_architecture(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 205, in _plan_architecture
_, fits_in_mem = self.estimator.estimate(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 128, in estimate
res = self._estimate_mem_available(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 155, in _estimate_mem_available
fixed, dynamic = self.measure(shape=target_shape,
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 253, in measure
network.cpu()
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in cpu
return self._apply(lambda t: t.cpu())
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 597, in _apply
param_applied = fn(param)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

And the following is the output of nndet_env:
----- PyTorch Information -----
PyTorch Version: 1.10.1+cu111
PyTorch Debug: False
PyTorch CUDA: 11.1
PyTorch Backend cudnn: 8005
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (6, 0)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

System Arch List: None
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 32
Python Version: 3.9.0 (default, Nov 15 2020, 14:28:56)
[GCC 7.3.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set True
det_models is set True

Dear @karon999 ,

Can you start it with CUDA_LAUNCH_BLOCKING=1 or check for any other inconsistencies? The error does not really show what is going wrong right now.
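
For reference, the variable can simply be prepended to the preprocessing call; the task id below is just a placeholder:

CUDA_LAUNCH_BLOCKING=1 nndet_prep <task id>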

Best,
Michael

After adding CUDA_LAUNCH_BLOCKING=1, this is the error now:
2024-02-29 03:40:25.625 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 192, 192] and batch size 4 and num_instances 5
2024-02-29 03:40:55.235 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 64, 48, 48], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(128, 128, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = true
allow_tf32 = true
input: TensorDescriptor 0x7ff0bc040170
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 128, 64, 48, 48,
strideA = 18874368, 147456, 2304, 48, 1,
output: TensorDescriptor 0x7ff0bc0088a0
type = CUDNN_DATA_HALF
nbDims = 5
dimA = 4, 128, 64, 48, 48,
strideA = 18874368, 147456, 2304, 48, 1,
weight: FilterDescriptor 0x564c610a2de0
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 128, 128, 3, 3, 3,
Pointer addresses:
input: 0x7fee27800000
output: 0x7fee1e800000
weight: 0x7ff034ed8000
Additional pointer addresses:
grad_output: 0x7fee1e800000
grad_weight: 0x7ff034ed8000
Backward filter algorithm: 1

And this is the traceback:
Traceback (most recent call last):
File "/home/wangpeiyu/anaconda3/envs/nndetection/bin/nndet_prep", line 33, in
sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_prep')())
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/utils/check.py", line 62, in wrapper
return func(*args, **kwargs)
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 423, in main
run(OmegaConf.to_container(cfg, resolve=True),
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 352, in run
run_planning_and_process(
File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 179, in run_planning_and_process
plan_identifiers = planner.plan_experiment(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/v001.py", line 43, in plan_experiment
plan_3d = self.plan_base_stage(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/base.py", line 234, in plan_base_stage
architecture_plan = architecture_planner.plan(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 127, in plan
res = super().plan(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/base.py", line 346, in plan
patch_size = self._plan_architecture(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 205, in _plan_architecture
_, fits_in_mem = self.estimator.estimate(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 128, in estimate
res = self._estimate_mem_available(
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 155, in _estimate_mem_available
fixed, dynamic = self.measure(shape=target_shape,
File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 253, in measure
network.cpu()
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in cpu
return self._apply(lambda t: t.cpu())
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 597, in _apply
param_applied = fn(param)
File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered

I tried to generalise the error handling a little bit. Can you please pull the latest GitHub version and try again (make sure that you have nnDetection installed with the -e option so that the changes also take effect).
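
For reference, assuming the repository was downloaded to a local folder as in the paths above, an editable install would look roughly like the following (additional flags from the nnDetection installation instructions, e.g. for compiling the CUDA extensions, may still be required):

cd nnDetection-main
pip install -e .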

Okay, I will download the latest nnDetection and give it a try.

Thanks for the suggestion, I've updated to the latest version. But something weird happens: when I select about 50 nii images from the entire dataset to put into ImagesTr, the preprocessing works fine. However, when around 100 images are put in, the error (RuntimeError: CUDA error: an illegal memory access was encountered) still occurs. I've looked into it, and it could be because the batch size or image size is too big for the memory (I'm using a GPU with 16 GB of memory). Do you think this is a valid reason? If so, how can I modify it? Thanks again for your help; I would really appreciate it if you could help solve this problem.
Best,
Karon

Or would it be possible to preprocess a portion of the dataset at a time and combine the results at the end? If so, which files would I need to change?

Sorry for the delay, I needed to take care of several deadlines.

Unfortunately it is not possible to run the preprocessing on parts of the dataset since the properties need to be extracted from the entire dataset first.

There is a loop which iteratively reduces the patch size until it fits into memory. For some reason, the error raised during this process (the Out of Memory error) is not caught correctly and thus the whole program crashes. One solution that comes to mind would be to modify these lines

def _get_initial_patch_size(target_spacing_transposed: np.ndarray, ...

manually and reduce the patch size to something that is slightly larger than the final patch size.
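
To illustrate the idea only (this is not the actual nnDetection implementation; the cap below is a placeholder that has to be tuned per dataset), the proposed patch size could be clamped element-wise to a manually chosen upper bound:

import numpy as np

# Hypothetical sketch: cap the patch size proposed by _get_initial_patch_size
# at a manually chosen value that is slightly larger than the expected final
# patch size. MANUAL_CAP is a placeholder and must be adapted to the dataset.
MANUAL_CAP = np.array([64, 224, 192])

def cap_initial_patch_size(initial_patch_size) -> np.ndarray:
    """Return the element-wise minimum of the proposed and the manual patch size."""
    return np.minimum(np.asarray(initial_patch_size), MANUAL_CAP)

# Example: a proposal of [96, 320, 320] would be reduced to [64, 224, 192].
print(cap_initial_patch_size([96, 320, 320]))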

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 14 days since being marked as stale.