MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.

[Question] Training doesn't start correctly

salusanga opened this issue

Hi,
I work on a SLURM cluster. When I launch the training in an interactive job with srun, it starts correctly, including with multiprocessing. When I submit the same job via sbatch (same GPUs, A6000 or A100), it prints the following:

"RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements
bypassing sigterm"

I already explicitly set OMP_NUM_THREADS=1.

I do not get the error if I also set the flag -o augment_cfg.multiprocessing=False when submitting with sbatch, so it seems to be related to multiprocessing.
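
Since the same command behaves differently under the two launch modes, it can help to run a small script in both and compare the output. The following is a minimal sketch, not part of nnDetection: the helper name environment_snapshot is hypothetical, the SLURM_* names are standard SLURM environment variables, and it assumes det_num_threads is readable from the environment, as the nndet_env output suggests.

```python
import json
import os

# Settings worth comparing between an interactive job and a batch job.
# det_num_threads is assumed to be an environment variable here.
KEYS = (
    "SLURM_JOB_ID",
    "SLURM_CPUS_PER_TASK",
    "SLURM_CPUS_ON_NODE",
    "OMP_NUM_THREADS",
    "det_num_threads",
)


def environment_snapshot(environ=os.environ):
    """Collect values that may differ between the two launch modes."""
    snapshot = {key: environ.get(key) for key in KEYS}
    # CPUs this process is actually allowed to run on (Linux only).
    if hasattr(os, "sched_getaffinity"):
        snapshot["usable_cpus"] = len(os.sched_getaffinity(0))
    else:
        snapshot["usable_cpus"] = os.cpu_count()
    # Machine-wide CPU count, which can exceed the job's allocation.
    snapshot["total_cpus"] = os.cpu_count()
    return snapshot


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2, sort_keys=True))
```

Running it once inside the interactive job and once inside the batch job, then diffing the two outputs, shows whether the jobs really see the same CPU allocation.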

nndet_env gives:

PyTorch Version: 1.10.2
PyTorch Debug: False
PyTorch CUDA: 11.3
PyTorch Backend cudnn: 8200
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37>
PyTorch Current Device Capability: (8, 6)
PyTorch CUDA available: True


----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

System Arch List: 8.0 8.6
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 128
Python Version: 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0]


----- nnDetection Information -----
det_num_threads 12
det_data is set True
det_models is set True

Thanks!

UPDATE: the problem was somehow related to det_num_threads=12; setting it to 10 works.
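
One possible explanation, not confirmed in this thread, is that the batch job was allowed fewer CPU cores than the interactive one, so 12 worker processes no longer fit. The sketch below picks a worker count from the cores the process may actually use. The function names and the rule of reserving one core for the main process are my own assumptions, not nnDetection behaviour, and it assumes det_num_threads is read from the environment.

```python
import os


def usable_cpus():
    """Number of CPUs this process may run on, not the machine total."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def fit_worker_count(requested, cpus, reserve=1):
    """Cap the requested worker count to the CPUs left after a reserve.

    `reserve` (cores kept free for the main training process) is a
    hypothetical heuristic chosen for this example.
    """
    budget = max(1, cpus - reserve)
    return min(requested, budget)


if __name__ == "__main__":
    # The default of 12 mirrors the value reported by nndet_env above.
    requested = int(os.environ.get("det_num_threads", "12"))
    cpus = usable_cpus()
    workers = fit_worker_count(requested, cpus)
    print(f"requested={requested} usable_cpus={cpus} suggested={workers}")
```

If the suggested value printed inside the batch job is lower than the requested one, requesting more cores for the job or lowering det_num_threads would be the two things to try.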