[Question] Training doesn't start correctly
salusanga opened this issue
Hi,
I work on a SLURM cluster. When I launch the training in an interactive job via SRUN, it starts correctly, including with multiprocessing. When I submit the same job via SBATCH (same GPUs, A6000 or A100), it prints the following:
"RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements
bypassing sigterm"
I already set OMP_NUM_THREADS=1 explicitly.
I do not get the error if I also pass the flag -o augment_cfg.multiprocessing=False when submitting with SBATCH, so it seems to be related to multiprocessing.
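For context, this is roughly how I submit the job (a minimal sketch, not my exact script: the resource values, environment name, and task id are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=nndet_train
#SBATCH --gres=gpu:1              # same behaviour on A6000 and A100 nodes
#SBATCH --cpus-per-task=16        # placeholder value
#SBATCH --time=48:00:00

source activate nndet             # placeholder environment name
export OMP_NUM_THREADS=1

# with multiprocessing enabled (default) this fails under SBATCH with the
# RuntimeError above; adding the override makes it run:
nndet_train 000 -o augment_cfg.multiprocessing=False   # task id is a placeholder
```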
nndet_env gives:
PyTorch Version: 1.10.2
PyTorch Debug: False
PyTorch CUDA: 11.3
PyTorch Backend cudnn: 8200
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_37']
PyTorch Current Device Capability: (8, 6)
PyTorch CUDA available: True
----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
System Arch List: 8.0 8.6
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 128
Python Version: 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0]
----- nnDetection Information -----
det_num_threads 12
det_data is set True
det_models is set True
Thanks!
UPDATE: the problem was somehow related to det_num_threads=12; setting it to 10 works.
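For anyone hitting the same crash, a minimal sketch of the change in the SBATCH script (CPU count and task id are placeholders; the only point is that det_num_threads=12 triggered the background-worker crash here while 10 did not):

```bash
#SBATCH --cpus-per-task=16        # placeholder; whatever your allocation is

export OMP_NUM_THREADS=1
export det_num_threads=10         # 12 crashed the background workers, 10 works

nndet_train 000                   # task id is a placeholder
```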