p2ch12 (training.py)- Training stops without error

Question

opened this issue 3 years ago · comments

When I run "python -m training --balanced --epochs 11", the training process shuts down automatically at epoch 3 without an error message. I have tried many times and always get the same result. It is confusing because nothing is reported.
Environments:
Conda: 1.9.12
PyTorch: 1.7.0
Cuda: 10.2
RAM: 32 GB
GPU: RTX 2080 Ti
I think this is the same issue as #17.

Ziyuan Huang · Answer 1 · Fri Feb 05 2021 17:01:51 GMT+0800 (China Standard Time)

Same issue for me. I ran the code in a Jupyter notebook, and it reported that the kernel had died. When I run the training in cmd, there are no error messages at all. The same issue was raised earlier in #17 but was not answered.

I set the number of training epochs to 20, but training always stops at epoch 2.

Navpreet Singh · Answer 2 · Sat Feb 06 2021 07:57:53 GMT+0800 (China Standard Time)

@Russell-Chang @melhzy Hello, I created issue #17. I was able to resolve it by reducing the number of workers and the batch size. The issue is caused by running out of memory. It doesn't show up clearly in Task Manager, but that is what is happening. If you want to look, you can click the 3D/Copy button in the Performance tab of Task Manager and select Cuda.
You can experiment with the batch size and number of workers to go as high as you can without crashing. This will depend on the capability of your system. I used a batch size of 32 with 4 workers, and the training completed for me, although it took some hours.
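
The trial-and-error search described above could be scripted roughly as follows. This is only a sketch: `first_surviving_config` and `training_cmd` are hypothetical helper names, and the `--batch-size` and `--num-workers` flags are an assumption about what training.py accepts, so check its argument parser first. It also assumes that a silently killed process still exits with a nonzero code.

```python
import subprocess
import sys


def first_surviving_config(build_cmd, configs):
    """Try each (batch_size, num_workers) pair in order.

    Returns the first pair whose command exits with code 0,
    or None if every run fails.
    """
    for batch_size, num_workers in configs:
        cmd = build_cmd(batch_size, num_workers)
        result = subprocess.run(cmd)
        # The exit code is often the only trace a silent crash leaves behind.
        print(f"batch_size={batch_size} num_workers={num_workers} "
              f"-> exit code {result.returncode}", flush=True)
        if result.returncode == 0:
            return batch_size, num_workers
    return None


def training_cmd(batch_size, num_workers):
    # ASSUMPTION: training.py accepts --batch-size and --num-workers.
    # Adjust the flag names to whatever the script actually defines.
    return [
        sys.executable, "-m", "training", "--balanced", "--epochs", "11",
        f"--batch-size={batch_size}",
        f"--num-workers={num_workers}",
    ]
```

Calling `first_surviving_config(training_cmd, [(64, 8), (32, 4), (16, 2)])` from the directory that contains training.py would try the largest setting first and stop at the first one that finishes. Each attempt is a full training run, so it can take hours.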

Ziyuan Huang · Answer 3 · Sat Feb 06 2021 09:17:37 GMT+0800 (China Standard Time)

Thanks, @navpreetnp7. I see that my CUDA usage stays at 99% while training. Do you mean that if the algorithm pushes CUDA past 100%, the Python kernel will be forced to stop?

EliottGDFY · Answer 4 · Wed Aug 23 2023 16:53:46 GMT+0800 (China Standard Time)

You can check whether the error comes from your dataloader picking up hidden files such as .ipynb_checkpoints. Try writing a script that loops over your dataloader and see whether it crashes.
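
A minimal sketch of such a check is shown below. The helper names `find_hidden_entries` and `find_failing_indices` are hypothetical. The second helper indexes the dataset directly instead of going through a DataLoader, so each failure can be tied to a single sample. It assumes a map-style dataset that supports `len()` and integer indexing. It can only report Python exceptions: if the operating system kills the process outright, the last index printed is the only clue.

```python
import os


def find_hidden_entries(root):
    """Return sorted paths under root whose file or directory name starts with a dot."""
    hidden = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name.startswith("."):
                hidden.append(os.path.join(dirpath, name))
    return sorted(hidden)


def find_failing_indices(dataset, verbose=False):
    """Load every sample once and collect (index, error) for each one that raises."""
    failures = []
    for i in range(len(dataset)):
        if verbose:
            # Flush so the last index is visible even if the process dies.
            print(f"loading sample {i}", flush=True)
        try:
            dataset[i]
        except Exception as exc:
            failures.append((i, repr(exc)))
    return failures
```

Running `find_hidden_entries` on the data directory first is cheap. `find_failing_indices` loads every sample, so it may take a while on a large dataset.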