deep-learning-with-pytorch / dlwpt-code

Code for the book Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann.

Home Page: https://www.manning.com/books/deep-learning-with-pytorch

p2ch12 (training.py) - Training stops without error

When I run "python -m training --balanced --epochs 11", the training process shuts down on its own during epoch 3 without any error message. I have tried many times and always get the same result, which is confusing because nothing is reported.
Environment:
Conda: 1.9.12
PyTorch: 1.7.0
CUDA: 10.2
RAM: 32 GB
GPU: RTX 2080 Ti
I think I am hitting the same issue as #17.

Same issue for me. I ran the code in a Jupyter notebook, and it reported that the kernel had died. When I run the training from cmd, there are no error messages at all. The same problem was reported earlier in #17 but was never answered.

I set the number of training epochs to 20, but training always stops at epoch 2.

@Russell-Chang @melhzy Hello, I created issue #17. I was able to resolve it by reducing the number of workers and the batch size. The issue is caused by running out of memory. Task Manager doesn't show it exactly, but that is what's happening. If you want to check, go to the Performance tab in Task Manager, click the 3D/Copy button, and select Cuda.
You can experiment with the batch size and number of workers to go as high as you can without crashing. How high that is will depend on the capability of your system. I used a batch size of 32 with 4 workers, and the training completed for me, although it took some hours.
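The trial-and-error search described above could be scripted roughly as follows. This is only a sketch: it assumes the training script accepts `--batch-size` and `--num-workers` flags (check "python -m training --help" to confirm), that a silent crash shows up as a non-zero exit code, and the helper names `build_command`, `run_ok`, and `find_stable_config` are made up for illustration.

```python
import subprocess
import sys


def build_command(batch_size, num_workers, epochs=11):
    """Build the training command line for one configuration.

    The --batch-size and --num-workers flag names are assumptions;
    verify them against the script's --help output.
    """
    return [
        sys.executable, "-m", "training", "--balanced",
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--num-workers", str(num_workers),
    ]


def run_ok(command):
    """Run the command and return True if it exits with status 0."""
    return subprocess.run(command).returncode == 0


def find_stable_config(candidates, run=run_ok):
    """Try (batch_size, num_workers) pairs in order.

    Returns the first pair whose run completes, or None if all fail.
    """
    for batch_size, num_workers in candidates:
        if run(build_command(batch_size, num_workers)):
            return batch_size, num_workers
    return None
```

Listing the candidates from largest to smallest, for example `find_stable_config([(64, 8), (32, 8), (32, 4)])`, returns the most demanding configuration that survives. Each attempt is a full training run, so this can take many hours.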

Thanks, @navpreetnp7. I see my CUDA usage stays at 99% while training. Do you mean that if the algorithm pushes CUDA usage past 100%, the Python kernel is forced to stop?

You can test whether the error comes from your dataloader picking up hidden files such as .ipynb_checkpoints. Try writing a script that loops over your dataloader and see whether it crashes.
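A minimal sketch of such a check is below. It assumes the dataloader is any Python iterable (a PyTorch `DataLoader` would qualify) and that the data lives under a single root directory. The function names `find_hidden_entries` and `first_failing_batch` are hypothetical, not part of the book's code.

```python
from pathlib import Path


def find_hidden_entries(root):
    """Return every path under root that has a dot-prefixed component,
    such as .ipynb_checkpoints or files inside it."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if any(part.startswith(".") for part in p.relative_to(root).parts)
    )


def first_failing_batch(loader, verbose=False):
    """Iterate over the loader once.

    Returns (index, exception) for the first batch that raises,
    or None if every batch loads.
    """
    iterator = iter(loader)
    index = 0
    while True:
        try:
            next(iterator)
        except StopIteration:
            return None
        except Exception as exc:
            return index, exc
        if verbose:
            # Flush so the last good index is visible even if the
            # process is killed outright on the next batch.
            print(f"batch {index} ok", flush=True)
        index += 1
```

If the process dies without raising a Python exception, as in the reports above, `first_failing_batch` cannot return anything. In that case, running it with `verbose=True` at least shows the last batch that loaded successfully.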