The training process will stop unexpectedly
5huanghuai opened this issue
shanfenglantu commented
Bug description
It seems to be caused by a callback or logger writing records from multiple processes?
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
Error messages and logs
Traceback (most recent call last):
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 737, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 735, in rmtree
    os.rmdir(path, dir_fd=dir_fd)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-dsg3ubii'
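For context on this traceback: Errno 39 (`ENOTEMPTY`) is what `os.rmdir` raises when `shutil.rmtree` tries to delete a directory that another process is still writing into, which is the classic cleanup race the finalizer above can hit. The sketch below is not from the report and not a Lightning API; `rmtree_with_retries` is a hypothetical helper illustrating one common workaround, retrying the delete a few times so stragglers can finish:

```python
import errno
import os
import shutil
import tempfile
import time


def rmtree_with_retries(path, retries=3, delay=0.5):
    """Remove a directory tree, retrying if it is briefly non-empty.

    ENOTEMPTY here usually means another process was still adding
    files while the tree was being deleted (a teardown race).
    """
    for attempt in range(retries):
        try:
            shutil.rmtree(path)
            return
        except OSError as exc:
            # Re-raise anything that is not the race we expect,
            # or if we have exhausted our retries.
            if exc.errno != errno.ENOTEMPTY or attempt == retries - 1:
                raise
            time.sleep(delay)


# Demo: a temp dir with a leftover file is still removed cleanly.
demo_dir = tempfile.mkdtemp(prefix="pymp-demo-")
open(os.path.join(demo_dir, "leftover.txt"), "w").close()
rmtree_with_retries(demo_dir)
print(os.path.exists(demo_dir))  # False
```

This only papers over the race; the underlying fix is making sure worker processes are fully joined before the parent's multiprocessing finalizers run.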
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
Adrian Wälchli commented
Hey @5huanghuai
This report is too generic for us to help. Would you mind filling out the requested section describing how to reproduce this?