Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

The training process stops unexpectedly

5huanghuai opened this issue

Bug description

It seems to be caused by a callback or logger writing records from multiple processes?
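If that suspicion is right and a custom callback or logger writes files from every distributed process, restricting those writes to the global rank-zero process is a common mitigation. A minimal sketch, assuming a user-defined callback (the `SaveMetricsCallback` name and the output path are hypothetical, not taken from this report):

```python
import json

from lightning.pytorch.callbacks import Callback
from lightning.pytorch.utilities import rank_zero_only


class SaveMetricsCallback(Callback):
    """Hypothetical callback that dumps metrics to disk at the end of each epoch."""

    @rank_zero_only
    def on_train_epoch_end(self, trainer, pl_module):
        # rank_zero_only ensures only the global rank-zero process writes the file,
        # so multiple ranks do not race on the same path during teardown.
        metrics = {k: float(v) for k, v in trainer.callback_metrics.items()}
        with open("metrics.json", "w") as f:
            json.dump(metrics, f)
```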

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

Traceback (most recent call last):
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 737, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 735, in rmtree
    os.rmdir(path, dir_fd=dir_fd)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-dsg3ubii'
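For context, the traceback comes from Python's own multiprocessing module: at interpreter shutdown a finalizer tries to delete its temporary directory (/tmp/pymp-*) and finds it non-empty, which usually means a child process (a spawn-launched rank or a DataLoader worker) was still using it. A hedged workaround sketch, assuming the spawn-based launcher is the source of that directory (an assumption; it could equally come from DataLoader workers):

```python
import lightning as L

# Sketch only: strategy="ddp" launches ranks as separate subprocesses instead of
# torch.multiprocessing.spawn, so the parent interpreter does not tear down a shared
# multiprocessing temp directory for them. Whether this applies depends on the
# original setup, which this report does not include.
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",  # instead of "ddp_spawn"
    max_epochs=1,
)
```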

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

Hey @5huanghuai
This report is too generic for us to help. Would you mind filling out the requested section describing how to reproduce this?
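For reference, a reproduction skeleton along these lines is usually what is needed, with the placeholder model and data replaced by whatever triggers the failure (everything below is a generic sketch, not the reporter's code):

```python
import torch
from torch.utils.data import DataLoader, Dataset

import lightning as L


class RandomDataset(Dataset):
    # Placeholder dataset; replace with the data pipeline that triggers the failure.
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(L.LightningModule):
    # Placeholder model; replace with the module, callbacks, and loggers in use.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=32, num_workers=2)
    trainer = L.Trainer(max_epochs=1, devices=2, accelerator="auto", strategy="ddp_spawn")
    trainer.fit(model, train_loader)
```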