Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

The training process stops unexpectedly

5huanghuai opened this issue

Bug description

It seems to be caused by a callback or logger writing records from multiple processes?
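If that suspicion is right and a custom callback or logger writes files from every distributed process, restricting those writes to the global rank-zero process is a common mitigation. A minimal sketch, assuming a user-defined callback (the `SaveMetricsCallback` name and the output path are hypothetical, not taken from this report):

```python
import json

from lightning.pytorch.callbacks import Callback
from lightning.pytorch.utilities import rank_zero_only


class SaveMetricsCallback(Callback):
    """Hypothetical callback that dumps metrics to disk at the end of each epoch."""

    @rank_zero_only
    def on_train_epoch_end(self, trainer, pl_module):
        # rank_zero_only ensures only the global rank-zero process writes the file,
        # so multiple ranks do not race on the same path during teardown.
        metrics = {k: float(v) for k, v in trainer.callback_metrics.items()}
        with open("metrics.json", "w") as f:
            json.dump(metrics, f)
```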

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

Traceback (most recent call last):
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/username/.conda/envs/envname/lib/python3.11/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 737, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/username/.conda/envs/envname/lib/python3.11/shutil.py", line 735, in rmtree
    os.rmdir(path, dir_fd=dir_fd)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-dsg3ubii'
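For context, the traceback comes from Python's own multiprocessing module: at interpreter shutdown a finalizer tries to delete its temporary directory (/tmp/pymp-*) and finds it non-empty, which usually means a child process (a spawn-launched rank or a DataLoader worker) was still using it. A hedged workaround sketch, assuming the spawn-based launcher is the source of that directory (an assumption; it could equally come from DataLoader workers):

```python
import lightning as L

# Sketch only: strategy="ddp" launches ranks as separate subprocesses instead of
# torch.multiprocessing.spawn, so the parent interpreter does not tear down a shared
# multiprocessing temp directory for them. Whether this applies depends on the
# original setup, which this report does not include.
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",  # instead of "ddp_spawn"
    max_epochs=1,
)
```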

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

Hey @5huanghuai
This report is too generic for us to help. Would you mind filling out the requested section describing how to reproduce this?
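For reference, a reproduction skeleton along these lines is usually what is needed, with the placeholder model and data replaced by whatever triggers the failure (everything below is a generic sketch, not the reporter's code):

```python
import torch
from torch.utils.data import DataLoader, Dataset

import lightning as L


class RandomDataset(Dataset):
    # Placeholder dataset; replace with the data pipeline that triggers the failure.
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(L.LightningModule):
    # Placeholder model; replace with the module, callbacks, and loggers in use.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=32, num_workers=2)
    trainer = L.Trainer(max_epochs=1, devices=2, accelerator="auto", strategy="ddp_spawn")
    trainer.fit(model, train_loader)
```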