OSError when there are too many concurrent processes
siaimes opened this issue
Organization Name: fzu
Short summary about the issue/question:
When I start multiple PyTorch DDP jobs at the same time, most of the processes crash with high probability after running for several epochs, and the OSErrors are reported as follows:
[2022-03-27 08:57:41] ERROR: Uncaught exception:
Traceback (most recent call last):
  File "main.py", line 55, in run_epoch
    for i, data in enumerate(train_loader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
OSError: Caught OSError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/csip-091/TorchDomain/torchdomain/datasets/folder.py", line 90, in __getitem__
    return super(DomainFolder, self).__getitem__(idx) + self._get_domain(idx)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 308, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 232, in __getitem__
    sample = self.loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 269, in default_loader
    return pil_loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 249, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/mnt/share/ImageNet/train/n02488702/n02488702_98.JPEG'
It looks like the file is missing, but the file actually exists.
So I guess this may be caused by too many concurrent processes?
How can I solve this problem?
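As a stopgap on the application side (a sketch of my own, not something from this issue; the function and parameter names are hypothetical), the transient I/O errors could be retried in the dataset's loader before giving up:

```python
# Hypothetical workaround sketch: retry an I/O operation a few times before
# re-raising, since a soft NFS mount can surface transient
# "[Errno 5] Input/output error" under heavy concurrent access.
import functools
import time

def retry_on_oserror(fn, retries=3, delay=1.0):
    """Wrap fn so transient OSErrors are retried before being re-raised."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for attempt in range(retries):
            try:
                return fn(*args, **kwargs)
            except OSError:
                if attempt == retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)
    return wrapper
```

A wrapped loader could then be passed to the dataset, e.g. `ImageFolder(root, loader=retry_on_oserror(pil_loader))`, so each DataLoader worker retries instead of crashing on the first EIO.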
Brief what process you are following:
When there are too many concurrent processes accessing the same data set, an I/O error is reported.
How to reproduce it:
Run 5 jobs at the same time, each with 2 main processes and 6 dataloader processes.
In this way, a total of 5*2*6 = 60 processes access /mnt/share/ImageNet/* at the same time.
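The access pattern can be mimicked with a small sketch (the helper names are my own, purely for illustration): many worker processes each opening and reading the same shared file at once, roughly what the 60 DataLoader workers do against the NFS mount.

```python
# Hypothetical sketch of the access pattern: one pool process per simulated
# DataLoader worker, all reading the same path concurrently.
import multiprocessing as mp

def read_file(path):
    # Each worker opens the shared path independently, as a DataLoader
    # worker would in its own process.
    with open(path, "rb") as f:
        return len(f.read())

def concurrent_reads(path, n_procs):
    # Fan out n_procs simultaneous reads of the same file.
    with mp.Pool(n_procs) as pool:
        return pool.map(read_file, [path] * n_procs)
```

Pointing `concurrent_reads` at a file on the NFS mount with a large `n_procs` reproduces the concurrency level described above.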
OpenPAI Environment:
- OpenPAI version: v1.8.0
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Ubuntu 18.04.5 LTS
- Kernel (e.g. uname -a): Linux 4.15.0-151-generic #157-Ubuntu SMP Fri Jul 9 23:07:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Hardware (e.g. core number, memory size, storage size, GPU type etc.):
- Others:
Anything else we need to know:
The storage /mnt/share is exported by nfs-kernel-server on the host machine.
Any NFS related logs?
NFS has no logs by default, and it is not convenient for me to restart it in the production environment. The issue does not appear when I reduce the number of jobs accessing the same dataset to 2 (24 processes).
Here is my config for nfs-kernel-server:
csip@csip-091:~$ cat /etc/exports
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4 gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes gss/krb5i(rw,sync,no_subtree_check)
/home/data 172.17.175.0/255.255.255.0(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)
Here is my PV/PVC:
root@csip-dev-box-openpai092:/cluster-configuration/storage# cat share.yaml
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: share-pv
  labels:
    name: share
spec:
  capacity:
    storage: 30Ti
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
    - soft
    - retry=0
    - retrans=1
    - timeo=20
  nfs:
    path: /share
    server: 172.17.175.90
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: share
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 30Ti
  selector:
    matchLabels:
      name: share
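A note on the mount options above (my reading of the nfs(5) documentation, not a confirmed fix): `soft` combined with `timeo=20` (2 seconds, since timeo is in deciseconds), `retrans=1`, and `retry=0` makes the NFS client report an I/O error to the application after a single quick retransmission, so a briefly overloaded server surfaces as exactly this `[Errno 5]`. A more forgiving sketch of the `mountOptions` would be:

```
mountOptions:
  - nfsvers=4.1
  - hard       # block and keep retrying instead of returning EIO
  - timeo=600  # 60 s per retry; timeo is in deciseconds
  - retrans=2
```

With `hard`, reads stall and recover during server overload rather than failing, at the cost of processes blocking if the server goes away entirely.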