microsoft / pai

Resource scheduling and cluster management for AI

Home Page: https://openpai.readthedocs.io

OSError when there are too many concurrent processes

siaimes opened this issue · comments

Organization Name: fzu

Short summary about the issue/question:
When I start multiple PyTorch DDP jobs at the same time, most of the processes crash with high probability after running for several epochs, and an OSError is reported as follows:

[2022-03-27 08:57:41] ERROR: Uncaught exception:
Traceback (most recent call last):
  File "main.py", line 55, in run_epoch
    for i, data in enumerate(train_loader):

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()

  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
OSError: Caught OSError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/csip-091/TorchDomain/torchdomain/datasets/folder.py", line 90, in __getitem__
    return super(DomainFolder, self).__getitem__(idx) + self._get_domain(idx)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 308, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 232, in __getitem__
    sample = self.loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 269, in default_loader
    return pil_loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 249, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/mnt/share/ImageNet/train/n02488702/n02488702_98.JPEG'

It looks like the file is missing, but the file actually exists.

So I suspect it may be caused by too many concurrent processes?

How to solve this problem?
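
A possible client-side workaround would be to retry a sample a few times when a transient OSError is raised. The sketch below is only illustrative (RetryDataset and its parameters are hypothetical, not taken from my actual job code):

import time

from torch.utils.data import Dataset

class RetryDataset(Dataset):
    """Retry a sample a few times on a transient OSError (e.g. NFS EIO)."""

    def __init__(self, dataset, retries=3, delay=1.0):
        self.dataset = dataset
        self.retries = retries
        self.delay = delay

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        for attempt in range(self.retries + 1):
            try:
                return self.dataset[idx]
            except OSError:
                if attempt == self.retries:
                    raise  # give up after the last retry
                time.sleep(self.delay)  # give the NFS mount time to recover

Something like train_set = RetryDataset(train_set) before building the DataLoader would be enough to try this, but it only hides the error rather than explaining it.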

Brief what process you are following:
When too many concurrent processes access the same dataset, an I/O error is reported.

How to reproduce it:

Run 5 jobs at the same time, each with 2 main (DDP) processes, where each main process uses 6 DataLoader worker processes.

In this way, there will be a total of 5*2*6=60 processes accessing /mnt/share/ImageNet/* at the same time.
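
To make the process count concrete, each job roughly does the following in every DDP process (the transform and batch size are placeholders; only the 2 DDP processes per job, num_workers=6, and the dataset path match the setup above):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

# Each job launches 2 of these processes (one per GPU), e.g. via torchrun --nproc_per_node=2.
dist.init_process_group(backend="nccl")

train_set = datasets.ImageFolder("/mnt/share/ImageNet/train",
                                 transform=transforms.ToTensor())
sampler = DistributedSampler(train_set)

# 6 worker processes per DDP process read images from the NFS mount, so
# 5 jobs x 2 DDP processes x 6 workers = 60 concurrent readers.
train_loader = DataLoader(train_set, batch_size=64, sampler=sampler, num_workers=6)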

OpenPAI Environment:

  • OpenPAI version: v1.8.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.5 LTS
  • Kernel (e.g. uname -a): Linux 4.15.0-151-generic #157-Ubuntu SMP Fri Jul 9 23:07:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:
The storage /mnt/share is exported by the nfs-kernel-server of the host machine.

Any NFS related logs?

NFS has no logs by default, and it is not convenient for me to restart it in the production environment. The issue does not appear when I reduce the number of jobs accessing the same dataset to 2 (24 processes).

Here is my config for nfs-kernel-server:

csip@csip-091:~$ cat /etc/exports 
# /etc/exports: the access control list for filesystems which may be exported
#		to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
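# Options on the export below: async lets the server acknowledge writes before
# they reach disk, no_root_squash keeps client root as root on the server, and
# insecure accepts requests from source ports above 1023.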
/home/data 172.17.175.0/255.255.255.0(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)

Here is my PV/PVC:

root@csip-dev-box-openpai092:/cluster-configuration/storage# cat share.yaml 
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: share-pv
  labels:
    name: share
spec:
  capacity:
    storage: 30Ti
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
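    # soft with timeo=20 (2.0 s) and retrans=1: an unanswered NFS request is
    # retransmitted once and then returns an I/O error to the application
    # instead of retrying indefinitely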
    - nfsvers=4.1
    - soft
    - retry=0
    - retrans=1
    - timeo=20
  nfs:
    path: /share
    server: 172.17.175.90
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: share
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 30Ti
  selector:
    matchLabels:
      name: share