repeated failures to launch `run_clm_no_trainer.py` from `run_opt_clm.sh`
MEllis-github opened this issue
Overview
The following error is repeatedly encountered when running the benchmarking script run_opt_clm.sh
(environment and steps to reproduce are noted below). A full log for the run with GPUNUM=1 and MAX_JOBS=1 is attached: error_1gpu1maxjob_092722.log.txt. Please advise on next steps for a successful launch.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1082) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run_clm_no_trainer.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-09-27_19:18:54
host : pytorchjob-test-09272022-145936-master-0
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1082)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1082
======================================================
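Exit code -9 means the child process received SIGKILL before torchrun could write an error file; in containerized runs this is most often the kernel (or cgroup) OOM killer reacting to host-RAM exhaustion rather than a Python-level crash. A quick triage sketch, assuming the container can read kernel logs and uses the cgroup v1 memory layout (under cgroup v2 the limit lives in /sys/fs/cgroup/memory.max instead):

# Check whether the OOM killer fired and what the memory headroom looks like.
dmesg -T | grep -iE 'out of memory|killed process' | tail -n 5
free -h                                           # host RAM at a glance
cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # container memory cap (cgroup v1)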
Steps taken
Simplified the setup and reproduced the same behavior with the following:
- running image: https://github.com/orgs/hpcaitech/packages/container/pytorch-cuda/42358580?tag=1.12.0-11.3.0
- setup per https://github.com/hpcaitech/OPT-Benchmark#run-benchmarking (note: the same behavior occurs with and without MAX_JOBS=1 set in the environment):

git clone https://github.com/hpcaitech/OPT-Benchmark.git
cd OPT-Benchmark/

# assuming using cuda 11.3
conda install -y pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm

# run with deepspeed zero 3 + offloading
# GPUNUM is per node
mkdir logs/
MEMCAP=80 GPUNUM=1 bash ./run_opt_clm.sh
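For scale: the manual invocation below suggests the script defaults to facebook/opt-13b, and with ZeRO stage 3 plus offloading (per the comments above) the fp32 master weights and Adam states are kept in host RAM. A back-of-envelope sketch of that footprint, assuming the usual 12 bytes per parameter on the CPU side:

# Rough host-RAM estimate for ZeRO-3 optimizer offloading with opt-13b.
# Assumption: fp32 master params + Adam momentum + variance are offloaded.
PARAMS_BILLIONS=13
BYTES_PER_PARAM=12   # 4 (fp32 param) + 4 (momentum) + 4 (variance)
echo "~$(( PARAMS_BILLIONS * BYTES_PER_PARAM )) GB host RAM"   # ~156 GB

If the pod's memory limit is below that ballpark, a SIGKILL during optimizer initialization is the expected symptom.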
Additional details
In the running container, after the installations above:
sh-4.4# which torchrun
/opt/conda/bin/torchrun
sh-4.4# python --version
Python 3.9.12
sh-4.4# pip --version
pip 22.1.2 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
sh-4.4# pip3 --version
pip 22.1.2 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
sh-4.4# conda -V
conda 22.9.0
sh-4.4# conda info
active environment : None
user config file : /root/.condarc
populated config files :
conda version : 22.9.0
conda-build version : not installed
python version : 3.9.12.final.0
virtual packages : __cuda=11.6=0
__linux=4.18.0=0
__glibc=2.28=0
__unix=0=0
__archspec=1=x86_64
base environment : /opt/conda (writable)
conda av data dir : /opt/conda/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/conda/pkgs
/root/.conda/pkgs
envs directories : /opt/conda/envs
/root/.conda/envs
platform : linux-64
user-agent : conda/22.9.0 requests/2.27.1 CPython/3.9.12 Linux/4.18.0-305.45.1.el8_4.x86_64 rhel/8.5 glibc/2.28
UID:GID : 0:0
netrc file : None
offline mode : False
sh-4.4# conda list
# packages in environment at /opt/conda:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
accelerate 0.10.0 pypi_0 pypi
aiohttp 3.8.3 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
apex 0.1 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 22.1.0 pypi_0 pypi
bcrypt 4.0.0 pypi_0 pypi
blas 1.0 mkl
brotlipy 0.7.0 py39h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
ca-certificates 2022.07.19 h06a4308_0
certifi 2022.9.14 py39h06a4308_0
cffi 1.15.0 py39hd667e15_1
cfgv 3.3.1 pypi_0 pypi
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.3 pypi_0 pypi
colorama 0.4.4 pyhd3eb1b0_0
colossalai 0.1.10+torch1.11cu11.3 pypi_0 pypi
commonmark 0.9.1 pypi_0 pypi
conda 22.9.0 py39h06a4308_0
conda-content-trust 0.1.1 pyhd3eb1b0_0
conda-package-handling 1.8.1 py39h7f8727e_0
contexttimer 0.3.3 pypi_0 pypi
cryptography 36.0.0 py39h9ce1e76_0
cudatoolkit 11.3.1 h2bc3f7f_2
datasets 1.18.4 pypi_0 pypi
deepspeed 0.6.5 pypi_0 pypi
dill 0.3.5.1 pypi_0 pypi
distlib 0.3.6 pypi_0 pypi
fabric 2.7.1 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.8.0 pypi_0 pypi
freetype 2.11.0 h70c0345_0
frozenlist 1.3.1 pypi_0 pypi
fsspec 2022.8.2 pypi_0 pypi
giflib 5.2.1 h7b6447c_0
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
hjson 3.1.0 pypi_0 pypi
huggingface-hub 0.9.1 pypi_0 pypi
identify 2.5.5 pypi_0 pypi
idna 3.3 pyhd3eb1b0_0
intel-openmp 2021.4.0 h06a4308_3561
invoke 1.7.1 pypi_0 pypi
jpeg 9e h7f8727e_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.35.1 h7274673_9
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libgomp 9.3.0 h5101ec6_17
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.3.0 hd4cf53a_17
libtasn1 4.16.0 h27cfd23_0
libtiff 4.2.0 h2818925_1
libunistring 0.9.10 h27cfd23_0
libuv 1.40.0 h7b6447c_0
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
lz4-c 1.9.3 h295c915_1
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
ncurses 6.3 h7f8727e_2
nettle 3.7.3 hbbd107a_1
ninja 1.10.2.3 pypi_0 pypi
nodeenv 1.7.0 pypi_0 pypi
numpy 1.22.3 py39he7a7128_0
numpy-base 1.22.3 py39hf524024_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pypi_0 pypi
pandas 1.5.0 pypi_0 pypi
paramiko 2.11.0 pypi_0 pypi
pathlib2 2.3.7.post1 pypi_0 pypi
pillow 9.0.1 py39h22f2fdc_0
pip 22.1.2 py39h06a4308_0
platformdirs 2.5.2 pypi_0 pypi
pre-commit 2.20.0 pypi_0 pypi
psutil 5.9.2 pypi_0 pypi
py-cpuinfo 8.0.0 pypi_0 pypi
pyarrow 9.0.0 pypi_0 pypi
pycosat 0.6.3 py39h27cfd23_0
pycparser 2.21 pyhd3eb1b0_0
pygments 2.13.0 pypi_0 pypi
pynacl 1.5.0 pypi_0 pypi
pyopenssl 22.0.0 pyhd3eb1b0_0
pyparsing 3.0.9 pypi_0 pypi
pysocks 1.7.1 py39h06a4308_0
python 3.9.12 h12debd9_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch 1.11.0 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.2.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.1.2 h7f8727e_1
regex 2022.9.13 pypi_0 pypi
requests 2.27.1 pyhd3eb1b0_0
responses 0.18.0 pypi_0 pypi
rich 12.5.1 pypi_0 pypi
ruamel_yaml 0.15.100 py39h27cfd23_0
setuptools 61.2.0 py39h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.38.2 hc218d9a_0
tk 8.6.11 h1ccaba5_0
tokenizers 0.12.1 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
toolz 0.11.2 pyhd3eb1b0_0
torchaudio 0.11.0 py39_cu113 pytorch
torchvision 0.12.0 py39_cu113 pytorch
tqdm 4.63.0 pyhd3eb1b0_0
transformers 4.21.0 pypi_0 pypi
typing_extensions 4.1.1 pyh06a4308_0
tzdata 2022a hda174b7_0
urllib3 1.26.8 pyhd3eb1b0_0
virtualenv 20.16.5 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yarl 1.8.1 pypi_0 pypi
zlib 1.2.12 h7f8727e_1
zstd 1.5.2 ha4553b6_0
sh-4.4# python
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.11.0'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> print(torch.version.cuda)
11.3
>>> print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version')
11030 cuda compiled version
Following the mentioned steps, I could not reproduce the error described in the issue. I will post your issue in Slack and hope this helps!
Directly running the torchrun command as shown in the script, or modified as follows for 1 GPU, leads to the same error.
sh-4.4# export MEMCAP=80; export BS=32; export MODEL="13b"
sh-4.4# torchrun --nnodes=${WORLD_SIZE} \
> --node_rank=${RANK} \
> --nproc_per_node=1 \
> --rdzv_id=101 \
> --rdzv_endpoint="pytorchjob-master-0:$MASTER_PORT" \
> run_clm_no_trainer.py \
> --dataset_name wikitext \
> --dataset_config_name wikitext-2-raw-v1 \
> --model_name_or_path facebook/opt-${MODEL} \
> --output_dir /tmp/test-clm \
> --mem_cap ${MEMCAP} \
> --per_device_train_batch_size ${BS} 2>&1 | tee logs/ds_${MODEL}_bs_${BS}_cap_${MEMCAP}_gpu_${WORLD_SIZE}.log
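The command above relies on WORLD_SIZE, RANK, and MASTER_PORT already being exported (here they are injected by the PyTorchJob operator). For a standalone single-node sanity check, hypothetical stand-ins along these lines should be equivalent:

# Hypothetical single-node stand-ins for the operator-injected variables.
export WORLD_SIZE=1       # one node (consumed as --nnodes above)
export RANK=0             # this node's rank (consumed as --node_rank)
export MASTER_PORT=29500  # any free port for rendezvous
# On a single node, localhost can replace the pod hostname:
#   --rdzv_endpoint="localhost:${MASTER_PORT}"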
Error log tail:
pytorchjob-master-0:1141:1235 [0] NCCL INFO comm 0x7f140c002fb0 rank 0 nranks 1 cudaDev 0 busId 8010 - Init COMPLETE
[2022-10-12 15:28:27,873] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-10-12 15:28:27,874] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-10-12 15:28:27,874] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-10-12 15:28:27,951] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2022-10-12 15:28:27,951] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-12 15:28:27,951] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-10-12 15:28:27,951] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-10-12 15:28:27,961] [INFO] [stage3.py:275:__init__] Reduce bucket size 500000000
[2022-10-12 15:28:27,961] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu113/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Using envvar MAX_JOBS (1) as the number of workers...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.9/site-packages/torch/include -isystem /opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.9/site-packages/torch/include/THC -isystem /opt/conda/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /opt/conda/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/opt/conda/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 12.750416040420532 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1141) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run_clm_no_trainer.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-12_15:31:55
host : pytorchjob-test-10122022-105113-master-0
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1141)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1141
======================================================
Using a smaller model (facebook/opt-125m) than the default for the 1x80GB GPU run got past this error. It then crashed ~9% into stage 3; I will follow up on the latter in a separate issue if necessary.
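For reference, a sketch of the smaller-model rerun, assuming run_opt_clm.sh honors a MODEL environment variable the same way the manual invocation above does (if not, the --model_name_or_path value inside the script needs editing instead):

# Rerun with the smallest OPT variant to take host-RAM pressure out of the picture.
mkdir -p logs/
MODEL="125m" MEMCAP=80 GPUNUM=1 bash ./run_opt_clm.sh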