repeated failures to launch `run_clm_no_trainer.py` from `run_opt_clm.sh`
MEllis-github opened this issue
Overview
The following error is repeatedly encountered when running the benchmarking script run_opt_clm.sh
(environment and steps to reproduce are noted below). A full log for the run with GPUNUM=1 and MAX_JOBS=1 is attached: error_1gpu1maxjob_092722.log.txt. Please advise on next steps for a successful launch.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1082) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run_clm_no_trainer.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-09-27_19:18:54
host : pytorchjob-test-09272022-145936-master-0
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1082)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1082
======================================================
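Exit code -9 means the child process received SIGKILL before torchrun could write an error file; in containerized runs this is most often the kernel (or cgroup) OOM killer reacting to host-RAM exhaustion rather than a Python-level crash. A quick triage sketch, assuming the container can read kernel logs and uses the cgroup v1 memory layout (under cgroup v2 the limit lives in /sys/fs/cgroup/memory.max instead):

# Check whether the OOM killer fired and what the memory headroom looks like.
dmesg -T | grep -iE 'out of memory|killed process' | tail -n 5
free -h                                           # host RAM at a glance
cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # container memory cap (cgroup v1)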
Steps taken
Simplified the setup and reproduced the same behavior with the following:
- running image: https://github.com/orgs/hpcaitech/packages/container/pytorch-cuda/42358580?tag=1.12.0-11.3.0
- setup per https://github.com/hpcaitech/OPT-Benchmark#run-benchmarking (note: the same behavior occurs with and without MAX_JOBS=1 set in the environment):

git clone https://github.com/hpcaitech/OPT-Benchmark.git
cd OPT-Benchmark/

# assuming using cuda 11.3
conda install -y pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm

# run with deepspeed zero 3 + offloading
# GPUNUM is per node
mkdir logs/
MEMCAP=80 GPUNUM=1 bash ./run_opt_clm.sh
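For scale: the manual invocation below suggests the script defaults to facebook/opt-13b, and with ZeRO stage 3 plus offloading (per the comments above) the fp32 master weights and Adam states are kept in host RAM. A back-of-envelope sketch of that footprint, assuming the usual 12 bytes per parameter on the CPU side:

# Rough host-RAM estimate for ZeRO-3 optimizer offloading with opt-13b.
# Assumption: fp32 master params + Adam momentum + variance are offloaded.
PARAMS_BILLIONS=13
BYTES_PER_PARAM=12   # 4 (fp32 param) + 4 (momentum) + 4 (variance)
echo "~$(( PARAMS_BILLIONS * BYTES_PER_PARAM )) GB host RAM"   # ~156 GB

If the pod's memory limit is below that ballpark, a SIGKILL during optimizer initialization is the expected symptom.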
Additional details
In the running container, after the installations above:
sh-4.4# which torchrun
/opt/conda/bin/torchrun
sh-4.4# python --version
Python 3.9.12
sh-4.4# pip --version
pip 22.1.2 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
sh-4.4# pip3 --version
pip 22.1.2 from /opt/conda/lib/python3.9/site-packages/pip (python 3.9)
sh-4.4# conda -V
conda 22.9.0
sh-4.4# conda info
active environment : None
user config file : /root/.condarc
populated config files :
conda version : 22.9.0
conda-build version : not installed
python version : 3.9.12.final.0
virtual packages : __cuda=11.6=0
__linux=4.18.0=0
__glibc=2.28=0
__unix=0=0
__archspec=1=x86_64
base environment : /opt/conda (writable)
conda av data dir : /opt/conda/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /opt/conda/pkgs
/root/.conda/pkgs
envs directories : /opt/conda/envs
/root/.conda/envs
platform : linux-64
user-agent : conda/22.9.0 requests/2.27.1 CPython/3.9.12 Linux/4.18.0-305.45.1.el8_4.x86_64 rhel/8.5 glibc/2.28
UID:GID : 0:0
netrc file : None
offline mode : False
sh-4.4# conda list
# packages in environment at /opt/conda:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
accelerate 0.10.0 pypi_0 pypi
aiohttp 3.8.3 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
apex 0.1 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 22.1.0 pypi_0 pypi
bcrypt 4.0.0 pypi_0 pypi
blas 1.0 mkl
brotlipy 0.7.0 py39h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
ca-certificates 2022.07.19 h06a4308_0
certifi 2022.9.14 py39h06a4308_0
cffi 1.15.0 py39hd667e15_1
cfgv 3.3.1 pypi_0 pypi
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.1.3 pypi_0 pypi
colorama 0.4.4 pyhd3eb1b0_0
colossalai 0.1.10+torch1.11cu11.3 pypi_0 pypi
commonmark 0.9.1 pypi_0 pypi
conda 22.9.0 py39h06a4308_0
conda-content-trust 0.1.1 pyhd3eb1b0_0
conda-package-handling 1.8.1 py39h7f8727e_0
contexttimer 0.3.3 pypi_0 pypi
cryptography 36.0.0 py39h9ce1e76_0
cudatoolkit 11.3.1 h2bc3f7f_2
datasets 1.18.4 pypi_0 pypi
deepspeed 0.6.5 pypi_0 pypi
dill 0.3.5.1 pypi_0 pypi
distlib 0.3.6 pypi_0 pypi
fabric 2.7.1 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.8.0 pypi_0 pypi
freetype 2.11.0 h70c0345_0
frozenlist 1.3.1 pypi_0 pypi
fsspec 2022.8.2 pypi_0 pypi
giflib 5.2.1 h7b6447c_0
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
hjson 3.1.0 pypi_0 pypi
huggingface-hub 0.9.1 pypi_0 pypi
identify 2.5.5 pypi_0 pypi
idna 3.3 pyhd3eb1b0_0
intel-openmp 2021.4.0 h06a4308_3561
invoke 1.7.1 pypi_0 pypi
jpeg 9e h7f8727e_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.35.1 h7274673_9
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libgomp 9.3.0 h5101ec6_17
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.3.0 hd4cf53a_17
libtasn1 4.16.0 h27cfd23_0
libtiff 4.2.0 h2818925_1
libunistring 0.9.10 h27cfd23_0
libuv 1.40.0 h7b6447c_0
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
lz4-c 1.9.3 h295c915_1
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
ncurses 6.3 h7f8727e_2
nettle 3.7.3 hbbd107a_1
ninja 1.10.2.3 pypi_0 pypi
nodeenv 1.7.0 pypi_0 pypi
numpy 1.22.3 py39he7a7128_0
numpy-base 1.22.3 py39hf524024_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pypi_0 pypi
pandas 1.5.0 pypi_0 pypi
paramiko 2.11.0 pypi_0 pypi
pathlib2 2.3.7.post1 pypi_0 pypi
pillow 9.0.1 py39h22f2fdc_0
pip 22.1.2 py39h06a4308_0
platformdirs 2.5.2 pypi_0 pypi
pre-commit 2.20.0 pypi_0 pypi
psutil 5.9.2 pypi_0 pypi
py-cpuinfo 8.0.0 pypi_0 pypi
pyarrow 9.0.0 pypi_0 pypi
pycosat 0.6.3 py39h27cfd23_0
pycparser 2.21 pyhd3eb1b0_0
pygments 2.13.0 pypi_0 pypi
pynacl 1.5.0 pypi_0 pypi
pyopenssl 22.0.0 pyhd3eb1b0_0
pyparsing 3.0.9 pypi_0 pypi
pysocks 1.7.1 py39h06a4308_0
python 3.9.12 h12debd9_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch 1.11.0 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.2.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.1.2 h7f8727e_1
regex 2022.9.13 pypi_0 pypi
requests 2.27.1 pyhd3eb1b0_0
responses 0.18.0 pypi_0 pypi
rich 12.5.1 pypi_0 pypi
ruamel_yaml 0.15.100 py39h27cfd23_0
setuptools 61.2.0 py39h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.38.2 hc218d9a_0
tk 8.6.11 h1ccaba5_0
tokenizers 0.12.1 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
toolz 0.11.2 pyhd3eb1b0_0
torchaudio 0.11.0 py39_cu113 pytorch
torchvision 0.12.0 py39_cu113 pytorch
tqdm 4.63.0 pyhd3eb1b0_0
transformers 4.21.0 pypi_0 pypi
typing_extensions 4.1.1 pyh06a4308_0
tzdata 2022a hda174b7_0
urllib3 1.26.8 pyhd3eb1b0_0
virtualenv 20.16.5 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yarl 1.8.1 pypi_0 pypi
zlib 1.2.12 h7f8727e_1
zstd 1.5.2 ha4553b6_0
sh-4.4# python
Python 3.9.12 (main, Apr 5 2022, 06:56:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.11.0'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> print(torch.version.cuda)
11.3
>>> print(torch._C._cuda_getCompiledVersion(), 'cuda compiled version')
11030 cuda compiled version
Following the mentioned steps, I could not reproduce the error described in the issue. I will post your issue in Slack and hope this helps!
Directly running the torchrun command as shown in the script, or modified as follows for 1 GPU, leads to the same error.
sh-4.4# export MEMCAP=80; export BS=32; export MODEL="13b"
sh-4.4# torchrun --nnodes=${WORLD_SIZE} \
> --node_rank=${RANK} \
> --nproc_per_node=1 \
> --rdzv_id=101 \
> --rdzv_endpoint="pytorchjob-master-0:$MASTER_PORT" \
> run_clm_no_trainer.py \
> --dataset_name wikitext \
> --dataset_config_name wikitext-2-raw-v1 \
> --model_name_or_path facebook/opt-${MODEL} \
> --output_dir /tmp/test-clm \
> --mem_cap ${MEMCAP} \
> --per_device_train_batch_size ${BS} 2>&1 | tee logs/ds_${MODEL}_bs_${BS}_cap_${MEMCAP}_gpu_${WORLD_SIZE}.log
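The command above relies on WORLD_SIZE, RANK, and MASTER_PORT already being exported (here they are injected by the PyTorchJob operator). For a standalone single-node sanity check, hypothetical stand-ins along these lines should be equivalent:

# Hypothetical single-node stand-ins for the operator-injected variables.
export WORLD_SIZE=1       # one node (consumed as --nnodes above)
export RANK=0             # this node's rank (consumed as --node_rank)
export MASTER_PORT=29500  # any free port for rendezvous
# On a single node, localhost can replace the pod hostname:
#   --rdzv_endpoint="localhost:${MASTER_PORT}"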
Error log tail:
pytorchjob-master-0:1141:1235 [0] NCCL INFO comm 0x7f140c002fb0 rank 0 nranks 1 cudaDev 0 busId 8010 - Init COMPLETE
[2022-10-12 15:28:27,873] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-10-12 15:28:27,874] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-10-12 15:28:27,874] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-10-12 15:28:27,951] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2022-10-12 15:28:27,951] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-12 15:28:27,951] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-10-12 15:28:27,951] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-10-12 15:28:27,961] [INFO] [stage3.py:275:__init__] Reduce bucket size 500000000
[2022-10-12 15:28:27,961] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu113/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Using envvar MAX_JOBS (1) as the number of workers...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.9/site-packages/torch/include -isystem /opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.9/site-packages/torch/include/THC -isystem /opt/conda/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /opt/conda/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/opt/conda/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 12.750416040420532 seconds
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 1141) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
run_clm_no_trainer.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-12_15:31:55
host : pytorchjob-test-10122022-105113-master-0
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1141)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1141
======================================================
Using a smaller model (facebook/opt-125m) than the default for the 1x80GB GPU run got past this error. It then crashed ~9% into stage 3; I will follow up on the latter in a separate issue if necessary.
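For reference, a sketch of the smaller-model rerun, assuming run_opt_clm.sh honors a MODEL environment variable the same way the manual invocation above does (if not, the --model_name_or_path value inside the script needs editing instead):

# Rerun with the smallest OPT variant to take host-RAM pressure out of the picture.
mkdir -p logs/
MODEL="125m" MEMCAP=80 GPUNUM=1 bash ./run_opt_clm.sh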