ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)
kgopfa opened this issue
Kudzai Gopfa commented
Problem Description
After completing the setup for CodeLlama from the README.md, when I attempt to run any of the examples with the specified commands:
torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 128 --max_batch_size 4
OR
torchrun --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 192 --max_batch_size 4
OR
torchrun --nproc_per_node 1 example_instructions.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4
I get the following output and error:
Output
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 31383) of binary: /home/abc/miniconda3/envs/llama_env/bin/python
Traceback (most recent call last):
File "/home/abc/miniconda3/envs/llama_env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example_completion.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-10_13:12:17
host : ABC-PC.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 31383)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 31383
======================================================
Runtime Environment
- Model: [CodeLlama-7b, CodeLlama-7b-Instruct, CodeLlama-7b-Python]
- Using via huggingface?: [no]
- OS: [Linux/Ubuntu (via WSL2), Windows]
- GPU VRAM: 4GB
- Number of GPUs: 1
- GPU Make: [Nvidia]
- GPU Version: NVIDIA GeForce GTX 1650
Additional context
I am trying to run the models on Ubuntu through WSL 2. I tried setting the batch size to 6 (`--max_batch_size 6`) as was mentioned in llama #706, but this did not help.
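Exit code -9 corresponds to SIGKILL, which on Linux typically means the kernel's OOM killer terminated the process because it ran out of memory while loading the checkpoint. A minimal sketch for confirming that, assuming a standard Linux/WSL2 setup (the exact kernel-log wording varies by kernel version):

```shell
# In a second terminal, watch free memory while the example loads
watch -n 1 free -h

# After the crash, look for the OOM killer's entry in the kernel log
sudo dmesg | grep -i -E "out of memory|killed process"
```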
ckeisc807 commented
I met the same issue. I found via `htop` that I had run out of RAM. Try `.wslconfig` to give WSL more RAM.
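A minimal `.wslconfig` sketch along those lines; the file lives at `%UserProfile%\.wslconfig` on the Windows side, and the sizes here are illustrative assumptions rather than values from this thread (a 7B model in fp16 needs roughly 14 GB just to load):

```ini
# %UserProfile%\.wslconfig -- example values, adjust to your machine
[wsl2]
memory=12GB   # raise the RAM cap WSL2 gives the Linux VM
swap=16GB     # extra swap as headroom while the checkpoint loads
```

After saving the file, run `wsl --shutdown` from Windows and reopen the distro so the new limits take effect.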