ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)
kgopfa opened this issue
Kudzai Gopfa commented
Problem Description
After completing the setup for CodeLlama from the README.md, when I attempt to run any of the examples with the specified commands:
torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 128 --max_batch_size 4
OR
torchrun --nproc_per_node 1 example_infilling.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 192 --max_batch_size 4
OR
torchrun --nproc_per_node 1 example_instructions.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4
I get the following output and error:
Output
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 31383) of binary: /home/abc/miniconda3/envs/llama_env/bin/python
Traceback (most recent call last):
File "/home/abc/miniconda3/envs/llama_env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/abc/miniconda3/envs/llama_env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example_completion.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-10_13:12:17
host : ABC-PC.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 31383)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 31383
======================================================
Runtime Environment
- Model: [CodeLlama-7b, CodeLlama-7b-Instruct, CodeLlama-7b-Python]
- Using via huggingface?: [no]
- OS: [Linux/Ubuntu (via WSL2), Windows]
- GPU VRAM: 4GB
- Number of GPUs: 1
- GPU Make: [Nvidia]
- GPU Version: NVIDIA GeForce GTX 1650
Additional context
I am trying to run the models on Ubuntu through WSL 2. I tried setting the batch size to 6 (`--max_batch_size 6`) as was mentioned in llama #706, but this did not help.
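Exit code -9 corresponds to SIGKILL, which on Linux typically means the kernel's OOM killer terminated the process because it ran out of memory while loading the checkpoint. A minimal sketch for confirming that, assuming a standard Linux/WSL2 setup (the exact kernel-log wording varies by kernel version):

```shell
# In a second terminal, watch free memory while the example loads
watch -n 1 free -h

# After the crash, look for the OOM killer's entry in the kernel log
sudo dmesg | grep -i -E "out of memory|killed process"
```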
ckeisc807 commented
I met the same issue. I found via `htop` that I had run out of RAM. Try `.wslconfig` to give WSL more RAM.
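A minimal `.wslconfig` sketch along those lines; the file lives at `%UserProfile%\.wslconfig` on the Windows side, and the sizes here are illustrative assumptions rather than values from this thread (a 7B model in fp16 needs roughly 14 GB just to load):

```ini
# %UserProfile%\.wslconfig -- example values, adjust to your machine
[wsl2]
memory=12GB   # raise the RAM cap WSL2 gives the Linux VM
swap=16GB     # extra swap as headroom while the checkpoint loads
```

After saving the file, run `wsl --shutdown` from Windows and reopen the distro so the new limits take effect.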