AssertionError: Loading a checkpoint for MP=0 but world size is 1

Question

IsraelAbebe opened this issue 5 months ago · comments

Any idea what this error is and why it happens?

AssertionError: Loading a checkpoint for MP=0 but world size is 1
[2023-12-31 16:19:42,100] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 115) of binary: /usr/bin/python3.9
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-31_16:19:42
  host      : azime-36475.0-balder.hpc.uni-saarland.de
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 115)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My inference command looks like this:

torchrun --nproc_per_node 1 example.py \
--ckpt_dir LLaMA-7B/7B \
--tokenizer_path LLaMA-7B/tokenizer.model \
--adapter_path LLaMA-7B/llama_adapter_len10_layer30_release.pth \
--quantizer False

I used these weights, with the adapters from this repo:

LLaMA-7B/
├── checklist.chk
├── consolidated.00.pth
├── llama_adapter_len10_layer30_release.pth
├── params.json
├── README.md
└── tokenizer.model

Jiaming Han · Answer 1 · Thu Jan 04 2024 20:46:52 GMT+0800 (China Standard Time)

Check if meta-llama/llama#40 helps

Haolan · Answer 2 · Wed Jan 24 2024 04:34:44 GMT+0800 (China Standard Time)

I ran into this problem too. Try printing "ckpt_dir" in your code; I suspect the Python Fire library is failing to parse the arguments correctly.
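
A minimal sketch of that check, assuming the script derives the MP value in the assertion by counting the `*.pth` files directly inside `ckpt_dir` (the original LLaMA example script does this; the loader in this repo may differ). The helper name `count_checkpoint_shards` is hypothetical and not part of the repo.

```python
from pathlib import Path


def count_checkpoint_shards(ckpt_dir: str) -> int:
    """Hypothetical helper: count the *.pth files directly inside ckpt_dir."""
    path = Path(ckpt_dir)
    # Print exactly what the script received, so a mis-parsed argument is visible.
    print(f"ckpt_dir as received: {ckpt_dir!r} (is a directory: {path.is_dir()})")
    # A missing directory is treated as containing no shards.
    shards = sorted(path.glob("*.pth")) if path.is_dir() else []
    for shard in shards:
        print(f"  found shard: {shard.name}")
    return len(shards)
```

If this returns 0, the value passed as `ckpt_dir` does not point at a directory containing checkpoint files, which would match the MP=0 in the error. Note that the listing in the question shows `consolidated.00.pth` directly under `LLaMA-7B/`, while the command passes `LLaMA-7B/7B`, so the path itself may also be worth checking.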

I worked around this by using argparse.
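
A rough sketch of what such an argparse replacement could look like. The flag names are taken from the command in the question; the function names `parse_args` and `str2bool` are hypothetical, and how example.py consumes the parsed values is not shown here.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False'-style strings, since argparse has no built-in bool type."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical replacement for the Fire-based argument handling."""
    parser = argparse.ArgumentParser(description="Inference arguments (sketch)")
    parser.add_argument("--ckpt_dir", required=True,
                        help="directory containing the *.pth checkpoint files")
    parser.add_argument("--tokenizer_path", required=True)
    parser.add_argument("--adapter_path", required=True)
    # Accepts "--quantizer False" as written in the question's command.
    parser.add_argument("--quantizer", type=str2bool, default=False)
    return parser.parse_args(argv)
```

Calling `parse_args()` at the start of the script and printing `args.ckpt_dir` confirms which path actually reaches the checkpoint loader.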