AssertionError: Loading a checkpoint for MP=0 but world size is 1
IsraelAbebe opened this issue
Israel Abebe commented
Any idea what this error is and why it happens?
AssertionError: Loading a checkpoint for MP=0 but world size is 1
[2023-12-31 16:19:42,100] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 115) of binary: /usr/bin/python3.9
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/azime/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-31_16:19:42
host : azime-36475.0-balder.hpc.uni-saarland.de
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 115)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
My inference command looks like this:
torchrun --nproc_per_node 1 example.py \
--ckpt_dir LLaMA-7B/7B \
--tokenizer_path LLaMA-7B/tokenizer.model \
--adapter_path LLaMA-7B/llama_adapter_len10_layer30_release.pth \
--quantizer False
I used THIS set of weights, with the adapters from this repo. The directory looks like this:
LLaMA-7B/
├── checklist.chk
├── consolidated.00.pth
├── llama_adapter_len10_layer30_release.pth
├── params.json
├── README.md
└── tokenizer.model
Jiaming Han commented
Check if meta-llama/llama#40 helps
Haolan commented
I ran into this problem too. Try printing "ckpt_dir" in your code; I suspect the Fire Python library is failing to parse the arguments correctly.
I worked around this by using argparse.
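A minimal sketch of that workaround, for anyone landing here. It assumes the assertion comes from a check that counts the *.pth files in ckpt_dir and compares that count to the world size (which is how "MP=0" would arise when nothing is found). The flag names are copied from the command in the original post; the helper names parse_args, str2bool and count_checkpoints are hypothetical and not part of the repo.

```python
import argparse
from pathlib import Path


def str2bool(value: str) -> bool:
    """Turn 'True'/'False'-style strings into booleans.

    argparse's type=bool would treat the string 'False' as True,
    so the conversion is done explicitly here.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Parse the same flags as the torchrun command above (names assumed from that command)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_dir", type=str, required=True)
    parser.add_argument("--tokenizer_path", type=str, required=True)
    parser.add_argument("--adapter_path", type=str, required=True)
    parser.add_argument("--quantizer", type=str2bool, default=False)
    return parser.parse_args(argv)


def count_checkpoints(ckpt_dir: str) -> int:
    """Count *.pth files directly inside ckpt_dir (assumed to be what 'MP=' reports)."""
    return len(sorted(Path(ckpt_dir).glob("*.pth")))


# Possible use at the top of the script, before the model is loaded:
#   args = parse_args()
#   print(f"ckpt_dir = {args.ckpt_dir!r}")
#   print(f"checkpoint files found: {count_checkpoints(args.ckpt_dir)}")
```

If the printed count is 0, the path is either being parsed incorrectly or does not contain the weights. Note that the tree in the original post shows consolidated.00.pth directly under LLaMA-7B/ and no 7B/ subdirectory, which would also be consistent with MP=0 for --ckpt_dir LLaMA-7B/7B.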