ValueError: No slot '1' specified on host 'localhost'
ArlanCooper opened this issue · comments
cooper commented
Reminder
- I have searched the GitHub Discussions and issues and have not found anything similar to this.
Environment
- OS: Ubuntu 22.04
- Python: 3.10.12
- PyTorch: 2.0.1
- CUDA: 11.8
Current Behavior
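I ran the official fine-tuning code, with the only change being that I restricted it to the second GPU (this machine has four A100s). The command is: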
CUDA_VISIBLE_DEVICES=1 bash finetune/scripts/run_sft_Yi_6b.sh
Expected Behavior
No response
Steps to Reproduce
Run the command:
CUDA_VISIBLE_DEVICES=1 bash finetune/scripts/run_sft_Yi_6b.sh
Error message:
[2024-02-06 10:28:18,407] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-06 10:28:20,906] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1
Traceback (most recent call last):
  File "/home/powerop/work/conda/envs/yi/bin/deepspeed", line 6, in <module>
    main()
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 426, in main
    active_resources = parse_inclusion_exclusion(resource_pool, args.include, args.exclude)
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 350, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources, include_str=inclusion, exclude_str=exclusion)
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 302, in parse_resource_filter
    raise ValueError(f"No slot '{slot}' specified on host '{hostname}'")
ValueError: No slot '1' specified on host 'localhost'
Anything Else?
No response
cooper commented
According to the official documentation, the GPU cannot be selected with CUDA_VISIBLE_DEVICES; you need to use deepspeed --include localhost:1 instead. This is resolved.
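
For anyone who hits the same error, here is a rough sketch of why the slot check likely fails. It rests on my reading of the log above and is an assumption, not something confirmed in this thread: with CUDA_VISIBLE_DEVICES=1 only one GPU is visible and CUDA renumbers it as device 0, while the launcher copies the original id into --include=localhost:1 and then looks for slot 1. The functions visible_slots and check_include are hypothetical names for illustration only; this is a simplified model, not DeepSpeed's actual code.

```python
# Simplified, hypothetical model of the launcher's slot check.
# Assumption: this mirrors the behavior seen in the log; it is NOT DeepSpeed's code.


def visible_slots(cuda_visible_devices: str) -> list[int]:
    """Return the slot ids a process can see once the mask is applied.

    CUDA renumbers the visible devices starting from 0, so a mask of "1"
    exposes exactly one device, and that device is slot 0.
    """
    ids = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return list(range(len(ids)))


def check_include(available: list[int], include: str) -> list[int]:
    """Validate an include string such as 'localhost:1' against the slots."""
    hostname, slot_str = include.split(":")
    requested = [int(s) for s in slot_str.split(",")]
    for slot in requested:
        if slot not in available:
            raise ValueError(f"No slot '{slot}' specified on host '{hostname}'")
    return requested


# Masked case: one visible device (slot 0), but slot 1 is requested.
try:
    check_include(visible_slots("1"), "localhost:1")
except ValueError as err:
    print(err)

# Unmasked case on a 4-GPU machine: slot 1 exists, so the request is accepted.
print(check_include([0, 1, 2, 3], "localhost:1"))
```

Under this model the first call reproduces the error message from the traceback, and the second succeeds, which is consistent with the fix of dropping CUDA_VISIBLE_DEVICES and passing --include localhost:1 to deepspeed directly.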