01-ai / Yi

A series of large language models trained from scratch by developers @01-ai

Home Page:https://01.ai

ValueError: No slot '1' specified on host 'localhost'

ArlanCooper opened this issue

Reminder

  • I have searched the Github Discussion and issues and have not found anything similar to this.

Environment

- OS: Ubuntu 22.04
- Python: 3.10.12
- PyTorch: 2.0.1
- CUDA: 11.8

Current Behavior

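I ran the fine-tuning script from the official repository, with the only change being that I restricted it to the second GPU (my machine has four A100s). The command was: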


CUDA_VISIBLE_DEVICES=1 bash finetune/scripts/run_sft_Yi_6b.sh

Expected Behavior

No response

Steps to Reproduce

Run the command:

CUDA_VISIBLE_DEVICES=1 bash finetune/scripts/run_sft_Yi_6b.sh

Error message:

[2024-02-06 10:28:18,407] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-06 10:28:20,906] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1
Traceback (most recent call last):
  File "/home/powerop/work/conda/envs/yi/bin/deepspeed", line 6, in <module>
    main()
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 426, in main
    active_resources = parse_inclusion_exclusion(resource_pool, args.include, args.exclude)
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 350, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources, include_str=inclusion, exclude_str=exclusion)
  File "/home/powerop/work/conda/envs/yi/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 302, in parse_resource_filter
    raise ValueError(f"No slot '{slot}' specified on host '{hostname}'")
ValueError: No slot '1' specified on host 'localhost'

Anything Else?

No response

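According to the official documentation, the GPU cannot be selected through CUDA_VISIBLE_DEVICES; you have to use deepspeed --include localhost:1 instead. With that change the problem is resolved.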
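
For anyone who hits the same error, here is a rough sketch of why it probably happens. It is a simplified model of a launcher's slot filter, not DeepSpeed's actual code: `filter_slots` and `visible_pool` are hypothetical names, and the idea that the launcher numbers the GPUs it can see starting from 0 is my assumption, not something checked against the DeepSpeed source. Under that assumption, CUDA_VISIBLE_DEVICES=1 leaves one visible device (slot 0), while the log shows the launcher turning the variable into --include=localhost:1, so slot 1 is not found. Without the mask, all four slots exist and --include localhost:1 picks the second GPU.

```python
# Simplified model of a launcher's GPU slot filter.
# NOT DeepSpeed's real implementation; names and behavior are illustrative assumptions.

def filter_slots(resource_pool, include):
    """Select slots from resource_pool using a string like 'localhost:1' or 'localhost:0,2'."""
    hostname, _, slot_str = include.partition(":")
    if hostname not in resource_pool:
        raise ValueError(f"Hostname '{hostname}' not found in resource pool")
    available = resource_pool[hostname]
    if not slot_str:
        # No slot list given: keep every slot on that host.
        return {hostname: list(available)}
    selected = []
    for slot in (int(s) for s in slot_str.split(",")):
        if slot not in available:
            raise ValueError(f"No slot '{slot}' specified on host '{hostname}'")
        selected.append(slot)
    return {hostname: selected}


def visible_pool(num_visible_gpus):
    """Assumption: the launcher numbers the GPUs it can see as 0..n-1."""
    return {"localhost": list(range(num_visible_gpus))}


# Case 1: CUDA_VISIBLE_DEVICES=1 masks three of the four GPUs, so only slot 0 exists,
# yet the include filter derived from the variable asks for slot 1.
try:
    filter_slots(visible_pool(1), "localhost:1")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Case 2: no mask, four visible GPUs, explicit include of slot 1 (the second GPU).
selected = filter_slots(visible_pool(4), "localhost:1")

print(error_message)
print(selected)
```

In practice this means leaving CUDA_VISIBLE_DEVICES unset and passing --include localhost:1 to the deepspeed command itself, as described above.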