SpongebBob / Finetune-ChatGLM2-6B

Full-parameter fine-tuning of ChatGLM2-6B, with efficient fine-tuning support for multi-turn dialogue.

RuntimeError: CUDA error: invalid device ordinal

zxy333666 opened this issue

Could anyone advise how to fix this?

╭───────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/chatglm2-6b-code/Finetune-ChatGLM2-6B/main.py:377 in <module> │
│ │
│ 374 │
│ 375 │
│ 376 if __name__ == "__main__": │
│ ❱ 377 │ main() │
│ 378 │
│ │
│ /data/chatglm2-6b-code/Finetune-ChatGLM2-6B/main.py:61 in main │
│ │
│ 58 │ │ # let's parse it to get our arguments. │
│ 59 │ │ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path. │
│ 60 │ else: │
│ ❱ 61 │ │ model_args, data_args, training_args = parser.parse_args_into_dataclasses() │
│ 62 │ # Setup logging │
│ 63 │ logging.basicConfig( │
│ 64 │ │ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/hf_argparser.py:332 in │
│ parse_args_into_dataclasses │
│ │
│ 329 │ │ │ inputs = {k: v for k, v in vars(namespace).items() if k in keys} │
│ 330 │ │ │ for k in keys: │
│ 331 │ │ │ │ delattr(namespace, k) │
│ ❱ 332 │ │ │ obj = dtype(**inputs) │
│ 333 │ │ │ outputs.append(obj) │
│ 334 │ │ if len(namespace.__dict__) > 0: │
│ 335 │ │ │ # additional namespace. │
│ in __init__:113 │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1227 in __post_init__ │
│ │
│ 1224 │ │ if ( │
│ 1225 │ │ │ self.framework == "pt" │
│ 1226 │ │ │ and is_torch_available() │
│ ❱ 1227 │ │ │ and (self.device.type != "cuda") │
│ 1228 │ │ │ and (get_xla_device_type(self.device) != "GPU") │
│ 1229 │ │ │ and (self.fp16 or self.fp16_full_eval) │
│ 1230 │ │ ): │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1660 in device │
│ │
│ 1657 │ │ The device used by this process. │
│ 1658 │ │ """ │
│ 1659 │ │ requires_backends(self, ["torch"]) │
│ ❱ 1660 │ │ return self._setup_devices │
│ 1661 │ │
│ 1662 │ @property │
│ 1663 │ def n_gpu(self): │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:54 in __get__ │
│ │
│ 51 │ │ attr = "__cached_" + self.fget.__name__ │
│ 52 │ │ cached = getattr(obj, attr, None) │
│ 53 │ │ if cached is None: │
│ ❱ 54 │ │ │ cached = self.fget(obj) │
│ 55 │ │ │ setattr(obj, attr, cached) │
│ 56 │ │ return cached │
│ 57 │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1650 in _setup_devices │
│ │
│ 1647 │ │ │
│ 1648 │ │ if device.type == "cuda": │
│ 1649 │ │ │ print(f"------------device--------:{device}") │
│ ❱ 1650 │ │ │ torch.cuda.set_device(device) │
│ 1651 │ │ │
│ 1652 │ │ return device │
│ 1653 │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:326 in set_device │
│ │
│ 323 │ """ │
│ 324 │ device = _get_device_index(device) │
│ 325 │ if device >= 0: │
│ ❱ 326 │ │ torch._C._cuda_setDevice(device) │
│ 327 │
│ 328 │
│ 329 def get_device_name(device: Optional[_device_t] = None) -> str: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
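
"invalid device ordinal" means a process asked for a CUDA device index that does not exist on this machine. A quick sanity check (assuming torch is installed in the same conda environment) is to compare how many GPUs the driver and PyTorch actually see with the number of ranks being launched:

nvidia-smi -L
python -c "import torch; print(torch.cuda.device_count())"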

sh ds_train_finetune.sh
[2023-07-05 08:26:05,121] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-07-05 08:26:05,168] [INFO] [runner.py:541:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=8888 --enable_each_rank_log=None main.py --deepspeed deepspeed.json --do_train --do_eval --train_file /data/chatglm2-6b-code/Finetune-ChatGLM2-6B/data/131w/train.json --validation_file /data/chatglm2-6b-code/Finetune-ChatGLM2-6B/data/131w/validate.json --prompt_column conversations --overwrite_cache --model_name_or_path /data/chatglm2-6b --output_dir /data/chatglm2-6b-code/Finetune-ChatGLM2-6B/output/output0705-1 --overwrite_output_dir --max_length 762 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 12 --predict_with_generate --num_train_epochs 3 --logging_steps 50 --save_steps 1000000 --learning_rate 6e-6 --do_eval False --fp16 True --save_total_limit 5
[2023-07-05 08:26:10,601] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-07-05 08:26:10,601] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-07-05 08:26:10,601] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-07-05 08:26:10,601] [INFO] [launch.py:247:main] dist_world_size=8
[2023-07-05 08:26:10,601] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-07-05 08:26:20,093] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
------------device--------:cuda:3
------------device--------:cuda:6
------------device--------:cuda:5
------------device--------:cuda:0
------------device--------:cuda:1
------------device--------:cuda:4
------------device--------:cuda:7
------------device--------:cuda:2
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
07/05/2023 08:26:21 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
07/05/2023 08:26:21 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True

export CUDA_VISIBLE_DEVICES=0,1,2 does not seem to take effect.
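
That matches the warning at the top of the launch log: because one of --include/--exclude/--num_gpus/--num_nodes is passed to deepspeed, CUDA_VISIBLE_DEVICES is ignored and eight ranks are spawned (dist_world_size=8). Each rank then calls torch.cuda.set_device on its own cuda:<local_rank>, so any rank whose index is beyond the last real GPU fails exactly as in the traceback. A minimal illustration (a hypothetical one-liner, not from the repo):

python -c "import torch; torch.cuda.set_device(torch.cuda.device_count())"
# raises RuntimeError: CUDA error: invalid device ordinal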

Change the number of GPUs in the launch command.
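
For example (a sketch only; the exact flags inside ds_train_finetune.sh may differ, and "..." stands for the training arguments already shown in the log above), any of the following keeps the launcher to the three usable GPUs:

# Option 1: ask deepspeed for exactly three GPUs
deepspeed --num_gpus 3 --master_port 8888 main.py --deepspeed deepspeed.json ...

# Option 2: pin the specific devices
deepspeed --include localhost:0,1,2 --master_port 8888 main.py --deepspeed deepspeed.json ...

# Option 3: drop --num_gpus/--include/--exclude so CUDA_VISIBLE_DEVICES is honored
CUDA_VISIBLE_DEVICES=0,1,2 deepspeed main.py --deepspeed deepspeed.json ...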