THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Home Page: https://THUDM.github.io/SwissArmyTransformer


sat ValueError "inconsistent" during DeepSpeed distributed training

elesun2018 opened this issue · comments

The following error occurs when running multi-node, multi-GPU distributed training with a DeepSpeed hostfile:
```
Traceback (most recent call last):
worker0: File "finetune_XrayGLM.py", line 173, in <module>
worker0: args = get_args(args_list)
worker0: File "/home/sfz/soft/miniconda3/envs/test/lib/python3.8/site-packages/sat/arguments.py", line 360, in get_args
worker0: raise ValueError(
worker0: ValueError: LOCAL_RANK (default 0) and args.device inconsistent. This can only happens in inference mode. Please use CUDA_VISIBLE_DEVICES=x for single-GPU training.
worker0: [2023-12-14 14:49:37,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 9305
worker0: [2023-12-14 14:49:37,663] [ERROR] [launch.py:321:sigkill_handler] ['/home/sfz/soft/miniconda3/envs/test/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-CityGLM', '--model-parallel-size', '2', '--mode', 'finetune', '--train-iters', '10000', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/changjing9/data.json', '--valid-data', './data/changjing9/data.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '2000', '--eval-interval', '2000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '4', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '6', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
```
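For context, the traceback shows the ValueError being raised in `get_args` (sat/arguments.py, line 360), where sat compares the LOCAL_RANK environment variable that the DeepSpeed launcher exports for each worker subprocess against the device sat has resolved for training. The following is a minimal sketch of that kind of check, not the actual sat code; the function name and structure are assumptions for illustration:

```python
import os

import torch

def check_rank_device_consistency(device: int) -> None:
    """Minimal sketch (not the actual sat code) of the kind of check that
    raises the error above: the DeepSpeed launcher exports LOCAL_RANK for
    every worker process, and it has to agree with the CUDA device index
    the training code is about to use."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by the launcher
    if device != local_rank:
        raise ValueError(
            f"LOCAL_RANK ({local_rank}) and args.device ({device}) inconsistent."
        )

# Hypothetical diagnostic: run this on each worker to confirm that the
# launcher-provided rank and the visible GPUs line up on every node.
if __name__ == "__main__":
    print("LOCAL_RANK =", os.environ.get("LOCAL_RANK"),
          "| CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"),
          "| visible GPUs =", torch.cuda.device_count())
```

As the error message itself suggests, restricting visibility with CUDA_VISIBLE_DEVICES is the recommended way to pin a single-GPU run to one device; for multi-node hostfile training, verifying that every worker sees the same GPU layout is a reasonable first diagnostic.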

Questions about XrayGLM need to be resolved in the XrayGLM repository, since we don't really know how its code is written either…