“No backend type associated with device type cpu” when running cli_demo_sat.py

Question

yileld opened this issue 4 months ago

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
    mp_split_model_receive(model, use_node_group=use_node_group)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
    iter_repartition(model)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
    torch.distributed.recv(sub_module.weight.data, src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM

This used to run, but now it fails again. Has sat been updated?
Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2

Qingsong Lv · Answer 1 · Tue Mar 05 2024 14:45:43 GMT+0800 (China Standard Time)

If you want to run on the CPU, make sure CUDA_VISIBLE_DEVICES is set to an empty value.
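
A minimal sketch of that advice in Python, assuming the variable is set before CUDA is initialized (the safest place is before `torch` or `sat` is imported). The helper name `hide_gpus_for_cpu_run` is hypothetical and is not part of sat or CogVLM.

```python
import os


def hide_gpus_for_cpu_run() -> None:
    """Hide all CUDA devices from this process so that it runs on the CPU."""
    # An empty CUDA_VISIBLE_DEVICES exposes no GPUs to the CUDA runtime.
    # Set it before CUDA is initialized, ideally before importing torch.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus_for_cpu_run()
# Import torch / sat only after this point, for example:
# import torch
```

The same effect can be had from the shell by prefixing the launch command with `CUDA_VISIBLE_DEVICES=` (empty value).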

ylying · Answer 2 · Tue Mar 05 2024 14:48:43 GMT+0800 (China Standard Time)

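> If you want to run on the CPU, make sure CUDA_VISIBLE_DEVICES is set to an empty value.

I do want to run on the GPU, but I am using quant 8, so AutoModel.from_pretrained() starts out on the CPU, right?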

Qingsong Lv · Answer 3 · Tue Mar 05 2024 14:52:23 GMT+0800 (China Standard Time)

quant 8 does not currently support overwrite_args={'model_parallel_size'}

ylying · Answer 4 · Tue Mar 05 2024 14:57:18 GMT+0800 (China Standard Time)

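> quant 8 does not currently support overwrite_args={'model_parallel_size'}

Then I must have remembered wrong... So quant does not currently support multi-GPU inference, is that right?

Also, when I switch to bf16 I get this error: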

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 368, in from_pretrained
    reset_random_seed(6)
  File "/usr/local/lib/python3.10/dist-packages/sat/arguments.py", line 572, in reset_random_seed
    assert _GLOBAL_RANDOM_SEED is not None, "You have not set random seed. No need to reset it."
AssertionError: You have not set random seed. No need to reset it.

Qingsong Lv · Answer 5 · Tue Mar 05 2024 15:00:05 GMT+0800 (China Standard Time)

Yes. I don't know how to split the quantized state evenly across different GPUs either; it depends on the quantization algorithm.
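
The constraint discussed in this thread (quantized loading cannot be combined with a model_parallel_size override) could be checked up front before calling AutoModel.from_pretrained(). The following is only a sketch under assumptions: the helper `build_overwrite_args` and its parameters `world_size` and `quant` are hypothetical names, not part of sat; only the `model_parallel_size` key comes from the discussion above.

```python
from typing import Optional


def build_overwrite_args(world_size: int, quant: Optional[int]) -> dict:
    """Build the overwrite dict for model loading, rejecting quant + multi-GPU.

    world_size: number of model-parallel processes (hypothetical parameter).
    quant: quantization bits such as 8, or None for bf16/fp16 (hypothetical).
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if quant is not None and world_size > 1:
        # Per the maintainer, quantized weights cannot currently be
        # repartitioned across GPUs, so fail early with a clear message.
        raise ValueError(
            f"quant={quant} cannot be combined with model_parallel_size={world_size}; "
            "use a single GPU or disable quantization"
        )
    # Only override model_parallel_size when more than one process is used.
    return {"model_parallel_size": world_size} if world_size > 1 else {}


print(build_overwrite_args(1, 8))     # single GPU with quantization
print(build_overwrite_args(2, None))  # two GPUs without quantization
```

Failing early like this replaces the low-level "No backend type associated with device type cpu" error with a message that names the unsupported combination.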