THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Home Page: https://THUDM.github.io/SwissArmyTransformer


“No backend type associated with device type cpu” when run cli_demo_sat.py

yileld opened this issue

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
    mp_split_model_receive(model, use_node_group=use_node_group)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
    iter_repartition(model)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
    torch.distributed.recv(sub_module.weight.data, src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM
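The error above occurs because `torch.distributed.recv` is called on a CPU tensor while only the NCCL backend is initialized, and NCCL handles CUDA tensors exclusively. A minimal single-process sketch (the address/port values are illustrative) showing that the `gloo` backend can operate on CPU tensors where an NCCL-only process group cannot:

```python
import os
import torch
import torch.distributed as dist

# Illustrative rendezvous settings for a single-process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo supports CPU tensors; an NCCL-only group would raise
# "No backend type associated with device type cpu" here.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(4)      # CPU tensor
dist.all_reduce(t)     # works under gloo; sum over 1 rank leaves it unchanged
print(t.tolist())      # → [1.0, 1.0, 1.0, 1.0]

dist.destroy_process_group()
```

On recent PyTorch versions, initializing with a device-to-backend map such as `backend="cpu:gloo,cuda:nccl"` lets one process group route CPU tensors to gloo and CUDA tensors to NCCL.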

This used to run fine, but now it doesn't. Has sat been updated again?
Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2

If you want to run on CPU, please make sure CUDA_VISIBLE_DEVICES is empty.
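A minimal sketch of that advice: hiding all GPUs via `CUDA_VISIBLE_DEVICES` *before* torch (and sat) are imported keeps everything on the CPU, so no NCCL communication is attempted.

```python
import os

# Must be set before `import torch`, or the CUDA runtime may already
# have enumerated the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# With no visible devices, CUDA is unavailable and model loading
# (e.g. sat's AutoModel.from_pretrained) stays on the CPU.
print(torch.cuda.is_available())  # → False
```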


I do want to run on GPU, but with quant 8, so AutoModel.from_pretrained() starts on the CPU at first, right?

quant 8 does not currently support overwrite_args={'model_parallel_size'}.


Then maybe I misremembered... So currently quant does not support multi-GPU inference, right?

Also, when I switch to bf16 I get an error:

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 368, in from_pretrained
    reset_random_seed(6)
  File "/usr/local/lib/python3.10/dist-packages/sat/arguments.py", line 572, in reset_random_seed
    assert _GLOBAL_RANDOM_SEED is not None, "You have not set random seed. No need to reset it."
AssertionError: You have not set random seed. No need to reset it.

Yes, because I don't know how to evenly split the quantized state across different GPUs either... it depends on the quantization algorithm.