“No backend type associated with device type cpu” when running cli_demo_sat.py
yileld opened this issue
Traceback (most recent call last):
File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
main()
File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
model, model_args = AutoModel.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
mp_split_model_receive(model, use_node_group=use_node_group)
File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
iter_repartition(model)
File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
iter_repartition(sub_module)
File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
iter_repartition(sub_module)
File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
torch.distributed.recv(sub_module.weight.data, src)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM
This used to run, but now it fails again. Has sat been updated again?
Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2
If you want to run on CPU, make sure CUDA_VISIBLE_DEVICES is set to an empty value.
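A minimal sketch of the suggestion above, assuming the variable is set from inside Python rather than in the shell. The helper name `hide_gpus` is hypothetical. The variable has to be set before anything initializes CUDA, which in practice means before `import torch`.

```python
import os

def hide_gpus(env=os.environ):
    """Make CUDA devices invisible to libraries loaded afterwards (hypothetical helper)."""
    # An empty value hides every GPU; set it before CUDA is initialized.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env

hide_gpus()
# import torch  # import only after the variable has been set
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))  # ''
```

Setting the variable in the shell that launches the script has the same effect and avoids any import-order concerns.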
I do want to run on GPU, but I am using quant 8, so I assume AutoModel.from_pretrained() initially loads the model on CPU?
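One plausible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the process group was created with the NCCL backend only, which handles CUDA tensors, while `torch.distributed.recv` was handed a weight that was still on the CPU. The sketch below models that lookup in plain Python. The table `BACKEND_DEVICES` and the helper `backend_for` are illustrative, not PyTorch internals.

```python
# Simplified, illustrative table: the device types each backend serves in this sketch.
# (Real gloo also handles some CUDA collectives; that is left out here.)
BACKEND_DEVICES = {
    "nccl": {"cuda"},
    "gloo": {"cpu"},
}

def backend_for(device_type, backends):
    """Return the first backend in `backends` that serves `device_type`."""
    for name in backends:
        if device_type in BACKEND_DEVICES[name]:
            return name
    raise RuntimeError(f"No backend type associated with device type {device_type}")

print(backend_for("cuda", ["nccl"]))         # nccl
print(backend_for("cpu", ["gloo", "nccl"]))  # gloo
try:
    backend_for("cpu", ["nccl"])             # CPU tensor, NCCL-only group
except RuntimeError as err:
    print(err)  # No backend type associated with device type cpu
```

In real PyTorch, the usual ways out would be to move the tensor to the GPU before the call or to initialize the group with a CPU-capable backend as well. Whether either applies inside sat's loader is not established in this thread.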
quant 8 does not currently support overwrite_args={'model_parallel_size'}
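A minimal sketch of a guard for this limitation, assuming the calling script builds the overwrite dictionary from the world size and a quantization setting. The helper `build_overwrite_args` and its parameters are hypothetical and not part of sat.

```python
def build_overwrite_args(world_size, quant=None):
    """Build the overwrite dict, refusing quantization combined with model parallelism."""
    if quant is not None and world_size > 1:
        raise ValueError(
            f"quant {quant} cannot be combined with model_parallel_size={world_size}"
        )
    # Only override the parallel size when more than one process is used.
    return {"model_parallel_size": world_size} if world_size > 1 else {}

print(build_overwrite_args(1, quant=8))  # {}
print(build_overwrite_args(2))           # {'model_parallel_size': 2}
```

Failing early like this gives a readable message in place of the distributed-communication error shown in the traceback.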
Then maybe I remembered wrong... So quant currently does not support multi-GPU inference, right?
Also, when I switch to bf16 I get this error:
Traceback (most recent call last):
File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
main()
File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
model, model_args = AutoModel.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 368, in from_pretrained
reset_random_seed(6)
File "/usr/local/lib/python3.10/dist-packages/sat/arguments.py", line 572, in reset_random_seed
assert _GLOBAL_RANDOM_SEED is not None, "You have not set random seed. No need to reset it."
AssertionError: You have not set random seed. No need to reset it.
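Yes, because I do not know how to split the quantized state evenly across different GPUs either... it depends on the quantization algorithm.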