[BUG] Cannot train on Windows
unlimit999 opened this issue
unlimit999 commented
Looking at the error, the cause is that NCCL is not supported on Windows. Could an option be provided, or the operating system detected automatically, so that training falls back to gloo?
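A minimal sketch of the requested auto-detection, assuming the project picks the backend in one place before handing it to Lightning (`default_backend` is a hypothetical helper name, not from this repo):

```python
import sys

import torch.distributed as dist


def default_backend() -> str:
    """Pick a distributed backend the current platform actually supports."""
    # NCCL is Linux-only; Windows builds of PyTorch ship with gloo.
    if sys.platform == "win32" or not dist.is_nccl_available():
        return "gloo"
    return "nccl"
```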
unlimit999 commented
A temporary workaround I found:

In `.\fishenv\Lib\site-packages\lightning\fabric\utilities\distributed.py`, find this function:
```python
def _init_dist_connection(
    cluster_environment: "ClusterEnvironment",
    torch_distributed_backend: str,
    global_rank: Optional[int] = None,
    world_size: Optional[int] = None,
    **kwargs: Any,
) -> None:
    """Utility function to initialize distributed connection by setting env variables and initializing the distributed
    process group.

    Args:
        cluster_environment: ``ClusterEnvironment`` instance
        torch_distributed_backend: Backend to use (includes `nccl` and `gloo`)
        global_rank: Rank of the current process
        world_size: Number of processes in the group
        kwargs: Kwargs for ``init_process_group``

    Raises:
        RuntimeError:
            If ``torch.distributed`` is not available

    """
    if not torch.distributed.is_available():
        raise RuntimeError("torch.distributed is not available. Cannot initialize distributed process group")
    if torch.distributed.is_initialized():
        log.debug("torch.distributed is already initialized. Exiting early")
        return
    global_rank = global_rank if global_rank is not None else cluster_environment.global_rank()
    world_size = world_size if world_size is not None else cluster_environment.world_size()
    os.environ["MASTER_ADDR"] = cluster_environment.main_address
    os.environ["MASTER_PORT"] = str(cluster_environment.main_port)
    log.info(f"Initializing distributed: GLOBAL_RANK: {global_rank}, MEMBER: {global_rank + 1}/{world_size}")
    # torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
    torch.distributed.init_process_group(backend=torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)

    # On rank=0 let everyone know training is starting
    rank_zero_info(
        f"{'-' * 100}\n"
        f"distributed_backend={torch_distributed_backend}\n"
        f"All distributed processes registered. Starting with {world_size} processes\n"
        f"{'-' * 100}\n"
    )
```
In the line `torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)`, change the first argument from the positional `torch_distributed_backend` to the keyword argument `backend=torch_distributed_backend`, and training then works normally.
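For reference, a minimal single-process sanity check that the gloo backend initializes on Windows (the address and port are placeholder values, not from this thread):

```python
import os

import torch.distributed as dist

# The default env:// rendezvous needs these set; values here are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())  # -> "gloo"
dist.destroy_process_group()
```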
Leng Yue commented
This is already supported: you can change `backend` to `gloo` in the config file, or override it from the command line.
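The exact key depends on the project's config layout, but in plain Lightning code the equivalent override looks like this (a sketch, not the project's actual training script):

```python
from lightning import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Force the gloo process-group backend instead of the nccl default.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),
)
```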
xumason commented
Do you mean changing every `backend` in this function to `gloo`?