BaguaSys / bagua

Bagua Speeds up PyTorch

Home Page: https://tutorials-8ro.pages.dev/

errors with DecentralizedAlgorithm in shift_one mode

ProHuper opened this issue

I used DecentralizedAlgorithm with peer_selection_mode set to shift_one on 8 GPUs. The Bagua backend reports that I have an odd number of ranks (only one), but the NCCL log shows that this job does have 8 GPUs. Does n_ranks here mean the number of nodes or the number of GPUs?

[Image: NCCL log screenshot]

nranks means the number of ranks in the NCCL communicator.

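The decentralized algorithm enables hierarchical communication by default, which means only inter-node decentralized communication is performed, with an intra-node allreduce before it and an intra-node broadcast after it. On a single 8-GPU node that would leave just one rank in the decentralized communicator, which is why the backend complains about an odd number of ranks. To try it out on 8 GPUs, set hierarchical=False. See API for details.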
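To make the rank counting concrete, here is a minimal sketch in plain Python. It is only an illustrative model of the explanation above, not Bagua code: the helper names `decentralized_nranks` and `shift_one_ok` are hypothetical, and it assumes one rank per node takes part in the inter-node step when hierarchical mode is on.

```python
def decentralized_nranks(num_nodes: int, gpus_per_node: int, hierarchical: bool) -> int:
    """Ranks taking part in the decentralized exchange (illustrative model only)."""
    if hierarchical:
        # Intra-node allreduce runs first, so (by assumption) only one rank
        # per node joins the inter-node decentralized communication.
        return num_nodes
    # Without the hierarchy, every GPU is its own rank in the communicator.
    return num_nodes * gpus_per_node


def shift_one_ok(nranks: int) -> bool:
    """shift_one is reported to need an even number of ranks."""
    return nranks % 2 == 0


if __name__ == "__main__":
    # The setup from this issue: a single node with 8 GPUs.
    for hierarchical in (True, False):
        n = decentralized_nranks(num_nodes=1, gpus_per_node=8, hierarchical=hierarchical)
        print(f"hierarchical={hierarchical}: nranks={n}, shift_one ok={shift_one_ok(n)}")
```

In the real job this corresponds to constructing the algorithm with `hierarchical=False` together with `peer_selection_mode="shift_one"`, so that all 8 GPUs count as ranks.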

got it!