THUDM / CogDL

CogDL: A Comprehensive Library for Graph Deep Learning (WWW 2023)

Home Page: https://cogdl.ai

Training on multiple GPUs across a cluster of multiple machines

alan890104 opened this issue · comments

❓ Questions & Help

I believe this package has great potential, but I am currently facing an issue: I cannot train with multiple GPUs across a cluster of multiple machines. Can you please advise on how to properly configure this?

# On master machine 1, with 1 GPU
python3 main.py --master-addr <myip> --master-port <myport> --local_rank 0 --dataset cora citeseer --model gcn gat --distributed --devices 0 1
# On slave machine 2, with 1 GPU
python3 main.py --master-addr <myip> --master-port <myport> --local_rank 1 --dataset cora citeseer --model gcn gat --distributed --devices 0 1

Setting local_rank, master-addr, and master-port did not work for me, so I tried to track down the reason. I found that the Trainer checks device_count < self.world_size; however, device_count will always be 1 in my case because torch.cuda.device_count() only returns the number of GPUs on the local machine. As a result, the world size is silently replaced by the local GPU count (sketched below).
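To illustrate the behaviour described above, here is a minimal sketch. It is not the actual CogDL source, and the helper names are made up for illustration; it only shows why a guard based on torch.cuda.device_count() shrinks the world size in a multi-node setup.

import torch

def resolve_world_size(requested_world_size):
    """Mimic a guard of the form `device_count < self.world_size`."""
    device_count = torch.cuda.device_count()  # counts only the GPUs on this machine
    if device_count < requested_world_size:
        # Reasonable on a single machine, but on two machines with 1 GPU each
        # this silently shrinks a requested world size of 2 down to 1.
        return device_count
    return requested_world_size

# A multi-node run instead derives the world size from the cluster layout,
# e.g. number of nodes * GPUs per node (or from the WORLD_SIZE environment
# variable exported by the launcher):
def expected_world_size(num_nodes, gpus_per_node):
    return num_nodes * gpus_per_node

print(resolve_world_size(2))      # -> 1 on a machine with a single GPU
print(expected_world_size(2, 1))  # -> 2 for two machines with one GPU each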

I believe there may be a bug here; could you help clarify? If it is indeed a bug, how can I help fix it?

(screenshot: the device_count < self.world_size check in Trainer)

Hi @alan890104,

Thanks for your interest in CogDL.
We have only tested multi-GPU training on a single machine, since we believe this covers most use cases.
Training across multiple machines is not currently supported.
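For anyone who needs multi-node training in the meantime, the usual workaround at the plain PyTorch level is to let a launcher such as torchrun export the rank and world-size environment variables and initialize the process group from them. The sketch below is generic torch.distributed usage, not a CogDL API; train.py and the omitted model/data wiring are placeholders.

# Launch the same script on every machine, e.g. for two machines with 1 GPU each:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
#            --master_addr=<myip> --master_port=<myport> train.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT;
    # the default "env://" init method reads them automatically.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on local GPU {local_rank}")
    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel,
    # and run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()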