NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
mackmake opened this issue
Hi,
I started training on two nodes using the 125M.yml config file, changing only the directories for the data and tokenizer files. I also added my own hostfile. During training I now get this error:
NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
I ran the code again with NCCL_DEBUG=INFO and got this:
node2: Last error:
node2: Net : Call to recv from NODE2_IP<56843> failed : Connection refused
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)
node1: node1:15874:17284 [4] NCCL INFO Channel 00 : 4[61000] -> 3[42000] via SHM/direct/direct
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)
node1: node1:15874:17284 [4] NCCL INFO Channel 01 : 4[61000] -> 3[42000] via SHM/direct/direct
node1: node1:15876:17282 [5] NCCL INFO Connected all trees
node1: node1:15876:17282 [5] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
node1: node1:15876:17282 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
node1: node1:15874:17284 [4] NCCL INFO Connected all trees
node1: node1:15874:17284 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
node1: node1:15874:17284 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
node1: node1:15872:17277 [3] NCCL INFO Connected all trees
node1: node1:15872:17277 [3] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
node1: node1:15872:17277 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0
What might be the problem, and how can I solve it?
What happens when you run with NCCL_IGNORE_DISABLED_P2P=1 set? Does it crash, or does it run less efficiently than one would desire?
If I remember correctly, it crashed when I tested with NCCL_IGNORE_DISABLED_P2P. I have since stopped using the multi-node approach.
NCCL_IGNORE_DISABLED_P2P=1 just disables the warning message. I think NCCL_P2P_DISABLE=1 is what you'd need?
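For anyone trying this, here is a minimal sketch of building an environment with both variables set before starting the training launcher. The helper name nccl_debug_env is made up for illustration and is not part of NCCL, PyTorch, or this repo. Whether the variables actually reach the processes on the second node depends on how your launcher forwards the environment, so treat this as a starting point rather than a confirmed fix.

```python
import os


def nccl_debug_env(base=None, disable_p2p=True):
    """Return a copy of an environment dict with NCCL settings applied.

    Hypothetical helper for illustration only.
    base: mapping to copy from; defaults to the current process environment.
    disable_p2p: if True, set NCCL_P2P_DISABLE=1 (turns off the GPU
                 peer-to-peer transport instead of only hiding the warning).
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # verbose NCCL logging, as used earlier in the thread
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    env = nccl_debug_env()
    # Show the values that would be handed to the launcher process.
    for key in ("NCCL_DEBUG", "NCCL_P2P_DISABLE"):
        print(f"{key}={env[key]}")
```

The returned dict can be passed as the env argument of subprocess.run when starting your usual launch command, or you can simply export the same two variables in the shell on every node.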
@mackmake -- Please reopen if this doesn't resolve your issue!