Propagate "Invalid address" to NCCL communicator
vmarkovtsev opened this issue
We are using EFA to run asynchronous ncclAllReduce over 120 ranks. Every few hours, the operation breaks. We poll for NCCL errors by first checking the status of the stream the operation was scheduled on; if it equals cudaErrorNotReady, we then check the communicator for errors by calling ncclCommGetAsyncError. Under circumstances that are not yet fully understood (we are investigating together with Amazon support), we see the following log with NCCL_DEBUG=TRACE, FI_LOG_LEVEL=1:
libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
8[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 00/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 01/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 02/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 03/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
ip-172-31-69-127:451707:453422 [4] 222743.762603 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453422 [4] 222746.623005 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453422 [4] 222767.773259 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453422 [4] 222780.007305 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453373 [4] NCCL INFO Connected all trees
ip-172-31-69-127:451707:453373 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
ip-172-31-69-127:451707:453373 [4] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ip-172-31-69-127:451707:453373 [4] NCCL INFO comm 0x55dcf7b0a630 rank 0 nranks 12 cudaDev 4 nvmlDev 4 busId 97000 commId 0xe25512ad6c40b2e2 - Init COMPLETE
ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453404 [4] 235242.807784 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453404 [4] 235245.883529 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
ip-172-31-69-127:451707:453391 [4] 294782.055802 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:453391 [4] 294785.363301 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae859a0 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }
ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae86e00 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }
After that, ncclCommGetAsyncError does not report any error for the NCCL communicator, so we hang in our error polling loop. It would be very useful if the plugin propagated that error to the communicator so that we could detect the failure and recreate the communicator.
The same code correctly handles errors on other, non-Amazon clusters with InfiniBand. When something goes wrong with InfiniBand, we handle the errors polled from the communicator and restart.
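For reference, the polling loop described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the enums and the poll_collective function are hypothetical stand-ins so that the snippet compiles without CUDA or NCCL, and the two callables are assumed to wrap cudaStreamQuery and ncclCommGetAsyncError respectively. The deadline is an addition that is not part of the loop described in this issue; it is shown only as one possible way to avoid hanging forever while the plugin does not propagate the error.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for the real status codes (assumption):
//   StreamStatus mirrors cudaSuccess / cudaErrorNotReady / any other cudaError_t,
//   CommStatus mirrors the async error reported by ncclCommGetAsyncError.
enum class StreamStatus { Done, NotReady, Failed };
enum class CommStatus { Ok, Error };
enum class PollResult { Completed, StreamError, CommError, TimedOut };

// Polls until the collective finishes, an error is reported, or the deadline passes.
// In real code, query_stream would wrap cudaStreamQuery(stream) and
// query_comm would wrap ncclCommGetAsyncError(comm, &asyncError).
PollResult poll_collective(
    const std::function<StreamStatus()>& query_stream,
    const std::function<CommStatus()>& query_comm,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        // Step 1: check the status of the stream the collective was scheduled on.
        switch (query_stream()) {
            case StreamStatus::Done:
                return PollResult::Completed;
            case StreamStatus::Failed:
                return PollResult::StreamError;
            case StreamStatus::NotReady:
                break;  // still running, fall through to the communicator check
        }
        // Step 2: the stream is not ready, so ask the communicator for async errors.
        if (query_comm() == CommStatus::Error) {
            return PollResult::CommError;
        }
        // Step 3 (not in the original loop): give up after the deadline
        // instead of spinning forever when no error is ever reported.
        if (std::chrono::steady_clock::now() >= deadline) {
            return PollResult::TimedOut;
        }
        std::this_thread::sleep_for(interval);
    }
}
```

With InfiniBand, the loop exits through the CommError branch and we restart. In the EFA case from the log above, the communicator keeps reporting no error, so without the deadline the loop never exits. On TimedOut, a caller could treat the communicator as broken, call ncclCommAbort, and recreate it, although choosing a safe timeout for a 661504-byte send across many ranks is workload-specific.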