aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

Propagate "Invalid address" to NCCL communicator

vmarkovtsev opened this issue

We are using EFA to do asynchronous ncclAllReduce over 120 ranks. Every once in a while, roughly every few hours, the operation breaks. We poll for NCCL errors by first checking the status of the stream the operation was scheduled on and, if it equals cudaErrorNotReady, then checking the communicator for errors by calling ncclCommGetAsyncError. Under circumstances that are not yet fully understood (we are investigating together with Amazon support), we see the following log with NCCL_DEBUG=TRACE and FI_LOG_LEVEL=1:

    libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
    libfabric:451707:1708520807::efa:cq:efa_rdm_txe_handle_error():737<warn> err: 5, message: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c (7)
     8[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 00/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 01/0 : 0[4] -> 8[4] [send] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 02/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Channel 03/0 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA
    ip-172-31-69-127:451707:453422 [4] 222743.762603 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222746.623005 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222767.773259 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453422 [4] 222780.007305 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453373 [4] NCCL INFO Connected all trees
    ip-172-31-69-127:451707:453373 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512
    ip-172-31-69-127:451707:453373 [4] NCCL INFO 4 coll channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
    ip-172-31-69-127:451707:453373 [4] NCCL INFO comm 0x55dcf7b0a630 rank 0 nranks 12 cudaDev 4 nvmlDev 4 busId 97000 commId 0xe25512ad6c40b2e2 - Init COMPLETE
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453436 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 02/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453441 [4] NCCL INFO Channel 03/1 : 0[4] -> 1[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453404 [4] 235242.807784 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453404 [4] 235245.883529 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453490 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [send] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 02/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453518 [4] NCCL INFO Channel 03/1 : 1[4] -> 0[4] [receive] via NET/AWS Libfabric/4/GDRDMA/Shared
    ip-172-31-69-127:451707:453391 [4] 294782.055802 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
    ip-172-31-69-127:451707:453391 [4] 294785.363301 alloc_and_reg_flush_buff:3332 NCCL TRACE NET/OFI Registering buffer for flush operations
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae859a0 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }
  
    ip-172-31-69-127:451707:452061 [4] ofi_process_cq:1740 NCCL WARN NET/OFI Request 0x7fec8ae86e00 completed with error. RC: 5. Error: Invalid address My EFA addr: fi_addr_efa://[fe80::13:c2ff:fe6f:f03d]:0:1072483550 My host id: i-0db7083f3fcbebb65 Peer EFA addr: fi_addr_efa://[fe80::b8:aff:fe44:425]:0:947699476 Peer host id: i-0ce6d3d57139f053c. Completed length: 0, Request: { dev: 4, size: 661504, state: CREATED, type: SEND }

After this, ncclCommGetAsyncError reports no error for the communicator, so we hang in our error polling loop. It would be very useful if the plugin propagated that error to the communicator so that we could detect the failure and recreate the communicator.

The same code correctly handles errors in other non-Amazon clusters with InfiniBand. When something bad happens to InfiniBand, we handle the errors polled from the communicator and restart.
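
For reference, the polling scheme described above can be sketched roughly as follows. This is a simplified illustration rather than our actual code: `query_stream` and `get_async_error` are hypothetical callables standing in for wrappers around cudaStreamQuery and ncclCommGetAsyncError, and the string status values are placeholders for the real enum values. The deadline is an assumption added only to make the hang visible. With the EFA failure above, the stream stays "not ready" and the communicator keeps reporting no error, so without a deadline the loop never exits.

```python
import time
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"        # the stream finished the collective
    STREAM_ERROR = "stream_error"  # the stream reported something other than "not ready"
    COMM_ERROR = "comm_error"      # the communicator reported an asynchronous error
    TIMED_OUT = "timed_out"        # neither of the above happened before the deadline


# Placeholder status values (assumptions for this sketch). Real bindings would
# return cudaSuccess / cudaErrorNotReady and ncclSuccess instead.
STREAM_READY = "ready"
STREAM_NOT_READY = "not_ready"
COMM_OK = "ok"


def poll_collective(query_stream, get_async_error, timeout_s,
                    interval_s=0.01, clock=time.monotonic, sleep=time.sleep):
    """Poll a stream and a communicator until completion, error, or timeout.

    query_stream:    callable returning the stream status (hypothetical wrapper
                     around cudaStreamQuery).
    get_async_error: callable returning the communicator status (hypothetical
                     wrapper around ncclCommGetAsyncError).
    """
    deadline = clock() + timeout_s
    while True:
        status = query_stream()
        if status == STREAM_READY:
            return Outcome.COMPLETED
        if status != STREAM_NOT_READY:
            return Outcome.STREAM_ERROR
        # The stream is still busy: ask the communicator whether it failed.
        if get_async_error() != COMM_OK:
            return Outcome.COMM_ERROR
        # No error surfaced. Without this check the loop would spin forever
        # when the network plugin swallows the failure.
        if clock() >= deadline:
            return Outcome.TIMED_OUT
        sleep(interval_s)
```

On the InfiniBand clusters the loop ends with the equivalent of `Outcome.COMM_ERROR`, at which point we tear down and recreate the communicator. On EFA only a watchdog such as the `TIMED_OUT` branch would end it, which is why we would prefer the plugin to surface the error.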