aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.

bad performance on two p4 instances (nccl-v2.7.8, aws-ofi-nccl-v1.1.2)

zarzen opened this issue · comments

Hi there,
I have installed the plugin with nccl-v2.7.8 and CUDA 11, but nccl-tests does not give reasonable results (I expected to see bus bandwidth over 40 GB/s). One thing I noticed is that all the NCCL network channels use via NET/AWS Libfabric/0.
Here is the full log, including the NCCL_DEBUG=INFO output.

Any suggestions for configuring the plugin or NCCL?
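
One quick way to see which transport each channel uses is to tally the text after "via" on the NCCL "Channel" lines. The following is a minimal sketch, not part of NCCL or the plugin; the function name summarize_transports is hypothetical, and it assumes the log lines look like the ones pasted in this issue.

```shell
# Tally the transport named after "via" on each NCCL "Channel NN :" line.
# Reads an NCCL_DEBUG=INFO log from stdin and prints "<count> <transport>".
summarize_transports() {
    grep -E 'NCCL INFO Channel [0-9]+ :' \
        | sed -E 's/.* via //' \
        | sort | uniq -c \
        | awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count, $0 }'
}

# Example with three lines copied from the log in this issue.
summarize_transports <<'EOF'
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 4[901c0] via P2P/IPC/read
EOF
```

With a saved log, the same function could be fed the file on stdin, for example `summarize_transports < nccl.log` (the file name is a placeholder).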

# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   9738 on ip-172-31-13-103 device  0 [0x10] A100-SXM4-40GB
#   Rank  1 Pid   9739 on ip-172-31-13-103 device  1 [0x10] A100-SXM4-40GB
#   Rank  2 Pid   9740 on ip-172-31-13-103 device  2 [0x20] A100-SXM4-40GB
#   Rank  3 Pid   9741 on ip-172-31-13-103 device  3 [0x20] A100-SXM4-40GB
#   Rank  4 Pid   9742 on ip-172-31-13-103 device  4 [0x90] A100-SXM4-40GB
#   Rank  5 Pid   9743 on ip-172-31-13-103 device  5 [0x90] A100-SXM4-40GB
#   Rank  6 Pid   9744 on ip-172-31-13-103 device  6 [0xa0] A100-SXM4-40GB
#   Rank  7 Pid   9748 on ip-172-31-13-103 device  7 [0xa0] A100-SXM4-40GB
#   Rank  8 Pid   9921 on ip-172-31-6-104 device  0 [0x10] A100-SXM4-40GB
#   Rank  9 Pid   9922 on ip-172-31-6-104 device  1 [0x10] A100-SXM4-40GB
#   Rank 10 Pid   9923 on ip-172-31-6-104 device  2 [0x20] A100-SXM4-40GB
#   Rank 11 Pid   9924 on ip-172-31-6-104 device  3 [0x20] A100-SXM4-40GB
#   Rank 12 Pid   9925 on ip-172-31-6-104 device  4 [0x90] A100-SXM4-40GB
#   Rank 13 Pid   9926 on ip-172-31-6-104 device  5 [0x90] A100-SXM4-40GB
#   Rank 14 Pid   9927 on ip-172-31-6-104 device  6 [0xa0] A100-SXM4-40GB
#   Rank 15 Pid   9928 on ip-172-31-6-104 device  7 [0xa0] A100-SXM4-40GB
ip-172-31-13-103:9738:9738 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9738:9738 [0] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9922:9922 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9923:9923 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9926:9926 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9924:9924 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9921:9921 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9927:9927 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9925:9925 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9928:9928 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9922:9922 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9923:9923 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9926:9926 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9924:9924 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9928:9928 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9925:9925 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9927:9927 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9921:9921 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-13-103:9744:9744 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9744:9744 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9748:9748 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9740:9740 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9739:9739 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9741:9741 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9743:9743 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9742:9742 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9740:9740 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9748:9748 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9739:9739 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9741:9741 [3] NCCL INFO Using network AWS Libfabric

ip-172-31-13-103:9742:9800 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9738:9797 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9923:9981 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9927:9985 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9921:9986 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9739:9804 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9744:9798 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9748:9801 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9741:9803 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9743:9799 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9740:9802 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9928:9984 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9926:9982 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9925:9987 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9924:9983 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9922:9980 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order

ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ip-172-31-13-103:9738:9797 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9738:9797 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->9|9->0->1/-1/-1
ip-172-31-13-103:9738:9797 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-6-104:9925:9987 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9925:9987 [4] NCCL INFO Trees [0] 13/-1/-1->12->11|11->12->13/-1/-1 [1] 13/-1/-1->12->11|11->12->13/-1/-1
ip-172-31-6-104:9926:9982 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9926:9982 [5] NCCL INFO Trees [0] 14/-1/-1->13->12|12->13->14/-1/-1 [1] 14/-1/-1->13->12|12->13->14/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9739:9804 [1] NCCL INFO Trees [0] 2/8/-1->1->0|0->1->2/8/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9924:9983 [3] NCCL INFO Trees [0] 12/-1/-1->11->10|10->11->12/-1/-1 [1] 12/-1/-1->11->10|10->11->12/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9923:9981 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9923:9981 [2] NCCL INFO Trees [0] 11/-1/-1->10->9|9->10->11/-1/-1 [1] 11/-1/-1->10->9|9->10->11/-1/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9922:9980 [1] NCCL INFO Trees [0] 10/-1/-1->9->8|8->9->10/-1/-1 [1] 10/0/-1->9->8|8->9->10/0/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9928:9984 [7] NCCL INFO Trees [0] -1/-1/-1->15->14|14->15->-1/-1/-1 [1] -1/-1/-1->15->14|14->15->-1/-1/-1
ip-172-31-6-104:9923:9981 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-6-104:9927:9985 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9927:9985 [6] NCCL INFO Trees [0] 15/-1/-1->14->13|13->14->15/-1/-1 [1] 15/-1/-1->14->13|13->14->15/-1/-1
ip-172-31-6-104:9927:9985 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9925:9987 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9740:9802 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9740:9802 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-13-103:9740:9802 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-13-103:9741:9803 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9741:9803 [3] NCCL INFO Trees [0] 4/-1/-1->3->2|2->3->4/-1/-1 [1] 4/-1/-1->3->2|2->3->4/-1/-1
ip-172-31-13-103:9741:9803 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9926:9982 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9742:9800 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9742:9800 [4] NCCL INFO Trees [0] 5/-1/-1->4->3|3->4->5/-1/-1 [1] 5/-1/-1->4->3|3->4->5/-1/-1
ip-172-31-13-103:9742:9800 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9743:9799 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9743:9799 [5] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/-1/-1->5->4|4->5->6/-1/-1
ip-172-31-13-103:9743:9799 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9744:9798 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9744:9798 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
ip-172-31-13-103:9744:9798 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9748:9801 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9748:9801 [7] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
ip-172-31-13-103:9748:9801 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9921:9986 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9921:9986 [0] NCCL INFO Trees [0] 9/-1/-1->8->1|1->8->9/-1/-1 [1] 9/-1/-1->8->-1|-1->8->9/-1/-1
ip-172-31-6-104:9921:9986 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9742:9800 [4] NCCL INFO comm 0x7f2ac8000dc0 rank 4 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9925:9987 [4] NCCL INFO comm 0x7f8da8000dc0 rank 12 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9743:9799 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9743:9799 [5] NCCL INFO comm 0x7f73a4000dc0 rank 5 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-6-104:9926:9982 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9926:9982 [5] NCCL INFO comm 0x7f4328000dc0 rank 13 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9928:9984 [7] NCCL INFO comm 0x7f3a98000dc0 rank 15 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9927:9985 [6] NCCL INFO comm 0x7fa984000dc0 rank 14 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9924:9983 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9924:9983 [3] NCCL INFO comm 0x7f3bb8000dc0 rank 11 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO comm 0x7f6c4c000dc0 rank 3 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9923:9981 [2] NCCL INFO comm 0x7f5a24000dc0 rank 10 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9740:9802 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9740:9802 [2] NCCL INFO comm 0x7f36c8000dc0 rank 2 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9739:9804 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO comm 0x7fed38000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9748:9801 [7] NCCL INFO comm 0x7fbd04000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9744:9798 [6] NCCL INFO comm 0x7f23b8000dc0 rank 6 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9921:9986 [0] NCCL INFO comm 0x7fdc3c000dc0 rank 8 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9922:9980 [1] NCCL INFO comm 0x7f17c8000dc0 rank 9 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9738:9797 [0] NCCL INFO comm 0x7faad8000dc0 rank 0 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-13-103:9738:9738 [0] NCCL INFO Launch mode Parallel
           8             2   float     sum    79.04    0.00    0.00  4e-07    64.55    0.00    0.00  4e-07
          16             4   float     sum    63.60    0.00    0.00  4e-07    63.44    0.00    0.00  2e-07
          32             8   float     sum    64.62    0.00    0.00  2e-07    65.10    0.00    0.00  1e-07
          64            16   float     sum    65.01    0.00    0.00  1e-07    64.24    0.00    0.00  1e-07
         128            32   float     sum    65.24    0.00    0.00  1e-07    64.54    0.00    0.00  1e-07
         256            64   float     sum    65.57    0.00    0.01  1e-07    64.45    0.00    0.01  1e-07
         512           128   float     sum    66.94    0.01    0.01  1e-07    66.50    0.01    0.01  1e-07
        1024           256   float     sum    68.47    0.01    0.03  4e-07    68.17    0.02    0.03  4e-07
        2048           512   float     sum    72.68    0.03    0.05  4e-07    72.91    0.03    0.05  4e-07
        4096          1024   float     sum    78.42    0.05    0.10  4e-07    77.77    0.05    0.10  4e-07
        8192          2048   float     sum    83.65    0.10    0.18  4e-07    81.94    0.10    0.19  4e-07
       16384          4096   float     sum    96.39    0.17    0.32  4e-07    93.38    0.18    0.33  4e-07
       32768          8192   float     sum    116.7    0.28    0.53  4e-07    114.8    0.29    0.54  4e-07
       65536         16384   float     sum    155.3    0.42    0.79  4e-07    153.4    0.43    0.80  4e-07
      131072         32768   float     sum    203.0    0.65    1.21  4e-07    203.7    0.64    1.21  4e-07
      262144         65536   float     sum    315.6    0.83    1.56  4e-07    311.2    0.84    1.58  4e-07
      524288        131072   float     sum    409.9    1.28    2.40  4e-07    407.1    1.29    2.41  4e-07
     1048576        262144   float     sum    597.2    1.76    3.29  4e-07    594.6    1.76    3.31  4e-07
     2097152        524288   float     sum    926.9    2.26    4.24  4e-07    924.9    2.27    4.25  4e-07
     4194304       1048576   float     sum   1583.5    2.65    4.97  4e-07   1584.1    2.65    4.96  4e-07
     8388608       2097152   float     sum   2939.5    2.85    5.35  4e-07   2929.9    2.86    5.37  4e-07
    16777216       4194304   float     sum   5366.1    3.13    5.86  4e-07   5381.9    3.12    5.85  4e-07
    33554432       8388608   float     sum    10305    3.26    6.10  4e-07    10294    3.26    6.11  4e-07
    67108864      16777216   float     sum    20358    3.30    6.18  4e-07    20341    3.30    6.19  4e-07
   134217728      33554432   float     sum    39328    3.41    6.40  4e-07    39392    3.41    6.39  4e-07
   268435456      67108864   float     sum    77210    3.48    6.52  4e-07    77304    3.47    6.51  4e-07
   536870912     134217728   float     sum   152989    3.51    6.58  4e-07   152798    3.51    6.59  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.32362 
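
For reference, the algbw and busbw columns can be recomputed from the size and time columns. nccl-tests defines algbw as size divided by time and, for all-reduce, busbw as algbw × 2(n−1)/n for n ranks. The sketch below only redoes that arithmetic for one row; the function name allreduce_bw is hypothetical.

```shell
# Recompute algbw and busbw for one all_reduce_perf row.
# Arguments: message size in bytes, time in microseconds, number of ranks.
allreduce_bw() {
    awk -v bytes="$1" -v usec="$2" -v ranks="$3" 'BEGIN {
        algbw = bytes / usec / 1000              # bytes/us -> GB/s (1 GB = 1e9 B)
        busbw = algbw * 2 * (ranks - 1) / ranks  # all-reduce correction factor
        printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
    }'
}

# Last out-of-place row of the table above: 536870912 B, 152989 us, 16 ranks.
allreduce_bw 536870912 152989 16
# prints: algbw=3.51 busbw=6.58
```

The result matches the 3.51 and 6.58 GB/s reported in that row, well below the 40 GB/s expected in the question.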

Hi, the bandwidth is indeed too low for p4d. Two things are worth looking into:

  1. Make sure you set the environment variable FI_EFA_USE_DEVICE_RDMA to 1, i.e. pass -x FI_EFA_USE_DEVICE_RDMA=1 to your mpirun command. You should then see [receive/send] via NET/AWS Libfabric/GDRDMA in the log.
  2. Make sure you have 4 EFA devices attached to each p4d instance. You can check this by running lspci on your instance.
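
The first check can be scripted against a saved log. This is a minimal sketch under the assumption that GPUDirect RDMA shows up as a transport string containing "GDRDMA" after "NET/AWS Libfabric/", as described above; the function name check_gdrdma is hypothetical.

```shell
# Report whether any network channel in an NCCL_DEBUG=INFO log (stdin)
# mentions GDRDMA in its "via NET/AWS Libfabric/..." transport string.
check_gdrdma() {
    if grep -q 'via NET/AWS Libfabric/.*GDRDMA'; then
        echo "GDRDMA in use"
    else
        echo "GDRDMA not in use"
    fi
}

# Example with a line copied from the log in this issue.
check_gdrdma <<'EOF'
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
EOF
# prints: GDRDMA not in use
```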

Make sure you set the environment variable FI_EFA_USE_DEVICE_RDMA to 1. This would mean pass -x FI_EFA_USE_DEVICE_RDMA=1 to your mpirun command. You should see [receive/send] via NET/AWS Libfabric/GDRDMA

You must also be sure to pass the -g option to efa_installer.sh on your host instance.

Hi @leezu
I didn't install EFA from scratch; it was installed by the Deep Learning AMI (Ubuntu 18.04, version 43.0).
Do you mean I need to reinstall it?

Hi @wzamazon
I have passed FI_EFA_USE_DEVICE_RDMA=1 to the mpirun command. Here is the launch script:

/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -H ${HOSTS} \
     -x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
     -x FI_EFA_USE_DEVICE_RDMA=1 \
     --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20

And here is the lspci log; 4 EFA NICs are attached. I have also verified this with the fi_info -p efa command.

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
10:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
10:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
10:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
10:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
20:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
20:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
80:1a.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1b.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1c.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1d.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1e.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1f.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
90:02.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
90:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
90:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
90:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:03.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
a0:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
a0:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
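
Counting the EFA adapters in this output can be automated. The sketch below assumes the adapters are listed as "Device efa0", as in the lspci output above; the function name count_efa_devices is hypothetical.

```shell
# Count EFA devices in `lspci` output read from stdin.
# Assumes each adapter appears as "Device efa0", as in the listing above.
count_efa_devices() {
    grep -cF 'Device efa0'
}

# Example with lines copied from the lspci output in this issue.
count_efa_devices <<'EOF'
10:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
10:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
20:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
90:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
a0:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
EOF
# prints: 4
```

On an instance, the live output could be piped in with `lspci | count_efa_devices`.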

fi_info -p efa log:

provider: efa
    fabric: EFA-fe80::c8:f6ff:fe4d:1df3
    domain: rdmap16s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::cc:b9ff:fe43:e655
    domain: rdmap32s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7a:70ff:feea:56bb
    domain: rdmap144s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::b5:58ff:fe4c:a9f
    domain: rdmap160s27-rdm
    version: 111.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::c8:f6ff:fe4d:1df3
    domain: rdmap16s27-dgrm
    version: 111.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::cc:b9ff:fe43:e655
    domain: rdmap32s27-dgrm
    version: 111.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7a:70ff:feea:56bb
    domain: rdmap144s27-dgrm
    version: 111.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::b5:58ff:fe4c:a9f
    domain: rdmap160s27-dgrm
    version: 111.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA

What version of aws-ofi-nccl plugin are you using?

v1.1.2, compiled with the following commands:

./autogen.sh --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi

./configure --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi

make 

sudo make install

Check your LD_LIBRARY_PATH; it may be that the aws-ofi-nccl plugin you compiled was not picked up.

On the p4d platform, there should be a line like

ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/ec2-user/install/plugin/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml

in the log, according to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html
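
Whether that line is present can be checked against a saved log. This is a minimal sketch that assumes the message wording shown above; the function name find_topo_file is hypothetical.

```shell
# Print the topology file the plugin reported selecting, or a note if no
# such line exists. Reads an NCCL_DEBUG=INFO log from stdin.
find_topo_file() {
    topo_path=$(sed -n 's/.*Setting NCCL_TOPO_FILE environment variable to //p' | head -n 1)
    if [ -n "$topo_path" ]; then
        echo "$topo_path"
    else
        echo "no NCCL_TOPO_FILE line found"
    fi
}

# Example with the line quoted above.
find_topo_file <<'EOF'
ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/ec2-user/install/plugin/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
EOF
```

The log posted at the top of this issue contains no such line, so the function would print the fallback note for it.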

LD_LIBRARY_PATH includes /usr/local/lib, where I installed the plugin.
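
When more than one build of the plugin is installed, it can help to list every copy that sits on the search path. NCCL loads its network plugin as libnccl-net.so; the sketch below only walks a colon-separated path and does not consider the loader's system default directories. The function name find_nccl_plugins is hypothetical.

```shell
# List every libnccl-net.so found in the given colon-separated search path,
# in search order. The first hit is the first candidate from that path.
find_nccl_plugins() {
    local IFS=:
    for dir in $1; do
        [ -n "$dir" ] && [ -e "$dir/libnccl-net.so" ] && echo "$dir/libnccl-net.so"
    done
    return 0
}

# Show the candidates for the current environment (prints nothing if none).
find_nccl_plugins "${LD_LIBRARY_PATH:-}"
```

If a pre-installed copy appears before the one in /usr/local/lib, that copy would be tried first.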

Where can I find p4d-24xl-topo.xml?
I can try to pass the file to NCCL manually.

p4d-24xl-topo.xml should be part of the aws-ofi-nccl plugin.

I didn't find the file in the source code.
However, the bandwidth issue is solved when I use the pre-compiled aws-ofi-nccl inside the /usr/local/cuda-11.0/efa/ folder.
Here is the command:

/opt/amazon/openmpi/bin/mpirun \
     -n ${NUM_PROCS} -H ${HOSTS} \
     -x FI_EFA_USE_DEVICE_RDMA=1 -x RDMAV_FORK_SAFE=1 --mca pml ^cm \
     -x LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:$LD_LIBRARY_PATH \
     -x NCCL_DEBUG=INFO \
     -x FI_PROVIDER="efa" --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20

So the plugin I compiled is possibly missing some configuration.

The file is only present on the aws branch (https://github.com/aws/aws-ofi-nccl/blob/aws/topology/p4d-24xl-topo.xml). Is it possible that code from the wrong branch was compiled?

What is the difference between the master branch and the aws branch? I thought this plugin was only for the AWS platform.

Compiling the source code from the aws branch solves the issue.

I followed up on your question about the difference between master and aws branches. The long-term goal of this project is to make a plugin that can be used to connect NCCL to any libfabric provider. Eventually the plan is to merge the AWS-specific branch into master and discontinue it (but no timeline on this at present).