bad performance on two p4 instances (nccl-v2.7.8, aws-ofi-nccl-v1.1.2)
zarzen opened this issue
Hi there,
I have installed the plugin with nccl-v2.7.8 and CUDA 11, but nccl-tests doesn't give reasonable results (I expected to see bus bandwidth over 40 GB/s). One thing I noticed is that all the inter-node NCCL channels go via NET/AWS Libfabric/0.
Here is the full log with output from nccl-debug info.
Any suggestions for configuring the plugin or NCCL?
# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 9738 on ip-172-31-13-103 device 0 [0x10] A100-SXM4-40GB
# Rank 1 Pid 9739 on ip-172-31-13-103 device 1 [0x10] A100-SXM4-40GB
# Rank 2 Pid 9740 on ip-172-31-13-103 device 2 [0x20] A100-SXM4-40GB
# Rank 3 Pid 9741 on ip-172-31-13-103 device 3 [0x20] A100-SXM4-40GB
# Rank 4 Pid 9742 on ip-172-31-13-103 device 4 [0x90] A100-SXM4-40GB
# Rank 5 Pid 9743 on ip-172-31-13-103 device 5 [0x90] A100-SXM4-40GB
# Rank 6 Pid 9744 on ip-172-31-13-103 device 6 [0xa0] A100-SXM4-40GB
# Rank 7 Pid 9748 on ip-172-31-13-103 device 7 [0xa0] A100-SXM4-40GB
# Rank 8 Pid 9921 on ip-172-31-6-104 device 0 [0x10] A100-SXM4-40GB
# Rank 9 Pid 9922 on ip-172-31-6-104 device 1 [0x10] A100-SXM4-40GB
# Rank 10 Pid 9923 on ip-172-31-6-104 device 2 [0x20] A100-SXM4-40GB
# Rank 11 Pid 9924 on ip-172-31-6-104 device 3 [0x20] A100-SXM4-40GB
# Rank 12 Pid 9925 on ip-172-31-6-104 device 4 [0x90] A100-SXM4-40GB
# Rank 13 Pid 9926 on ip-172-31-6-104 device 5 [0x90] A100-SXM4-40GB
# Rank 14 Pid 9927 on ip-172-31-6-104 device 6 [0xa0] A100-SXM4-40GB
# Rank 15 Pid 9928 on ip-172-31-6-104 device 7 [0xa0] A100-SXM4-40GB
ip-172-31-13-103:9738:9738 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9738:9738 [0] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9922:9922 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9923:9923 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9926:9926 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9924:9924 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9921:9921 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9927:9927 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9925:9925 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9928:9928 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9922:9922 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9923:9923 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9926:9926 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9924:9924 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9928:9928 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9925:9925 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9927:9927 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9921:9921 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-13-103:9744:9744 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9744:9744 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9748:9748 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9740:9740 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9739:9739 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9741:9741 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9743:9743 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9742:9742 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9740:9740 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9748:9748 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9739:9739 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9741:9741 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9800 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9738:9797 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9923:9981 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9927:9985 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9921:9986 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9739:9804 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9744:9798 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9748:9801 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9741:9803 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9743:9799 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9740:9802 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9928:9984 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9926:9982 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9925:9987 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9924:9983 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9922:9980 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ip-172-31-13-103:9738:9797 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9738:9797 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->9|9->0->1/-1/-1
ip-172-31-13-103:9738:9797 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-6-104:9925:9987 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9925:9987 [4] NCCL INFO Trees [0] 13/-1/-1->12->11|11->12->13/-1/-1 [1] 13/-1/-1->12->11|11->12->13/-1/-1
ip-172-31-6-104:9926:9982 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9926:9982 [5] NCCL INFO Trees [0] 14/-1/-1->13->12|12->13->14/-1/-1 [1] 14/-1/-1->13->12|12->13->14/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9739:9804 [1] NCCL INFO Trees [0] 2/8/-1->1->0|0->1->2/8/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9924:9983 [3] NCCL INFO Trees [0] 12/-1/-1->11->10|10->11->12/-1/-1 [1] 12/-1/-1->11->10|10->11->12/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9923:9981 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9923:9981 [2] NCCL INFO Trees [0] 11/-1/-1->10->9|9->10->11/-1/-1 [1] 11/-1/-1->10->9|9->10->11/-1/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9922:9980 [1] NCCL INFO Trees [0] 10/-1/-1->9->8|8->9->10/-1/-1 [1] 10/0/-1->9->8|8->9->10/0/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9928:9984 [7] NCCL INFO Trees [0] -1/-1/-1->15->14|14->15->-1/-1/-1 [1] -1/-1/-1->15->14|14->15->-1/-1/-1
ip-172-31-6-104:9923:9981 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-6-104:9927:9985 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9927:9985 [6] NCCL INFO Trees [0] 15/-1/-1->14->13|13->14->15/-1/-1 [1] 15/-1/-1->14->13|13->14->15/-1/-1
ip-172-31-6-104:9927:9985 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9925:9987 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9740:9802 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9740:9802 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-13-103:9740:9802 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-13-103:9741:9803 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9741:9803 [3] NCCL INFO Trees [0] 4/-1/-1->3->2|2->3->4/-1/-1 [1] 4/-1/-1->3->2|2->3->4/-1/-1
ip-172-31-13-103:9741:9803 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9926:9982 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9742:9800 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9742:9800 [4] NCCL INFO Trees [0] 5/-1/-1->4->3|3->4->5/-1/-1 [1] 5/-1/-1->4->3|3->4->5/-1/-1
ip-172-31-13-103:9742:9800 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9743:9799 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9743:9799 [5] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/-1/-1->5->4|4->5->6/-1/-1
ip-172-31-13-103:9743:9799 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9744:9798 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9744:9798 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
ip-172-31-13-103:9744:9798 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9748:9801 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9748:9801 [7] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
ip-172-31-13-103:9748:9801 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9921:9986 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9921:9986 [0] NCCL INFO Trees [0] 9/-1/-1->8->1|1->8->9/-1/-1 [1] 9/-1/-1->8->-1|-1->8->9/-1/-1
ip-172-31-6-104:9921:9986 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9742:9800 [4] NCCL INFO comm 0x7f2ac8000dc0 rank 4 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9925:9987 [4] NCCL INFO comm 0x7f8da8000dc0 rank 12 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9743:9799 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9743:9799 [5] NCCL INFO comm 0x7f73a4000dc0 rank 5 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-6-104:9926:9982 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9926:9982 [5] NCCL INFO comm 0x7f4328000dc0 rank 13 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9928:9984 [7] NCCL INFO comm 0x7f3a98000dc0 rank 15 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9927:9985 [6] NCCL INFO comm 0x7fa984000dc0 rank 14 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9924:9983 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9924:9983 [3] NCCL INFO comm 0x7f3bb8000dc0 rank 11 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO comm 0x7f6c4c000dc0 rank 3 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9923:9981 [2] NCCL INFO comm 0x7f5a24000dc0 rank 10 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9740:9802 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9740:9802 [2] NCCL INFO comm 0x7f36c8000dc0 rank 2 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9739:9804 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO comm 0x7fed38000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9748:9801 [7] NCCL INFO comm 0x7fbd04000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9744:9798 [6] NCCL INFO comm 0x7f23b8000dc0 rank 6 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9921:9986 [0] NCCL INFO comm 0x7fdc3c000dc0 rank 8 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9922:9980 [1] NCCL INFO comm 0x7f17c8000dc0 rank 9 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9738:9797 [0] NCCL INFO comm 0x7faad8000dc0 rank 0 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-13-103:9738:9738 [0] NCCL INFO Launch mode Parallel
8 2 float sum 79.04 0.00 0.00 4e-07 64.55 0.00 0.00 4e-07
16 4 float sum 63.60 0.00 0.00 4e-07 63.44 0.00 0.00 2e-07
32 8 float sum 64.62 0.00 0.00 2e-07 65.10 0.00 0.00 1e-07
64 16 float sum 65.01 0.00 0.00 1e-07 64.24 0.00 0.00 1e-07
128 32 float sum 65.24 0.00 0.00 1e-07 64.54 0.00 0.00 1e-07
256 64 float sum 65.57 0.00 0.01 1e-07 64.45 0.00 0.01 1e-07
512 128 float sum 66.94 0.01 0.01 1e-07 66.50 0.01 0.01 1e-07
1024 256 float sum 68.47 0.01 0.03 4e-07 68.17 0.02 0.03 4e-07
2048 512 float sum 72.68 0.03 0.05 4e-07 72.91 0.03 0.05 4e-07
4096 1024 float sum 78.42 0.05 0.10 4e-07 77.77 0.05 0.10 4e-07
8192 2048 float sum 83.65 0.10 0.18 4e-07 81.94 0.10 0.19 4e-07
16384 4096 float sum 96.39 0.17 0.32 4e-07 93.38 0.18 0.33 4e-07
32768 8192 float sum 116.7 0.28 0.53 4e-07 114.8 0.29 0.54 4e-07
65536 16384 float sum 155.3 0.42 0.79 4e-07 153.4 0.43 0.80 4e-07
131072 32768 float sum 203.0 0.65 1.21 4e-07 203.7 0.64 1.21 4e-07
262144 65536 float sum 315.6 0.83 1.56 4e-07 311.2 0.84 1.58 4e-07
524288 131072 float sum 409.9 1.28 2.40 4e-07 407.1 1.29 2.41 4e-07
1048576 262144 float sum 597.2 1.76 3.29 4e-07 594.6 1.76 3.31 4e-07
2097152 524288 float sum 926.9 2.26 4.24 4e-07 924.9 2.27 4.25 4e-07
4194304 1048576 float sum 1583.5 2.65 4.97 4e-07 1584.1 2.65 4.96 4e-07
8388608 2097152 float sum 2939.5 2.85 5.35 4e-07 2929.9 2.86 5.37 4e-07
16777216 4194304 float sum 5366.1 3.13 5.86 4e-07 5381.9 3.12 5.85 4e-07
33554432 8388608 float sum 10305 3.26 6.10 4e-07 10294 3.26 6.11 4e-07
67108864 16777216 float sum 20358 3.30 6.18 4e-07 20341 3.30 6.19 4e-07
134217728 33554432 float sum 39328 3.41 6.40 4e-07 39392 3.41 6.39 4e-07
268435456 67108864 float sum 77210 3.48 6.52 4e-07 77304 3.47 6.51 4e-07
536870912 134217728 float sum 152989 3.51 6.58 4e-07 152798 3.51 6.59 4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 2.32362
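A note on reading the table above: for all-reduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw * 2(n-1)/n, where n is the number of ranks. The sketch below recomputes the last out-of-place row (536870912 bytes in 152989 us on 16 ranks). The function name allreduce_busbw is made up for illustration.

```sh
#!/bin/sh
# allreduce_busbw SIZE_BYTES TIME_US NRANKS
# Prints "algbw busbw" in GB/s, using the all-reduce correction factor
# 2*(n-1)/n that nccl-tests applies.
allreduce_busbw() {
    awk -v size="$1" -v t="$2" -v n="$3" 'BEGIN {
        algbw = size / t / 1000            # bytes per microsecond -> GB/s
        busbw = algbw * 2 * (n - 1) / n    # all-reduce bus bandwidth
        printf "%.2f %.2f\n", algbw, busbw
    }'
}

allreduce_busbw 536870912 152989 16
```

This prints 3.51 6.58, which matches the algbw and busbw columns of that row.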
Hi, the bandwidth is indeed too low for p4d. Two things worth looking into:
- Make sure you set the environment variable FI_EFA_USE_DEVICE_RDMA to 1. This means passing -x FI_EFA_USE_DEVICE_RDMA=1 to your mpirun command. You should then see [receive/send] via NET/AWS Libfabric/GDRDMA in the log.
- Make sure you have 4 EFA devices attached to each p4d instance. You can check this by running lspci on your instance.
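Both checks can be scripted. This is a minimal sketch, assuming EFA devices appear in lspci output as "Device efa0" (as they do in the lspci listing later in this thread) and that the NCCL_DEBUG=INFO output has been saved to a file. The helper names count_efa_devices and uses_gdrdma are hypothetical.

```sh
#!/bin/sh
# Count EFA devices in lspci output read from stdin.
# Assumption: each EFA device shows up as a line containing "Device efa0".
count_efa_devices() {
    grep -c 'Device efa0'
}

# Succeed if an NCCL debug log read from stdin shows at least one network
# channel that reports GDRDMA (GPUDirect RDMA).
uses_gdrdma() {
    grep -q 'via NET/AWS Libfabric/.*GDRDMA'
}

# Example usage on a p4d instance (nccl.log is a hypothetical file name):
#   lspci | count_efa_devices        # expect 4 on p4d
#   uses_gdrdma < nccl.log && echo "GDRDMA in use"
```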
Make sure you set the environment variable FI_EFA_USE_DEVICE_RDMA to 1. This would mean pass -x FI_EFA_USE_DEVICE_RDMA=1 to your mpirun command. You should see [receive/send] via NET/AWS Libfabric/GDRDMA
You must also be sure to pass the -g option to efa_installer.sh on your host instance.
Hi @leezu
I didn't install the EFA software from scratch; it came preinstalled with the Deep Learning AMI (Ubuntu 18.04, version 43.0).
Do you mean I need to reinstall it?
Hi @wzamazon
I have passed FI_EFA_USE_DEVICE_RDMA=1 to the mpirun command. Here is the launch script:
/opt/amazon/openmpi/bin/mpirun \
-n ${NUM_PROCS} -H ${HOSTS} \
-x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
-x FI_EFA_USE_DEVICE_RDMA=1 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20
And here is the lspci output; 4 EFA NICs are attached. I have also verified this with the fi_info -p efa command.
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
10:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
10:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
10:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
10:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
20:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
20:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
80:1a.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1b.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1c.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1d.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1e.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1f.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
90:02.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
90:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
90:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
90:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:03.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
a0:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
a0:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
fi_info -p efa log:
provider: efa
fabric: EFA-fe80::c8:f6ff:fe4d:1df3
domain: rdmap16s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cc:b9ff:fe43:e655
domain: rdmap32s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7a:70ff:feea:56bb
domain: rdmap144s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::b5:58ff:fe4c:a9f
domain: rdmap160s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::c8:f6ff:fe4d:1df3
domain: rdmap16s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cc:b9ff:fe43:e655
domain: rdmap32s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7a:70ff:feea:56bb
domain: rdmap144s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::b5:58ff:fe4c:a9f
domain: rdmap160s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
What version of aws-ofi-nccl plugin are you using?
v1.1.2, compiled with the following commands:
./autogen.sh --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
./configure --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
make
sudo make install
Check your LD_LIBRARY_PATH; it may be that the aws-ofi-nccl plugin you compiled was not picked up.
On the p4d platform, there should be a line like
ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/ec2-user/install/plugin/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
in the log. According to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html
LD_LIBRARY_PATH
The LD_LIBRARY_PATH includes /usr/local/lib, where I have installed the plugin.
Where can I find p4d-24xl-topo.xml? I can try to pass the file to NCCL manually.
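One way to see which plugin build is likely to be loaded: NCCL looks for its network plugin as a shared library named libnccl-net.so, so listing every directory on LD_LIBRARY_PATH that contains that file, in search order, shows which copy comes first. The sketch below does that and also checks a saved NCCL log for the p4d topology message quoted above. The function names are hypothetical.

```sh
#!/bin/sh
# Print every directory in a colon-separated path list (for example
# "$LD_LIBRARY_PATH") that contains libnccl-net.so, in list order.
# Directories earlier in LD_LIBRARY_PATH are searched first.
find_nccl_net_plugins() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        if [ -n "$dir" ] && [ -e "$dir/libnccl-net.so" ]; then
            printf '%s\n' "$dir"
        fi
    done
    IFS=$old_ifs
}

# Succeed if an NCCL debug log read from stdin contains the message the
# plugin prints when it detects p4d and sets NCCL_TOPO_FILE.
sets_p4d_topology() {
    grep -q 'NET/OFI Running on P4d platform'
}

find_nccl_net_plugins "${LD_LIBRARY_PATH:-}"
```

If more than one directory is printed, the copy in the first directory is the likely candidate. If sets_p4d_topology fails on the log, the loaded plugin probably lacks the p4d topology handling.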
p4d-24xl-topo.xml should be part of the aws-ofi-nccl plugin.
I didn't find the file in the source code.
However, the bandwidth issue is solved when I use the pre-compiled aws-ofi-nccl inside the /usr/local/cuda-11.0/efa/ folder.
Here is the command:
/opt/amazon/openmpi/bin/mpirun \
-n ${NUM_PROCS} -H ${HOSTS} \
-x FI_EFA_USE_DEVICE_RDMA=1 -x RDMAV_FORK_SAFE=1 --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x FI_PROVIDER="efa" --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20
So the plugin I compiled is possibly missing some configuration.
The file is only present on the aws branch (https://github.com/aws/aws-ofi-nccl/blob/aws/topology/p4d-24xl-topo.xml). Is it possible that code from the wrong branch was compiled?
What is the difference between the master branch and the aws branch? I thought this plugin was only for the AWS platform.
Compiling the source code from the aws branch solves the issue.
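For anyone hitting the same problem, the build can be sketched as follows, reusing the configure options from earlier in this thread. It assumes the repository behind the link above, that the aws branch still exists, and the same install paths as this setup; the wrapper function name is hypothetical.

```sh
#!/bin/sh
# Clone the aws branch of aws-ofi-nccl and build it with the configure
# options used earlier in this thread. Paths are specific to that setup
# and are assumptions for any other machine.
build_plugin_from_aws_branch() (
    set -e
    git clone -b aws https://github.com/aws/aws-ofi-nccl.git
    cd aws-ofi-nccl
    ./autogen.sh
    ./configure --prefix=/usr/local \
        --with-libfabric=/opt/amazon/efa \
        --with-cuda=/usr/local/cuda \
        --with-nccl=/usr/local/cuda \
        --with-mpi=/opt/amazon/openmpi
    make
    sudo make install
)

# Nothing runs until the function is called explicitly:
#   build_plugin_from_aws_branch
```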
I followed up on your question about the difference between the master and aws branches. The long-term goal of this project is to make a plugin that can be used to connect NCCL to any libfabric provider. Eventually the plan is to merge the AWS-specific branch into master and discontinue it, though there is no timeline for this at present.