prov/tcp: reported IP addresses are not consistent
shefty opened this issue · comments
This likely impacts multiple providers and is probably not a problem in libfabric itself. The following snapshots show the issue: an IPv4 address is listed first in one case, but an IPv6 address in another. These runs were back to back on the same system. In the majority of cases, the first address is an IPv6 address.
[~]$ fi_info -v -p tcp | grep src_addr:
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
[~]$ fi_info -v -p tcp | grep src_addr:
src_addr: fi_sockaddr_in6://[fe80::3188:9e34:2bb1:be38]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in6://[fe80::43ad:7dff:e225:310]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in6://[fe80::3188:9e34:2bb1:be38]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in6://[fe80::43ad:7dff:e225:310]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in6://[fe80::3188:9e34:2bb1:be38]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in6://[fe80::43ad:7dff:e225:310]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
src_addr: fi_sockaddr_in6://[fe80::3188:9e34:2bb1:be38]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
src_addr: fi_sockaddr_in6://[fe80::43ad:7dff:e225:310]:0
src_addr: fi_sockaddr_in://127.0.0.1:0
src_addr: fi_sockaddr_in6://[::1]:0
It can take several executions before the ordering changes. The impact is sporadic MPI startup failures: both OMPI and IMPI have seen failures during startup that were traced back to one rank using IPv4 addressing while a peer used IPv6, which causes the connect to fail.
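As a quick diagnostic, the address family of the first reported address can be checked by parsing the fi_info output. A minimal sketch (the sample lines below are taken from the runs above; the helper name is illustrative, not part of libfabric):

```python
# Sketch: classify the first src_addr reported by `fi_info -v -p tcp`
# to detect runs where an IPv6 address is listed before IPv4.

def first_addr_family(fi_info_output):
    """Return 'ipv4' or 'ipv6' for the first src_addr line, else None."""
    for line in fi_info_output.splitlines():
        line = line.strip()
        if line.startswith("src_addr:"):
            # Check the in6 prefix first; the trailing colon keeps
            # "fi_sockaddr_in:" from matching "fi_sockaddr_in6://".
            if "fi_sockaddr_in6:" in line:
                return "ipv6"
            if "fi_sockaddr_in:" in line:
                return "ipv4"
    return None

run1 = """\
src_addr: fi_sockaddr_in://192.168.4.6:0
src_addr: fi_sockaddr_in6://[fe80::a6bf:1ff:fe64:3f8f]:0
"""
run2 = """\
src_addr: fi_sockaddr_in6://[fe80::3188:9e34:2bb1:be38]:0
src_addr: fi_sockaddr_in://192.168.4.6:0
"""
print(first_addr_family(run1))  # ipv4
print(first_addr_family(run2))  # ipv6
```

Running this against repeated `fi_info -v -p tcp` invocations makes the flip visible without eyeballing the full address list.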
The problem appears to be in the system or kernel. Note the differences in the ip addr output:
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether a4:bf:01:64:3f:8f brd ff:ff:ff:ff:ff:ff
inet 192.168.4.6/24 brd 192.168.4.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::a6bf:1ff:fe64:3f8f/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether a4:bf:01:64:3f:90 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether b4:96:91:91:73:70 brd ff:ff:ff:ff:ff:ff
5: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether b4:96:91:91:73:71 brd ff:ff:ff:ff:ff:ff
Same system a couple of seconds later:
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether a4:bf:01:64:3f:8f brd ff:ff:ff:ff:ff:ff
inet 192.168.4.6/24 brd 192.168.4.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::a6bf:1ff:fe64:3f8f/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether a4:bf:01:64:3f:90 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether b4:96:91:91:73:70 brd ff:ff:ff:ff:ff:ff
inet6 fe80::3188:9e34:2bb1:be38/64 scope link noprefixroute
valid_lft forever preferred_lft forever
5: eth3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
link/ether b4:96:91:91:73:71 brd ff:ff:ff:ff:ff:ff
inet6 fe80::43ad:7dff:e225:310/64 scope link tentative noprefixroute
valid_lft forever preferred_lft forever
A new IPv6 address is reported for eth3, even though the link is down?
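One likely explanation: the `tentative` flag on the eth3 address indicates IPv6 Duplicate Address Detection (DAD) was still in progress when the snapshot was taken, and link-local addresses can appear asynchronously even on interfaces without carrier. On Linux, per-address IPv6 flags are visible in /proc/net/if_inet6 (fields: address, ifindex, prefix length, scope, flags, name). A hedged sketch that decodes the tentative bit (IFA_F_TENTATIVE, 0x40, from linux/if_addr.h); the sample line is hypothetical, mirroring the eth3 address above:

```python
# Sketch: decode the flags field of a /proc/net/if_inet6 line to spot
# tentative (DAD-in-progress) IPv6 addresses.
IFA_F_TENTATIVE = 0x40  # from linux/if_addr.h

def is_tentative(if_inet6_line):
    """True if the address on this /proc/net/if_inet6 line is tentative."""
    fields = if_inet6_line.split()
    flags = int(fields[4], 16)  # fifth field is the hex flags word
    return bool(flags & IFA_F_TENTATIVE)

# Hypothetical sample line for the eth3 link-local address shown above:
sample = "fe8000000000000043ad7dffe2250310 05 40 20 40     eth3"
print(is_tentative(sample))  # True
```

Providers (or applications) that enumerate addresses could skip tentative entries to avoid advertising an address that may still fail DAD.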
Closing: this is a system-level issue, unrelated to libfabric.