fi_rdm_shared_av does not work with verbs providers
nmorey opened this issue
Running libfabric/fabtests 1.4.2 on SUSE SLES12-SP3
Running the fi_rdm_shared_av test in client/server mode over verbs fails with a segfault:
wingenfelder:~/:[0]# fi_rdm_shared_av -p verbs -s 192.168.0.1
janacek:~/:[0]# gdb --args fi_rdm_shared_av -p verbs -s 192.168.0.2 192.168.0.1
(gdb) set follow-fork-mode child
(gdb) r
Starting program: /usr/bin/fi_rdm_shared_av -p verbs -s 192.168.0.2 192.168.0.1
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-61.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New process 27398]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff3e63700 (LWP 27410)]
Thread 2.1 "fi_rdm_shared_a" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fdb780 (LWP 27398)]
0x0000000000401a3f in run () at simple/rdm_shared_av.c:141
141 simple/rdm_shared_av.c: No such file or directory.
Missing separate debuginfos, use: zypper install libibverbs-debuginfo-14-6.7.x86_64 libibverbs1-debuginfo-14-6.7.x86_64 libinfinipath4-debuginfo-3.3-7.7.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64 libpsm2-2-debuginfo-10.2.103-2.6.x86_64 libpsm_infinipath1-debuginfo-3.3-7.7.x86_64 librdmacm1-debuginfo-14-6.7.x86_64 libuuid1-debuginfo-2.29.2-2.3.x86_64
(gdb) bt
#0 0x0000000000401a3f in run () at simple/rdm_shared_av.c:141
#1 main (argc=6, argv=<optimized out>) at simple/rdm_shared_av.c:196
Error is on this line:
remote_fi_addr = ((fi_addr_t *)av_attr.map_addr)[0];
Looking into the code, it seems only the sockets provider fills in map_addr (and the test works over sockets).
A quick look at the 1.5.0rc1 code suggests the bug is still there (I haven't tried it yet).
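For illustration, here is a minimal sketch of a defensive read that would avoid the crash on the line quoted above. It assumes the provider leaves map_addr as NULL when it does not fill it, which is not confirmed in this thread. The struct av_attr_sketch type and the first_mapped_addr() helper are hypothetical stand-ins so the snippet builds without the libfabric headers; they are not part of libfabric or fabtests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for libfabric's fi_addr_t so the sketch builds on its own. */
typedef uint64_t fi_addr_t;

/* Hypothetical subset of struct fi_av_attr: only the field used here. */
struct av_attr_sketch {
    void *map_addr; /* assumed NULL when the provider does not fill it */
};

/*
 * Hypothetical helper: copy the first mapped address into *out.
 * Returns 0 on success, -1 if the provider supplied no address map.
 */
static int first_mapped_addr(const struct av_attr_sketch *attr, fi_addr_t *out)
{
    assert(out != NULL); /* caller must supply storage for the result */

    if (attr == NULL || attr->map_addr == NULL)
        return -1; /* nothing to read: report it instead of dereferencing */

    *out = ((const fi_addr_t *)attr->map_addr)[0];
    return 0;
}
```

With a check like this, the test could print an error and exit cleanly on a provider that does not populate the map instead of receiving SIGSEGV.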
The fi_rdm_shared_av test in fabtests 1.5rc1 checks for the FI_SHARED_AV capability and exits if the provider doesn't support it.
I'll update the package for SUSE to 1.5 and check that. Thanks
This is fixed in 1.5.0rc1 but this test now fails:
wingenfelder:/tmp/:[61]# fi_rma_bw -e rdm -o writedata -I 5 -p "verbs" -s 192.168.0.1 192.168.0.2
fi_inject_writedata(): common/shared.c:1503, ret=-38 (Function not implemented)
The `fi_rma_bw -e rdm -o writedata` test is not supported by verbs/RDM. It is, however, supported by ofi_rxm over verbs. You can run the test with `fi_rma_bw -e rdm -o writedata -p "ofi_rxm;verbs"`. ofi_rxm is a "utility" provider that emulates an RDM endpoint over the MSG endpoint of a core provider.
I don't expect it to work over verbs, but I do expect to be able to run the test suite with runfabtests without it failing, which is not currently the case.
There is a plan to make runfabtests.sh run only the tests that a provider supports. That change will make it into the repo sometime later, though.