ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/efa: fi_info crash in a system with mlnx but no efa device

chien-intel opened this issue

Describe the bug
libfabric source at commit 54edc09, configured with debug and valgrind.

Running fi_info on a system that has a Mellanox device but no EFA device produced this output:

fi_info
double free or corruption (!prev)
Aborted (core dumped)

Here is the gdb stack trace:
(gdb) bt
#0 0x00007ffff5e1437f in raise () from //lib64/libc.so.6
#1 0x00007ffff5dfedb5 in abort () from //lib64/libc.so.6
#2 0x00007ffff5e574e7 in __libc_message () from //lib64/libc.so.6
#3 0x00007ffff5e5e5ec in malloc_printerr () from //lib64/libc.so.6
#4 0x00007ffff5e6039c in _int_free () from //lib64/libc.so.6
#5 0x00007ffff4cfd3f2 in mlx5_free_context (ibctx=0x676190) at providers/mlx5/mlx5.c:1407
#6 0x00007ffff6bf08b5 in _ibv_close_device_1_1 (context=) at libibverbs/device.c:384
#7 0x00007ffff77b3ca7 in efa_device_destruct (device=0x66fb20) at prov/efa/src/efa_device.c:180
#8 0x00007ffff77b3ecd in efa_device_list_finalize () at prov/efa/src/efa_device.c:254
#9 0x00007ffff77b3e54 in efa_device_list_initialize () at prov/efa/src/efa_device.c:237
#10 0x00007ffff77c28ea in efa_prov_initialize () at prov/efa/src/efa_fabric.c:269
#11 0x00007ffff77c92dd in fi_efa_ini () at prov/efa/src/rxr/rxr_prov.c:111
#12 0x00007ffff770d6ff in fi_ini () at src/fabric.c:856
#13 0x00007ffff770e093 in fi_getinfo
(version=65552, node=0x0, service=0x0, flags=0, hints=0x0, info=0x7fffffffd740) at src/fabric.c:1101
#14 0x0000000000401cc0 in run (hints=0x0, node=0x0, port=0x0, flags=0) at util/info.c:324
#15 0x0000000000402110 in main (argc=1, argv=0x7fffffffd888) at util/info.c:448
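
In the trace above, efa_device_list_initialize() calls efa_device_list_finalize() (frames #9 and #8), and the abort happens while that cleanup closes the mlx5 context. One common way a "double free or corruption" abort arises on such a path is that a handle gets released twice, once where the failure is detected and again during list-level cleanup. The sketch below only illustrates that general pattern and a defensive way to avoid it. It is not the code from libfabric or from PR #7806, and every name in it (`fake_ctx`, `fake_device`, `fake_open`, `fake_close`, `device_construct`, `device_destruct`, `simulate_failed_init`) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not libfabric or rdma-core types. */
struct fake_ctx {
    int *resource;              /* memory owned by the context */
};

struct fake_device {
    struct fake_ctx *ctx;       /* NULL whenever no context is open */
};

static int close_calls;         /* counts real close operations */

/* Allocate a context and the resource it owns. */
static struct fake_ctx *fake_open(void)
{
    struct fake_ctx *ctx = malloc(sizeof(*ctx));
    if (!ctx)
        return NULL;
    ctx->resource = malloc(sizeof(*ctx->resource));
    if (!ctx->resource) {
        free(ctx);
        return NULL;
    }
    return ctx;
}

/* Release a context; calling this twice on one pointer is a double free. */
static void fake_close(struct fake_ctx *ctx)
{
    free(ctx->resource);
    free(ctx);
    close_calls++;
}

/* Idempotent destructor: closes at most once, then clears the pointer. */
static void device_destruct(struct fake_device *dev)
{
    if (dev->ctx) {
        fake_close(dev->ctx);
        dev->ctx = NULL;
    }
}

/* Open the device, then reject it if it is not the supported kind. */
static int device_construct(struct fake_device *dev, int is_supported)
{
    dev->ctx = fake_open();
    if (!dev->ctx)
        return -1;
    if (!is_supported) {
        device_destruct(dev);   /* first cleanup, at the failure site */
        return -1;
    }
    return 0;
}

/* Simulate a failed init followed by list-level cleanup. */
int simulate_failed_init(void)
{
    struct fake_device dev = { NULL };

    close_calls = 0;
    if (device_construct(&dev, 0) != 0)
        device_destruct(&dev);  /* second cleanup pass: a no-op */
    return close_calls;
}
```

Calling `simulate_failed_init()` returns 1: the context is closed once even though cleanup runs twice. Without the NULL check and reset in `device_destruct`, the second pass would free the same context again.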

To Reproduce
Build libfabric from source at commit 54edc09, configured with debug and valgrind, and run fi_info on a system that has a Mellanox device but no EFA device. Probably any verbs-capable device other than EFA will do.

Expected behavior
fi_info should display provider info and not crash.

Output
see description.

Environment:
Reproduced on RHEL 8.2 and 8.5 with Mellanox and rdma-core installed.

@ofiwg/aws-efa-team

looking into it.

#7806 should fix the issue. @chien-intel would you please try this patch?

PR #7806 fixed this issue. Feel free to close this issue after the PR is merged.

Thank you! Will merge after CI finishes.

PR merged. I also checked that this issue applies only to the main branch, so no backport is needed.

Closing ...

thank you.