ofiwg/libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/verbs;ofi_rxm: rxm_handle_error():793<warn> fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8)

bfaccini opened this issue

Describe the bug
We experience a huge performance slow-down when running the mdtest-hard-read phase of IO500 with 32 client nodes and 64 tasks/node against a single DAOS server.

During the same period of time, the following libfabric log message sequence is repeated an enormous number of times:
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_handle_error():793 fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8)
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_process_reject():543 Processing reject for handle: 0x7f19d4984880
libfabric:1522427:1657212692::ofi_rxm:ep_ctrl:rxm_process_reject():565 closing conn 0x7f19d4984880, reason 1
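
For reference, the fi_eq_readerr() named in these messages is the libfabric call that drains error events from a connection-management event queue; the err field is an errno-style code (111 = ECONNREFUSED) while prov_err is provider specific, which is why it prints as "Unknown error -8". A minimal sketch of reading and decoding such an entry (illustration only, not the actual rxm provider code; it assumes an already-opened event queue eq):

/* Sketch: drain one CM error entry from a libfabric event queue. */
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <stdio.h>
#include <string.h>

static void check_cm_error(struct fid_eq *eq)
{
	struct fi_eq_err_entry err_entry;

	memset(&err_entry, 0, sizeof(err_entry));
	if (fi_eq_readerr(eq, &err_entry, 0) < 0)
		return;  /* no error entry pending, or the EQ read failed */

	/* err is errno-style (111 == ECONNREFUSED in the log above);
	 * prov_errno is decoded by the provider via fi_eq_strerror() */
	fprintf(stderr, "CM error: %s (%d), prov_err: %s (%d)\n",
		fi_strerror(err_entry.err), err_entry.err,
		fi_eq_strerror(eq, err_entry.prov_errno,
			       err_entry.err_data, NULL, 0),
		err_entry.prov_errno);
}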

Taking server stack dumps on the fly always shows threads with the following stack/context:
#0 0x00007fc6270ad62b in ioctl () from /lib64/libc.so.6
#1 0x00007fc620d5325a in execute_ioctl () from /lib64/libibverbs.so.1
#2 0x00007fc620d525bf in _execute_ioctl_fallback () from /lib64/libibverbs.so.1
#3 0x00007fc620d54dbf in ibv_cmd_destroy_qp () from /lib64/libibverbs.so.1
#4 0x00007fc5f41fbe59 in mlx5_destroy_qp () from /usr/lib64/libibverbs/libmlx5-rdmav34.so
#5 0x00007fc620b336a1 in rdma_destroy_qp () from /lib64/librdmacm.so.1
#6 0x00007fc620b35dd4 in rdma_destroy_ep () from /lib64/librdmacm.so.1
#7 0x00007fc6215e9d8c in vrb_ep_close (fid=0x7fbee113c420) at prov/verbs/src/verbs_ep.c:513
#8 0x00007fc62160b083 in fi_close (fid=<optimized out>) at ./include/rdma/fabric.h:603
#9 rxm_close_conn (conn=<optimized out>) at prov/rxm/src/rxm_conn.c:88
#10 rxm_close_conn (conn=0x7fc3e507ed38) at prov/rxm/src/rxm_conn.c:58
#11 0x00007fc62160c01d in rxm_process_reject (entry=0x47e3340, entry=0x47e3340, conn=<optimized out>) at prov/rxm/src/rxm_conn.c:446
#12 rxm_handle_error (ep=ep@entry=0x7fc218055fe0) at prov/rxm/src/rxm_conn.c:660
#13 0x00007fc62160c3a0 in rxm_conn_progress (ep=ep@entry=0x7fc218055fe0) at prov/rxm/src/rxm_conn.c:703
#14 0x00007fc62160c465 in rxm_get_conn (ep=ep@entry=0x7fc218055fe0, addr=addr@entry=1176, conn=conn@entry=0x47e34a8) at prov/rxm/src/rxm_conn.c:393
#15 0x00007fc62161160d in rxm_ep_tsend (ep_fid=0x7fc218055fe0, buf=<optimized out>, len=<optimized out>, desc=<optimized out>, dest_addr=1176, tag=409959, context=0x7fc2185c13f8) at prov/rxm/src/rxm_ep.c:2120
#16 0x00007fc625bb7aec in na_ofi_progress () from /lib64/libna.so.2
#17 0x00007fc625baee73 in NA_Progress () from /lib64/libna.so.2
#18 0x00007fc625fe597e in hg_core_progress_na () from /lib64/libmercury.so.2
#19 0x00007fc625fe7f53 in hg_core_progress () from /lib64/libmercury.so.2
#20 0x00007fc625fed05b in HG_Core_progress () from /lib64/libmercury.so.2
#21 0x00007fc625fdf193 in HG_Progress () from /lib64/libmercury.so.2
#22 0x00007fc628a41616 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7fc218049138, timeout=timeout@entry=0) at src/cart/crt_hg.c:1285
#23 0x00007fc628a01607 in crt_progress (crt_ctx=0x7fc218049120, timeout=0) at src/cart/crt_context.c:1472
#24 0x000000000043c8fd in dss_srv_handler (arg=0x46f2a30) at src/engine/srv.c:486
#25 0x00007fc627c72ece in ABTD_ythread_func_wrapper (p_arg=0x47e3ce0) at arch/abtd_ythread.c:21
#26 0x00007fc627c73071 in make_fcontext () from /lib64/libabt.so.1

This confirms the messages being logged, and also explains the performance cost, since the kernel is involved to access the IB board and destroy the QPs.
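
To put a rough number on that per-QP teardown cost independently of DAOS and libfabric, a microbenchmark along the following lines could be used. This is only a sketch under the assumption that dummy RC QPs can be created on the first device found; it times ibv_destroy_qp(), the same kernel path shown by frames #0-#4 above (build with something like gcc qp_destroy_bench.c -libverbs, where the file name is just an example):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	if (!devs || !devs[0]) {
		fprintf(stderr, "no RDMA device found\n");
		return 1;
	}
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
	struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 16, NULL, NULL, 0) : NULL;
	if (!pd || !cq) {
		fprintf(stderr, "failed to set up PD/CQ\n");
		return 1;
	}

	struct ibv_qp_init_attr attr = {
		.send_cq = cq, .recv_cq = cq,
		.cap = { .max_send_wr = 16, .max_recv_wr = 16,
			 .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};

	enum { ITERS = 1000 };
	struct timespec t0, t1;
	double destroy_ns = 0;
	int done = 0;

	for (int i = 0; i < ITERS; i++) {
		struct ibv_qp *qp = ibv_create_qp(pd, &attr);
		if (!qp)
			break;
		/* each ibv_destroy_qp() goes through the kernel driver,
		 * the same path shown by frames #0-#4 in the stack above */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		ibv_destroy_qp(qp);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		destroy_ns += (t1.tv_sec - t0.tv_sec) * 1e9 +
			      (t1.tv_nsec - t0.tv_nsec);
		done++;
	}

	if (done)
		printf("avg ibv_destroy_qp over %d QPs: %.1f us\n",
		       done, destroy_ns / done / 1000.0);

	ibv_destroy_cq(cq);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	return 0;
}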

To Reproduce
Steps to reproduce the behavior:
This is 100% reproducible with the configuration indicated above.
libfabric-1.15.1 is being used with MOFED 5.5

Expected behavior
If needed, a clear and concise description of what you expected to happen.

Output
If applicable, add output to help explain your problem. (e.g. backtrace, debug logs)

Environment:
OS (if not Linux), provider, endpoint type, etc.

Additional context
Add any other context about the problem here.

Driver is always involved when creating/destroying a QP with a verbs capable device.
Can you give me two additional data points? Please run "watch -n 1 rdma resource" and report back the qp number at the time of the error. It should look something like this (with bigger numbers): 0: mlx5_0: pd 2 cq 26 qp 15 cm_id 0 mr 0 ctx 0 srq 0
Also, if you can, enable a print from the Mellanox driver with this:
echo "func mlx5_cmd_check +p" > /sys/kernel/debug/dynamic_debug/control
and check dmesg output for anything from that function.

Hello @chien-intel and first of all thanks for your quick reply and ideas!

Unfortunately, the echo "func mlx5_cmd_check +p" > /sys/kernel/debug/dynamic_debug/control setting (confirmed enabled below):

[root@nvm0805 ~]# grep mlx5_cmd_check /sys/kernel/debug/dynamic_debug/control
/tmp/MLNX_OFED_LINUX-5.5-1.0.3.2-4.18.0-305.25.1.el8_4.x86_64/mlnx_iso.12643/OFED_topdir/BUILD/mlnx-ofa_kernel-5.5/obj/default/drivers/net/ethernet/mellanox/mlx5/core/cmd.c:827 [mlx5_core]mlx5_cmd_check =p "%s:%d:(pid %d): %s(0x%x) op_mod(0x%x) failed, status %s(0x%x), syndrome (0x%x)\012"

did not produce any log/trace in dmesg during the critical phase :-(

And you will find attached the output (rdma_resource.2022-07-12T111812.nvm0805.log) of the rdma resource command running every second, with an interleaved timestamp string every minute. The time window where the rxm_handle_error() messages showed up is between 07/12-11:43 and 07/12-12:36.

The IO500 compute job we use to reproduce this ran on 32 nodes with 64 tasks/node (2048 client tasks), connected to a single DAOS server node running 2 server instances/engines with 20 targets/engine (one dedicated HDR interface per engine). That gives roughly 80K concurrent connections (2048 tasks × 40 targets, i.e. ~40K on each HDR interface), which seems consistent with the number of QPs reported for each IB interface.

Can you double-check your system setup and make sure all nodes are using the same version of the software?
From the stack trace, frame #15 (rxm_ep_tsend) is not in the rxm_ep.c file; that code was moved by commit 9014b21 after the v1.15.0 release. I suspect you have mismatched software on your nodes. Please re-run the test after ensuring all nodes are running the same software.

From the stack trace, frame #15 (rxm_ep_tsend) is not in the rxm_ep.c file; that code was moved by commit 9014b21 after the v1.15.0 release. I suspect you have mismatched software on your nodes.

Oops, I need to apologize: the unwound stack I added above comes from our first investigations when running with libfabric 1.14.0 :-( Sorry about that; I will get a new version of it and attach it here soon.

So the new stack (i.e. with libfabric-1.15.1) of the DAOS engine threads most likely contributing to the performance degradation now looks like this:

#0  0x00007f90ed22862b in ioctl () from /lib64/libc.so.6
#1  0x00007f90e68342fa in execute_ioctl () from /lib64/libibverbs.so.1
#2  0x00007f90e683365f in _execute_ioctl_fallback () from /lib64/libibverbs.so.1
#3  0x00007f90e6835e5f in ibv_cmd_destroy_qp () from /lib64/libibverbs.so.1
#4  0x00007f90b03e1ec9 in mlx5_destroy_qp () from /usr/lib64/libibverbs/libmlx5-rdmav34.so
#5  0x00007f90e66146a1 in rdma_destroy_qp () from /lib64/librdmacm.so.1
#6  0x00007f90e6616dd4 in rdma_destroy_ep () from /lib64/librdmacm.so.1
#7  0x00007f90e7959752 in vrb_ep_close (fid=0x7edb72d2d940) at prov/verbs/src/verbs_ep.c:521
#8  0x00007f90e797b453 in fi_close (fid=<optimized out>) at ./include/rdma/fabric.h:601
#9  rxm_close_conn (conn=<optimized out>) at prov/rxm/src/rxm_conn.c:88
#10 rxm_close_conn (conn=0x7f90883f8e40) at prov/rxm/src/rxm_conn.c:58
#11 0x00007f90e797c525 in rxm_process_reject (entry=0x56de9c0, entry=0x56de9c0, conn=<optimized out>) at prov/rxm/src/rxm_conn.c:563
#12 rxm_handle_error (ep=<optimized out>) at prov/rxm/src/rxm_conn.c:795
#13 0x00007f90e797c910 in rxm_conn_progress (ep=ep@entry=0x7f8cd4055ff0) at prov/rxm/src/rxm_conn.c:838
#14 0x00007f90e797c9d5 in rxm_get_conn (ep=ep@entry=0x7f8cd4055ff0, addr=addr@entry=34972, conn=conn@entry=0x56deb38) at prov/rxm/src/rxm_conn.c:475
#15 0x00007f90e79830bd in rxm_tsend (ep_fid=0x7f8cd4055ff0, buf=<optimized out>, len=<optimized out>, desc=<optimized out>, dest_addr=34972, tag=13380, context=0x7e9ffece0ca8) at prov/rxm/src/rxm_tagged.c:290
#16 0x00007f90ebd2a19c in na_ofi_progress () from /lib64/libna.so.2
#17 0x00007f90ebd21123 in NA_Progress () from /lib64/libna.so.2
#18 0x00007f90ec15f99e in hg_core_progress_na () from /lib64/libmercury.so.2
#19 0x00007f90ec161f73 in hg_core_progress () from /lib64/libmercury.so.2
#20 0x00007f90ec16707b in HG_Core_progress () from /lib64/libmercury.so.2
#21 0x00007f90ec1591b3 in HG_Progress () from /lib64/libmercury.so.2
#22 0x00007f90eebb2686 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f8cd4049148, timeout=timeout@entry=0) at src/cart/crt_hg.c:1400
#23 0x00007f90eeb78077 in crt_progress (crt_ctx=0x7f8cd4049130, timeout=0) at src/cart/crt_context.c:1447
#24 0x000000000043c5c7 in dss_srv_handler (arg=0x55edce0) at src/engine/srv.c:519
#25 0x00007f90eddfeece in ABTD_ythread_func_wrapper (p_arg=0x56df360) at arch/abtd_ythread.c:21
#26 0x00007f90eddff071 in make_fcontext () from /lib64/libabt.so.1
#27 0x0000000000000000 in ?? ()

Is the stack trace from the client or the server?
Looking at frame #12, rxm_handle_error at rxm_conn.c:795, we got an ECONNREFUSED error, so we have to look at what's happening on the other side. One thing I didn't ask about is your "ulimit -a" setting and the memory situation when things start to degrade. Make sure you enable the mlx5_cmd_check print (on the other side) and look for any CREATE_QP error in dmesg or a lowercase create_qp in /var/log/messages.

The stack is from the server side.
I will try to gather this other information soon.

Sorry for the delay.
Here is the new information you requested.

Ulimit from the server side (the same values are displayed for each DAOS server/engine):

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1029101              1029101              processes 
Max open files            1048576              1048576              files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1029101              1029101              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
[root@nvm0803 ~]# 

Ulimit from the client side (I have checked that, as expected, all clients run with the same environment) always looks like this for each process/task:

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          257477836800         257477836800         bytes     
Max processes             1029957              1029957              processes 
Max open files            512000               512000               files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1029957              1029957              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

But there are no "create_qp" messages in dmesg nor in /var/log/messages, on either the server or the client side.
The mlx5_cmd_check debug print has never triggered, on either the server or the client side.
Meanwhile, the same stacks can still be seen on the server side, along with huge numbers of rxm_handle_error()/rxm_process_reject() messages (which even keep coming forever after all the clients have exited!!), but none of this shows up on the client side!!
Memory consumption has been monitored and does not spike when the problem occurs, on either the client or the server side.

What you said at the end is interesting. If the server is trying to send messages to clients after(?) all clients have exited, then of course the server would get the reject. What is the server side trying to do? What was the last successful operation and the first failure on the client side before it exited?

If the server is trying to send messages to clients after(?) all clients have exited, then of course the server would get the reject.

But the reject messages had already been looping for a long time while the client tasks still had a long while left to run.

What is the server side trying to do?

Well, the stack should tell you.

What was the last successful operation and the first failure on the client side before it exited?

When I say the clients have exited, I mean a graceful exit, not one due to some error/exception. The problem here is a performance issue caused by the overhead of the unexplained behaviour described above.

I'm standing in for Bruno who is on vacation.

The cluster on which this is reproduced/debugged has meanwhile been upgraded to MLNX_OFED_LINUX-5.6-2.0.9.0 and daos-2.1.104, which includes libfabric-1.15.1-2.el8.x86_64. No change in behavior.

Can we work on an action plan for how to further debug this performance issue? This is severely limiting the scalability of the DAOS storage solution in terms of the number of client tasks (by at least 10x), and we need to get to the root cause.

Thanks @chien-intel for the constructive call. Action plan:

  • run the mdtest-easy-write/stat/delete tests in isolation, outside of IO500 (which runs all 12 tests), to avoid connection tear-down and re-establishment between the IO500 phases
  • monitor "rdma resource" from before the job starts (to capture the idle state) until at least 10 min after the job finishes (it must run longer than the mlx timeout of 7 min)
  • run the benchmark with the tcp provider and compare scaling to RDMA/verbs (see the provider-selection sketch after this list)
  • @chien-intel to send max-qp testing script to validate that mlx5_cmd_check works
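
For the tcp vs verbs comparison point, here is a sketch of how a small standalone test could pin the libfabric provider explicitly. DAOS/mercury select the provider through their own configuration (e.g. the FI_PROVIDER environment variable); this only illustrates the underlying fi_getinfo() call, and the endpoint type and capability bits below are illustrative assumptions, not what DAOS actually requests:

/* Sketch only: request either "tcp;ofi_rxm" or "verbs;ofi_rxm" explicitly. */
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <stdio.h>
#include <string.h>

static struct fi_info *get_prov(const char *prov_name)
{
	struct fi_info *hints = fi_allocinfo(), *info = NULL;
	if (!hints)
		return NULL;

	hints->ep_attr->type = FI_EP_RDM;      /* rxm-style reliable datagram endpoint */
	hints->caps = FI_TAGGED | FI_RMA;      /* illustrative capability set */
	hints->fabric_attr->prov_name = strdup(prov_name);

	int ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
	fi_freeinfo(hints);
	if (ret) {
		fprintf(stderr, "fi_getinfo(%s): %s\n", prov_name, fi_strerror(-ret));
		return NULL;
	}
	return info;   /* caller releases it with fi_freeinfo(info) */
}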

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.

Remove stale label or comment, otherwise it will be closed in 7 days.

Well, how can I do that??

You just did by adding a comment.

Have not observed this issue with the latest version of DAOS, closing.