ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/


DAOS file system: link cannot recover (timeout) after unplugging and replugging the network cable

xqcool opened this issue · comments

Describe the bug
After unplugging the network cable of a DAOS engine port, waiting for one minute, and plugging the cable back in, the network remains timed out or unavailable.

To Reproduce
1/ Start daos_server
2/ dmg storage format
3/ Unplug the network cable of an engine
4/ wait for 1 minute
5/ plug in the network cable
6/ mercury timeout or transport layer error

Expected behavior
The network recovers after the cable is plugged back in.


Environment:
CentOS 7.9

Additional context
log

DAOS: verbs;rxm

When running over verbs, please be specific about the type of network controller used in the test case. This is relevant especially with link events: the device driver for the network controller handles that situation and may not behave the same as the drivers for other verbs-capable devices.
If possible, provide modinfo output for the device driver or the InfiniBand software installed in the system (from CentOS 7.0 or MOFED, for example).


mlx5

Using the 'ofi+sockets' network type, the program works fine.

daos version 2.02


log

1 minute is long enough for a Verbs op to time out. Does DAOS/Mercury retry after timeout?


It is always in the timeout state; only the dping message is retried, and it also times out. Is a special retry (similar to TCP reconnection) needed?

But there is no problem with 'ofi+sockets'.

I need "modinfo mlx5" not ibstat. That info is necessary so I can look at the correct version of the source code to help you. Without correct and detailed issue description it is difficult to narrow down the problem.

I need "modinfo mlx5" not ibstat. That info is necessary so I can look at the correct version of the source code to help you. Without correct and detailed issue description it is difficult to narrow down the problem.

Hi, I checked the version information: DAOS version v2.0 (c45320fb33),
Mercury version v2.1.0rc4 (43abc30462),
libfabric version v1.14.0 (119622b)

> 1 minute is long enough for a Verbs op to time out. Does DAOS/Mercury retry after timeout?

Unplugging the network cable produces a send error, and after the cable is plugged back in it keeps reporting timeouts. There should be a retry, but sending remains abnormal.

> But there is no problem with 'ofi+sockets'

Actually the behavior is the same if you wait long enough; disconnect the cable for 16 minutes and the same thing should happen.
You are hitting the retransmission timeout. With verbs and mlx5, the retry_cnt attribute on the QP (connection) is set to 7 (the maximum and default value), which according to the table on this webpage (https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html) corresponds to 25.4 seconds. So vary the time to cable reconnection and see if this holds.
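For reference, a minimal sketch (plain libibverbs, not DAOS/libfabric internals; the values are illustrative, not necessarily what the verbs provider sets) of where retry_cnt lives: it is configured on the QP when the connection is moved to RTS, and once the retries are exhausted the send completes with IBV_WC_RETRY_EXC_ERR.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: configure retransmission behavior when moving a connected RC
 * QP to RTS. retry_cnt = 7 is the maximum; once retries are exhausted
 * the WR completes with IBV_WC_RETRY_EXC_ERR. The total time before
 * giving up is roughly retry_cnt * (4.096 us * 2^timeout). */
static int move_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14, /* local ACK timeout: 4.096 us * 2^14 ~ 67 ms */
        .retry_cnt     = 7,  /* max and default: 7 retries on ACK timeout */
        .rnr_retry     = 7,  /* 7 means "retry indefinitely" on RNR NAK */
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```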

Did you get the same behavior with "ofi+sockets" after waiting 16 minutes?


I changed the number of retries to match TCP. With the same procedure, sockets can recover within 16 minutes, but verbs;rxm cannot.

> Did you get the same behavior with "ofi+sockets" after waiting 16 minutes?

Using ofi+sockets mode, the service returns to normal more than 20 minutes after the network cable is plugged back in.

Is there any progress on this issue? I'm also facing the same problem.

@wangzhuo1016 Just to confirm: did you test with the latest libfabric + Mercury (or the latest daos master branch code)?

@wangzhuo1016 Also, did you try with ofi+tcp;ofi_rxm?

> Just to confirm: did you test with the latest libfabric + Mercury (or the latest daos master branch code)?

DAOS v2.0 with libfabric v1.15.1 + Mercury 2.2.0-3, RoCE v2, provider verbs;rxm: DAOS can't recover to normal.

@wangzhuo1016 For the OFI-level issue, we may need @chien-intel to confirm whether it can be refined.

On the other hand, at the DAOS level, you may test whether you can work around it with a few extra steps. For your steps:
1/ Start daos_server
2/ dmg storage format
3/ Unplug the network cable of an engine
4/ wait for 1 minute
5/ plug in the network cable

add these steps after step 5:
"./dmg -o ./daos_control.yml system query -v" should show that the server rank has been excluded after SWIM detects the broken network and evicts it.
Stop and restart that server rank with commands like:
./dmg -o ./daos_control.yml -d system stop --force --ranks=
./dmg -o ./daos_control.yml -d system start --ranks=
and then reintegrate it back into the system:
./dmg -o ./daos_control.yml -d pool reintegrate --pool= --rank=


@xqcool I have tested your "To Reproduce" scenario for the following providers and time periods of unplugging the network cable:

  • ofi+verbs;ofi_rxm - at least 30 seconds
  • ofi+tcp;ofi_rxm - at least 30 minutes
  • ofi+sockets - at least 30 minutes

The behavior of all providers is consistent: when the retransmission timeout (different for each provider) has passed, the Mercury timeout or transport-layer error occurs. The only difference is the time to error, which we do not control at the libfabric level. In the case of verbs it is only about 30 seconds (exactly 25.4 seconds), while in the case of tcp and sockets it is much longer; after 30 minutes the same thing happens.

I have tested it with the DAOS v2.2.0 built from source (commit d2a1f2790c946659c9398926254e6203fd957b7c).


RoCE v2, provider ofi+verbs;ofi_rxm: when the network recovers from the fault (ifdown/ifup of the network interface), I call ibv_query_qp to get the RDMA QP's state and find that the state is IBV_QPS_ERR, but the libfabric level does nothing to stop this state from spreading. So after the network recovers from the fault, the QP is still in an error state, and that causes message timeouts at the Mercury and DAOS CaRT levels. @chien-intel
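For anyone who wants to reproduce that check, here is a minimal sketch (plain libibverbs; the QP handle is assumed to come from whatever owns the connection) of querying the QP state:

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: check whether a QP has transitioned to the error state.
 * Once a QP reaches IBV_QPS_ERR it stays there until it is explicitly
 * reset or destroyed; it does not recover by itself when the link
 * comes back up. */
static int qp_is_in_error(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE, &init_attr))
        return -1; /* the query itself failed */

    if (attr.qp_state == IBV_QPS_ERR) {
        fprintf(stderr, "QP 0x%x is in IBV_QPS_ERR\n", qp->qp_num);
        return 1;
    }
    return 0;
}
```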

I had to get @ldorau to help me recreate the issue with all three providers to get the complete picture, not just verbs. We are discussing the next step for this issue.

BTW, doing an ifdown/ifup on an interface is different from physically pulling the cable. To a certain point, it is also different from how we are reproducing the issue with the mlxlink utility (we don't have the luxury of being physically located with our DUT). So exactly what are you doing? Please be precise.


1/ Start daos_server
2/ dmg storage format
3/ ifdown/ifup an interface of rank A (can be either client or server)
4/ mercury timeout or transport layer error
Actually, after the network recovers from the fault, communication between client and server remains abnormal.
e.g. rank A sends a message to other normal ranks using rxm_tsend; rxm_tsend does not return an error, but rxm_ep_do_progress gets an error work completion, and wc.status is 5 (IBV_WC_WR_FLUSH_ERR) or 12 (IBV_WC_RETRY_EXC_ERR).
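A minimal sketch (plain libibverbs, not the rxm code itself) of how those errors surface: the send is posted successfully, and the failure only shows up later as an error completion on the CQ.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: drain a CQ and report error completions. 5 and 12 are the
 * numeric values of IBV_WC_WR_FLUSH_ERR (WR flushed because the QP is
 * already in error) and IBV_WC_RETRY_EXC_ERR (transport retry_cnt
 * exhausted) in enum ibv_wc_status. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS)
            continue;
        fprintf(stderr, "wr_id %llu failed: %s (%d)\n",
                (unsigned long long)wc.wr_id,
                ibv_wc_status_str(wc.status), wc.status);
    }
}
```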


Hi, if you don't have spare devices to test these issues (ifdown/ifup of the interface and plugging/unplugging the network cable), I can reproduce the issue and provide you with the related information and logs. @chien-intel

> The behavior of all providers is consistent: when the retransmission timeout (different for each provider) has passed, the Mercury timeout or transport-layer error occurs.

@xqcool I was wrong. I decreased the retransmission timeout for TCP to 25.4 seconds, and the tcp provider sometimes manages to automatically re-establish the connection after the network cable is plugged back in.


I'm sorry for the delay. I tested with the steps you requested, but communication between a normal rank (as client) and the restarted rank (as server) is still broken.
When the server rank is restarted and has rejoined the DAOS cluster, it seems that the other normal rank still sends RPCs to the restarted rank over the old rxm connection, but that rxm connection is unavailable. @liuxuezhao

@wangzhuo1016 In your test, did you confirm that "./dmg -o ./daos_control.yml system query -v" showed the server rank as excluded before you restarted it?
Regarding "it seems that the other normal rank still sends RPCs to the restarted rank over the old rxm connection": what you observed is an RPC timeout? Is the timed-out RPC between client and server, or between two server ranks? And is the DAOS version you used the latest 2.0 branch?


  1. The rank is excluded before I restart the server rank;
  2. The RPC times out between two server ranks;
  3. The DAOS version is release 2.0. I also tested DAOS release 2.0 with Mercury 2.2.0-3 and libfabric 1.15.1, and the same thing happens. @liuxuezhao

@wangzhuo1016 Probably the code you used has some differences; it should work if you do exclude + restart + reint. Let me explain a little about how it works in the DAOS code path: 1) SWIM detects the server down, and the pool service handles the RAS event "CRT_EVT_DEAD": handle_event() -> pool_svc_exclude_rank() -> ds_rsvc_request_map_dist(); a daemon ULT called map_distd then invokes pool_svc_map_dist_cb to send the latest pool map to all members, and each DAOS server does ds_pool_tgt_map_update() -> update_pool_group() -> crt_group_secondary_modify(), which evicts that DEAD rank's info from the group. 2) On reintegrate, the new rank is added by a similar code path: pool_svc_update_map() -> ds_rsvc_request_map_dist(), and then each server runs ds_pool_tgt_map_update() -> update_pool_group() -> crt_group_secondary_modify() to add the rank back to the group.
So after "exclude + restart + reint", new RPC communication will establish a new connection after the restart; you may check whether this code path works or is broken in your code base. A toy sketch of the flow follows below.
For the DAOS-level problem we can discuss by email; let's keep this thread focused on the OFI level. Thanks.
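The toy sketch mentioned above, with hypothetical, heavily simplified stubs (these are not the real DAOS prototypes; only the call order mirrors the explanation):

```c
#include <stdio.h>

/* Hypothetical stubs: the real DAOS signatures differ; only the call
 * order follows the explanation above. */
static void crt_group_secondary_modify(int rank, int add)
{
    printf("%s rank %d in the secondary group\n", add ? "add" : "evict", rank);
}

static void update_pool_group(int rank, int add)
{
    crt_group_secondary_modify(rank, add);
}

static void ds_pool_tgt_map_update(int rank, int add)
{
    update_pool_group(rank, add); /* runs on each server on map update */
}

static void ds_rsvc_request_map_dist(int rank, int add)
{
    /* wakes the map_distd ULT, which invokes pool_svc_map_dist_cb to
     * push the latest pool map to all members */
    ds_pool_tgt_map_update(rank, add);
}

static void pool_svc_exclude_rank(int rank)
{
    ds_rsvc_request_map_dist(rank, 0);
}

static void handle_event(int dead_rank) /* RAS event CRT_EVT_DEAD */
{
    pool_svc_exclude_rank(dead_rank);
}

static void pool_svc_update_map(int rank) /* reintegrate path */
{
    ds_rsvc_request_map_dist(rank, 1);
}

int main(void)
{
    handle_event(3);        /* 1) SWIM declares rank 3 dead -> exclude */
    pool_svc_update_map(3); /* 2) reintegrate adds rank 3 back        */
    return 0;
}
```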

I'm looking at this issue. I do see an issue with the verbs provider, but I really need you to confirm the TCP behavior.
Please run the test case with the TCP provider after changing the IPv4 tcp_retries2 value to 6.
You can change the value by running "sysctl -w net.ipv4.tcp_retries2=6" as root and verify the new value with "sysctl -n net.ipv4.tcp_retries2".
It would be helpful to have a simpler test case without all the DAOS services. In issue 8314, someone suggested hg_example_rpc_*, but you can't switch providers easily with that one. I've looked at hg_test_bw, but so far I don't think it behaves the same. Perhaps I need to use CaRT-level tests?
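For context on why tcp_retries2=6 lines up with the verbs figure: assuming the typical Linux minimum RTO of 200 ms (a back-of-the-envelope sketch; the real RTO is derived from the measured RTT and capped at 120 s), the RTO doubles on each retransmission, so tcp_retries2=6 gives 0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 = 25.4 seconds, the same number quoted earlier for verbs retry_cnt=7.

```c
#include <stdio.h>

/* Back-of-the-envelope: cumulative time before TCP declares the
 * connection dead, assuming a 200 ms initial RTO that doubles on each
 * retransmission (capped at TCP_RTO_MAX = 120 s). */
int main(void)
{
    double rto = 0.2, total = 0.0;

    for (int retries2 = 0; retries2 <= 15; retries2++) {
        total += rto;
        printf("tcp_retries2=%2d -> ~%6.1f s\n", retries2, total);
        rto = (rto * 2 > 120.0) ? 120.0 : rto * 2;
    }
    return 0;
}
```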


I'm sorry for the delayed reply. With these steps (both ifdown/ifup of the interface and unplugging/replugging the network cable), CaRT communication between DAOS ranks can recover.
Environment:
Linux CentOS 7, DAOS v2.0 release, libfabric v1.14.0 release, Mercury 2.1.0rc, provider: ofi+sockets

ifdown/ifup

  1. sysctl -w net.ipv4.tcp_retries2=6
  2. dmg sto format
  3. ifdown the network interface of a daos_server rank;
  4. after 1 min, ifup the interface

unplug/plug the network cable

  1. sysctl -w net.ipv4.tcp_retries2=6
  2. dmg sto format
  3. unplug the network cable of a daos_server rank;
  4. after 1 min, plug the network cable back in

@chien-intel

With PR 8858, when the verbs connection exceeds the retry timeout it will be closed, and on subsequent use of the RXM connection a new verbs QP will be created.
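A minimal sketch of that recover-on-error pattern (hypothetical names, not the actual PR 8858 code): on a fatal completion the broken QP is torn down, and the next use of the RXM connection triggers a fresh connect.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Hypothetical sketch, not the PR 8858 implementation: once retries
 * are exhausted the QP can never recover, so drop it and let the next
 * send re-establish the connection. */
struct conn_sketch {
    struct ibv_qp *qp;
    int connected;
};

static void on_error_completion(struct conn_sketch *conn, struct ibv_wc *wc)
{
    if (wc->status == IBV_WC_RETRY_EXC_ERR) {
        ibv_destroy_qp(conn->qp); /* close the dead verbs QP */
        conn->qp = NULL;
        conn->connected = 0;      /* next send creates a new QP */
    }
}
```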