ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/


DAOS file system: link cannot recover (timeout) after unplugging and replugging the network cable

xqcool opened this issue · comments

Describe the bug
After unplugging the network cable of a DAOS engine port, waiting for one minute, and plugging the cable back in, the network remains timed out or unavailable.

To Reproduce
1/ Start daos_server
2/ dmg storage format
3/ Unplug the network cable of an engine
4/ wait for 1 minute
5/ plug in the network cable
6/ mercury timeout or transport layer error

Expected behavior
The network recovers after the cable is plugged back in.


Environment:
CentOS 7.9

Additional context
log

DAOS: verbs;rxm

When running over verbs, please be specific about the type of network controller used in the test case. This is relevant especially with link events: the device driver for the network controller handles that situation and may not behave the same as the drivers for other verbs-capable devices.
If possible, provide modinfo output for the device driver or the InfiniBand software installed in the system (from CentOS 7.0 or MOFED, for example).


mlx5

Using the 'ofi+sockets' network type, the program works fine.

daos version 2.02


log

1 minute is long enough for a Verbs op to time out. Does DAOS/Mercury retry after timeout?


It is always in the timeout state; only the dping message is retried, and it also times out. Is a special retry (similar to TCP reconnection) needed?

But there is no problem with 'ofi+sockets'.

I need "modinfo mlx5" not ibstat. That info is necessary so I can look at the correct version of the source code to help you. Without correct and detailed issue description it is difficult to narrow down the problem.

I need "modinfo mlx5" not ibstat. That info is necessary so I can look at the correct version of the source code to help you. Without correct and detailed issue description it is difficult to narrow down the problem.

Hi, I checked the version information: DAOS version v2.0 (c45320fb33),
Mercury version v2.1.0rc4 (43abc30462),
libfabric version v1.14.0 (119622b)

> 1 minute is long enough for a Verbs op to time out. Does DAOS/Mercury retry after timeout?

Unplugging the network cable produces a send error, and after the cable is plugged back in it keeps reporting timeouts. There should be a retry, but sending remains abnormal.

> But there is no problem with 'ofi+sockets'

Actually the behavior is the same if you wait long enough; disconnect the cable for 16 minutes and the same thing should happen.
You are hitting the retransmission timeout. With verbs and mlx5, the retry_cnt attribute on the QP (connection) is set to 7 (the maximum and default value), which according to the table on this webpage (https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html) corresponds to 25.4 seconds. So vary the time to cable reconnection and see if this holds.
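For reference, a minimal sketch (plain libibverbs, not DAOS/libfabric internals; the values are illustrative, not necessarily what the verbs provider sets) of where retry_cnt lives: it is configured on the QP when the connection is moved to RTS, and once the retries are exhausted the send completes with IBV_WC_RETRY_EXC_ERR.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: configure retransmission behavior when moving a connected RC
 * QP to RTS. retry_cnt = 7 is the maximum; once retries are exhausted
 * the WR completes with IBV_WC_RETRY_EXC_ERR. The total time before
 * giving up is roughly retry_cnt * (4.096 us * 2^timeout). */
static int move_qp_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14, /* local ACK timeout: 4.096 us * 2^14 ~ 67 ms */
        .retry_cnt     = 7,  /* max and default: 7 retries on ACK timeout */
        .rnr_retry     = 7,  /* 7 means "retry indefinitely" on RNR NAK */
        .sq_psn        = sq_psn,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```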

Did you get the same behavior with "ofi+sockets" after waiting 16 minutes?


I changed the number of retries to match TCP. With the same procedure, sockets can recover within 16 minutes, but verbs;rxm cannot.

> Did you get the same behavior with "ofi+sockets" after waiting 16 minutes?

Using ofi+sockets mode, the service returns to normal more than 20 minutes after the network cable is plugged back in.

Is there any progress on this issue? I'm also facing the same problem.

@wangzhuo1016 Just to confirm: did you test with the latest libfabric + Mercury (or the latest daos master branch code)?

@wangzhuo1016 Also, did you try with ofi+tcp;ofi_rxm?

> Just to confirm: did you test with the latest libfabric + Mercury (or the latest daos master branch code)?

DAOS v2.0 with libfabric v1.15.1 + Mercury 2.2.0-3, RoCE v2, provider verbs;rxm: DAOS can't recover to normal.

@wangzhuo1016 For the OFI-level issue, we may need @chien-intel to confirm whether it can be refined.

On the other hand, at the DAOS level, you may test whether you can work around it with a few extra steps. For your steps:
1/ Start daos_server
2/ dmg storage format
3/ Unplug the network cable of an engine
4/ wait for 1 minute
5/ plug in the network cable

add these steps after step 5:
"./dmg -o ./daos_control.yml system query -v" should show that the server rank has been excluded after SWIM detects the broken network and evicts it.
Stop and restart that server rank with commands like:
./dmg -o ./daos_control.yml -d system stop --force --ranks=
./dmg -o ./daos_control.yml -d system start --ranks=
and then reintegrate it back into the system:
./dmg -o ./daos_control.yml -d pool reintegrate --pool= --rank=


@xqcool I have tested your "To Reproduce" scenario for the following providers and time periods of unplugging the network cable:

  • ofi+verbs;ofi_rxm - at least 30 seconds
  • ofi+tcp;ofi_rxm - at least 30 minutes
  • ofi+sockets - at least 30 minutes

The behavior of all providers is consistent: when the retransmission timeout (different for each provider) has passed, the Mercury timeout or transport-layer error occurs. The only difference is the time to error, which we do not control at the libfabric level. In the case of verbs it is only about 30 seconds (exactly 25.4 seconds), while in the case of tcp and sockets it is much longer; after 30 minutes the same thing happens.

I have tested it with the DAOS v2.2.0 built from source (commit d2a1f2790c946659c9398926254e6203fd957b7c).


RoCE v2, provider ofi+verbs;ofi_rxm: when the network recovers from the fault (ifdown/ifup of the network interface), I call ibv_query_qp to get the RDMA QP's state and find that the state is IBV_QPS_ERR, but the libfabric level does nothing to stop this state from spreading. So after the network recovers from the fault, the QP is still in an error state, and that causes message timeouts at the Mercury and DAOS CaRT levels. @chien-intel
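For anyone who wants to reproduce that check, here is a minimal sketch (plain libibverbs; the QP handle is assumed to come from whatever owns the connection) of querying the QP state:

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: check whether a QP has transitioned to the error state.
 * Once a QP reaches IBV_QPS_ERR it stays there until it is explicitly
 * reset or destroyed; it does not recover by itself when the link
 * comes back up. */
static int qp_is_in_error(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (ibv_query_qp(qp, &attr, IBV_QP_STATE, &init_attr))
        return -1; /* the query itself failed */

    if (attr.qp_state == IBV_QPS_ERR) {
        fprintf(stderr, "QP 0x%x is in IBV_QPS_ERR\n", qp->qp_num);
        return 1;
    }
    return 0;
}
```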

I had to get @ldorau to help me recreate the issue with all three providers to get the complete picture, not just verbs. We are discussing the next step for this issue.

BTW, doing an ifdown/ifup on an interface is different from physically pulling the cable. To a certain point, it is also different from how we are reproducing the issue with the mlxlink utility (we don't have the luxury of being physically located with our DUT). So exactly what are you doing? Please be precise.


1/ Start daos_server
2/ dmg storage format
3/ ifdown/ifup an interface of rank A (can be either client or server)
4/ mercury timeout or transport layer error
Actually, after the network recovers from the fault, communication between client and server remains abnormal.
e.g. rank A sends a message to other normal ranks using rxm_tsend; rxm_tsend does not return an error, but rxm_ep_do_progress gets an error work completion, and wc.status is 5 (IBV_WC_WR_FLUSH_ERR) or 12 (IBV_WC_RETRY_EXC_ERR).
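A minimal sketch (plain libibverbs, not the rxm code itself) of how those errors surface: the send is posted successfully, and the failure only shows up later as an error completion on the CQ.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch: drain a CQ and report error completions. 5 and 12 are the
 * numeric values of IBV_WC_WR_FLUSH_ERR (WR flushed because the QP is
 * already in error) and IBV_WC_RETRY_EXC_ERR (transport retry_cnt
 * exhausted) in enum ibv_wc_status. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS)
            continue;
        fprintf(stderr, "wr_id %llu failed: %s (%d)\n",
                (unsigned long long)wc.wr_id,
                ibv_wc_status_str(wc.status), wc.status);
    }
}
```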


Hi, if you don't have spare devices to test these issues (ifdown/ifup of the interface and plugging/unplugging the network cable), I can reproduce the issue and provide you with the related information and logs. @chien-intel

> The behavior of all providers is consistent: when the retransmission timeout (different for each provider) has passed, the Mercury timeout or transport-layer error occurs.

@xqcool I was wrong. I decreased the retransmission timeout for TCP to 25.4 seconds, and the tcp provider sometimes manages to automatically re-establish the connection after the network cable is plugged back in.


I'm sorry for the delay. I tested with the steps you requested, but communication between a normal rank (as client) and the restarted rank (as server) is still broken.
When the server rank is restarted and has rejoined the DAOS cluster, it seems that the other normal rank still sends RPCs to the restarted rank over the old rxm connection, but that rxm connection is unavailable. @liuxuezhao

@wangzhuo1016 In your test, did you confirm that "./dmg -o ./daos_control.yml system query -v" showed the server rank as excluded before you restarted it?
Regarding "it seems that the other normal rank still sends RPCs to the restarted rank over the old rxm connection": what you observed is an RPC timeout? Is the timed-out RPC between client and server, or between two server ranks? And is the DAOS version you used the latest 2.0 branch?


  1. The rank is excluded before I restart the server rank;
  2. The RPC times out between two server ranks;
  3. The DAOS version is release 2.0. I also tested DAOS release 2.0 with Mercury 2.2.0-3 and libfabric 1.15.1, and the same thing happens. @liuxuezhao

@wangzhuo1016 Probably the code you used has some differences; it should work if you do exclude + restart + reint. Let me explain a little about how it works in the DAOS code path: 1) SWIM detects the server down, and the pool service handles the RAS event "CRT_EVT_DEAD": handle_event() -> pool_svc_exclude_rank() -> ds_rsvc_request_map_dist(); a daemon ULT called map_distd then invokes pool_svc_map_dist_cb to send the latest pool map to all members, and each DAOS server does ds_pool_tgt_map_update() -> update_pool_group() -> crt_group_secondary_modify(), which evicts that DEAD rank's info from the group. 2) On reintegrate, the new rank is added by a similar code path: pool_svc_update_map() -> ds_rsvc_request_map_dist(), and then each server runs ds_pool_tgt_map_update() -> update_pool_group() -> crt_group_secondary_modify() to add the rank back to the group.
So after "exclude + restart + reint", new RPC communication will establish a new connection after the restart; you may check whether this code path works or is broken in your code base. A toy sketch of the flow follows below.
For the DAOS-level problem we can discuss by email; let's keep this thread focused on the OFI level. Thanks.
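The toy sketch mentioned above, with hypothetical, heavily simplified stubs (these are not the real DAOS prototypes; only the call order mirrors the explanation):

```c
#include <stdio.h>

/* Hypothetical stubs: the real DAOS signatures differ; only the call
 * order follows the explanation above. */
static void crt_group_secondary_modify(int rank, int add)
{
    printf("%s rank %d in the secondary group\n", add ? "add" : "evict", rank);
}

static void update_pool_group(int rank, int add)
{
    crt_group_secondary_modify(rank, add);
}

static void ds_pool_tgt_map_update(int rank, int add)
{
    update_pool_group(rank, add); /* runs on each server on map update */
}

static void ds_rsvc_request_map_dist(int rank, int add)
{
    /* wakes the map_distd ULT, which invokes pool_svc_map_dist_cb to
     * push the latest pool map to all members */
    ds_pool_tgt_map_update(rank, add);
}

static void pool_svc_exclude_rank(int rank)
{
    ds_rsvc_request_map_dist(rank, 0);
}

static void handle_event(int dead_rank) /* RAS event CRT_EVT_DEAD */
{
    pool_svc_exclude_rank(dead_rank);
}

static void pool_svc_update_map(int rank) /* reintegrate path */
{
    ds_rsvc_request_map_dist(rank, 1);
}

int main(void)
{
    handle_event(3);        /* 1) SWIM declares rank 3 dead -> exclude */
    pool_svc_update_map(3); /* 2) reintegrate adds rank 3 back        */
    return 0;
}
```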

I'm looking at this issue. I do see an issue with the verbs provider, but I really need you to confirm the TCP behavior.
Please run the test case with the TCP provider after changing the IPv4 tcp_retries2 value to 6.
You can change the value by running "sysctl -w net.ipv4.tcp_retries2=6" as root and verify the new value with "sysctl -n net.ipv4.tcp_retries2".
It would be helpful to have a simpler test case without all the DAOS services. In issue 8314, someone suggested hg_example_rpc_*, but you can't switch providers easily with that one. I've looked at hg_test_bw, but so far I don't think it behaves the same. Perhaps I need to use CaRT-level tests?
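For context on why tcp_retries2=6 lines up with the verbs figure: assuming the typical Linux minimum RTO of 200 ms (a back-of-the-envelope sketch; the real RTO is derived from the measured RTT and capped at 120 s), the RTO doubles on each retransmission, so tcp_retries2=6 gives 0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 = 25.4 seconds, the same number quoted earlier for verbs retry_cnt=7.

```c
#include <stdio.h>

/* Back-of-the-envelope: cumulative time before TCP declares the
 * connection dead, assuming a 200 ms initial RTO that doubles on each
 * retransmission (capped at TCP_RTO_MAX = 120 s). */
int main(void)
{
    double rto = 0.2, total = 0.0;

    for (int retries2 = 0; retries2 <= 15; retries2++) {
        total += rto;
        printf("tcp_retries2=%2d -> ~%6.1f s\n", retries2, total);
        rto = (rto * 2 > 120.0) ? 120.0 : rto * 2;
    }
    return 0;
}
```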


I'm sorry for the delayed reply. With these steps (both ifdown/ifup of the interface and unplugging/replugging the network cable), CaRT communication between DAOS ranks can recover.
Environment:
Linux CentOS 7, DAOS v2.0 release, libfabric v1.14.0 release, Mercury 2.1.0rc, provider: ofi+sockets

ifdown/ifup

  1. sysctl -w net.ipv4.tcp_retries2=6
  2. dmg sto format
  3. ifdown the network interface of a daos_server rank;
  4. after 1 min, ifup the interface

unplug/plug the network cable

  1. sysctl -w net.ipv4.tcp_retries2=6
  2. dmg sto format
  3. unplug the network cable of a daos_server rank;
  4. after 1 min, plug the network cable back in

@chien-intel

With PR 8858, when the verbs connection exceeds the retry timeout it will be closed, and on subsequent use of the RXM connection a new verbs QP will be created.
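A minimal sketch of that recover-on-error pattern (hypothetical names, not the actual PR 8858 code): on a fatal completion the broken QP is torn down, and the next use of the RXM connection triggers a fresh connect.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Hypothetical sketch, not the PR 8858 implementation: once retries
 * are exhausted the QP can never recover, so drop it and let the next
 * send re-establish the connection. */
struct conn_sketch {
    struct ibv_qp *qp;
    int connected;
};

static void on_error_completion(struct conn_sketch *conn, struct ibv_wc *wc)
{
    if (wc->status == IBV_WC_RETRY_EXC_ERR) {
        ibv_destroy_qp(conn->qp); /* close the dead verbs QP */
        conn->qp = NULL;
        conn->connected = 0;      /* next send creates a new QP */
    }
}
```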