[Bug] Abnormal disconnection caused by asynchronous Ping message
elderhammer opened this issue · comments
Steps to Reproduce
1.Receive a Pong message from the peer and create an asynchronous task to send a Ping message to the peer after PING_SLEEP_IN_SECS seconds.
2024-06-25T07:25:55.802800Z TRACE snarkos_node_router::inbound: Received 'Pong' from '127.0.0.1:4132'
snarkOS/node/src/client/router.rs
Lines 248 to 266 in 09aa62b
2.Periodic refresh of peers.
2024-06-25T07:26:07.870256Z INFO snarkos_node_router::heartbeat: Disconnecting from '127.0.0.1:4132' (periodic refresh of peers)
2024-06-25T07:26:07.870268Z TRACE snarkos_node_router::outbound: Sending 'Disconnect' to '127.0.0.1:4132'
3.Receive a connection request from a peer
2024-06-25T07:26:15.775172Z DEBUG tcp{name="0"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:51296
4.Complete the handshake, and add the peer to the router.
2024-06-25T07:26:15.775550Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:51296 as the Responder
2024-06-25T07:26:15.775572Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:51296'
2024-06-25T07:26:15.789750Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:51296'
2024-06-25T07:26:15.790940Z TRACE snarkos_node_router::handshake: Sending 'ChallengeResponse' to '127.0.0.1:51296'
2024-06-25T07:26:15.791066Z TRACE snarkos_node_router::handshake: Sending 'ChallengeRequest' to '127.0.0.1:51296'
2024-06-25T07:26:15.801582Z TRACE snarkos_node_router::handshake: Received 'ChallengeResponse' from '127.0.0.1:51296'
2024-06-25T07:26:15.804149Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: successfully handshaken with 127.0.0.1:51296
snarkOS/node/router/src/handshake.rs
Lines 275 to 276 in 09aa62b
5.Create asynchronous tasks for processing, reading, and writing
2024-06-25T07:26:15.804489Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for processing messages from 127.0.0.1:51296
6.Asynchronous task sleep ends, send Ping message
2024-06-25T07:26:15.804961Z TRACE snarkos_node_router::outbound: Sending 'Ping' to '127.0.0.1:4132'
7.Because the peer has been added to the router, can_send returns true and the Ping message is sent. However, because the writing asynchronous task has not been created, the sending fails.
2024-06-25T07:26:15.804980Z WARN snarkos_node_router::outbound: Failed to send 'Ping' to '127.0.0.1:4132': not connected
2024-06-25T07:26:15.804988Z DEBUG snarkos_node_router::outbound: Disconnecting from '127.0.0.1:4132' (unable to send)
2024-06-25T07:26:15.805060Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 127.0.0.1:51296
2024-06-25T07:26:15.805072Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
snarkOS/node/router/src/outbound.rs
Lines 48 to 59 in 09aa62b
8.The reading and writing asynchronous tasks are created later.
2024-06-25T07:26:15.805089Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for reading messages from 127.0.0.1:51296
2024-06-25T07:26:15.805254Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::writing: spawned a task for writing messages to 127.0.0.1:51296
This will cause two problems:
- Asynchronous tasks such as processing, reading, and writing are not terminated correctly
- Because the peer believes that the connection is complete, it will reject subsequent connections from the current node
2024-06-25T07:26:32.872315Z DEBUG tcp{name="1"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:57797
2024-06-25T07:26:32.872716Z DEBUG tcp{name="1"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:57797 as the Responder
2024-06-25T07:26:32.872755Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:57797'
2024-06-25T07:26:32.873158Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:57797'
2024-06-25T07:26:32.873211Z ERROR tcp{name="1"}: snarkos_node_tcp::protocols::handshake: handshake with 127.0.0.1:57797 failed: Dropping connection request from '127.0.0.1:4202' (already connected)
2024-06-25T07:26:32.873252Z ERROR tcp{name="1"}: snarkos_node_tcp::tcp: Failed to connect with 127.0.0.1:57797: Dropping connection request from '127.0.0.1:4202' (already connected)
Expected Behavior
This shouldn't happen
Your Environment
snarkOS Version: 09aa62b
Very cool find; I think I know how to tackle this.
Ah yes, it seems that #2690 will need to be reopened; a workaround where we would abort ping tasks on disconnect would also work, but the real solution is to delay a peer's promotion to the connected
status.
On testnet beta, this issue occurs frequently, caused by the UnconfirmedTransaction and UnconfirmedSolution messages.
2024-07-04T10:41:42.069597Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '89.116.29.150:4130': not connected
2024-07-04T10:41:42.069659Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 89.116.29.150:58433
2024-07-04T10:41:42.069667Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-04T13:19:33.127551Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '211.171.42.223:4130': not connected
2024-07-04T13:19:33.127618Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 211.171.42.223:55251
2024-07-04T13:19:33.127629Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-04T19:04:50.829665Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '157.66.54.66:4130': not connected
2024-07-04T19:04:50.829711Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 157.66.54.66:56847
2024-07-04T19:04:50.829719Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-04T23:09:06.395996Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '118.140.26.162:4130': not connected
2024-07-04T23:09:06.396038Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 118.140.26.162:47970
2024-07-04T23:09:06.396046Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:29:44.276617Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '104.237.2.224:4130': not connected
2024-07-05T01:29:44.276679Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 104.237.2.224:4130
2024-07-05T01:29:44.276691Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:36:28.543874Z WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '34.17.53.129:4130': not connected
2024-07-05T01:36:28.544005Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 34.17.53.129:4130
2024-07-05T01:36:28.544012Z WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:36:32.919219Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:32.919232Z WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:37.919752Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:37.919782Z WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:42.920965Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:42.920990Z WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:47.921713Z WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:47.921728Z WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists