AleoNet / snarkOS

A Decentralized Operating System for ZK Applications

Home Page:http://snarkos.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[Bug] Abnormal disconnection caused by asynchronous Ping message

elderhammer opened this issue · comments

Steps to Reproduce

1.Receive a Pong message from the peer and create an asynchronous task to send a Ping message to the peer after PING_SLEEP_IN_SECS seconds.

2024-06-25T07:25:55.802800Z TRACE snarkos_node_router::inbound: Received 'Pong' from '127.0.0.1:4132'

/// Sleeps for a period and then sends a `Ping` message to the peer.
fn pong(&self, peer_ip: SocketAddr, _message: Pong) -> bool {
// Spawn an asynchronous task for the `Ping` request.
let self_ = self.clone();
tokio::spawn(async move {
// Sleep for the preset time before sending a `Ping` request.
tokio::time::sleep(Duration::from_secs(Self::PING_SLEEP_IN_SECS)).await;
// Check that the peer is still connected.
if self_.router().is_connected(&peer_ip) {
// Retrieve the block locators.
match self_.sync.get_block_locators() {
// Send a `Ping` message to the peer.
Ok(block_locators) => self_.send_ping(peer_ip, Some(block_locators)),
Err(e) => error!("Failed to get block locators - {e}"),
}
}
});
true
}

2.Periodic refresh of peers.

2024-06-25T07:26:07.870256Z  INFO snarkos_node_router::heartbeat: Disconnecting from '127.0.0.1:4132' (periodic refresh of peers)
2024-06-25T07:26:07.870268Z TRACE snarkos_node_router::outbound: Sending 'Disconnect' to '127.0.0.1:4132'

3.Receive a connection request from a peer

2024-06-25T07:26:15.775172Z DEBUG tcp{name="0"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:51296

4.Complete the handshake, and add the peer to the router.

2024-06-25T07:26:15.775550Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:51296 as the Responder
2024-06-25T07:26:15.775572Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:51296'
2024-06-25T07:26:15.789750Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:51296'
2024-06-25T07:26:15.790940Z TRACE snarkos_node_router::handshake: Sending 'ChallengeResponse' to '127.0.0.1:51296'
2024-06-25T07:26:15.791066Z TRACE snarkos_node_router::handshake: Sending 'ChallengeRequest' to '127.0.0.1:51296'
2024-06-25T07:26:15.801582Z TRACE snarkos_node_router::handshake: Received 'ChallengeResponse' from '127.0.0.1:51296'
2024-06-25T07:26:15.804149Z DEBUG tcp{name="0"}: snarkos_node_tcp::protocols::handshake: successfully handshaken with 127.0.0.1:51296

// Add the peer to the router.
self.insert_connected_peer(Peer::new(peer_ip, &peer_request), peer_addr);

5.Create asynchronous tasks for processing, reading, and writing

2024-06-25T07:26:15.804489Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for processing messages from 127.0.0.1:51296

6.Asynchronous task sleep ends, send Ping message

2024-06-25T07:26:15.804961Z TRACE snarkos_node_router::outbound: Sending 'Ping' to '127.0.0.1:4132'

7.Because the peer has been added to the router, can_send returns true and the Ping message is sent. However, because the writing asynchronous task has not been created, the sending fails.

2024-06-25T07:26:15.804980Z  WARN snarkos_node_router::outbound: Failed to send 'Ping' to '127.0.0.1:4132': not connected
2024-06-25T07:26:15.804988Z DEBUG snarkos_node_router::outbound: Disconnecting from '127.0.0.1:4132' (unable to send)
2024-06-25T07:26:15.805060Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 127.0.0.1:51296
2024-06-25T07:26:15.805072Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

// Determine whether to send the message.
if !self.can_send(peer_ip, &message) {
return None;
}
// Resolve the listener IP to the (ambiguous) peer address.
let peer_addr = match self.router().resolve_to_ambiguous(&peer_ip) {
Some(peer_addr) => peer_addr,
None => {
warn!("Unable to resolve the listener IP address '{peer_ip}'");
return None;
}
};

8.The reading and writing asynchronous tasks are created later.

2024-06-25T07:26:15.805089Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::reading: spawned a task for reading messages from 127.0.0.1:51296
2024-06-25T07:26:15.805254Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::writing: spawned a task for writing messages to 127.0.0.1:51296

This will cause two problems:

  • Asynchronous tasks such as processing, reading, and writing are not terminated correctly
  • Because the peer believes that the connection is complete, it will reject subsequent connections from the current node
2024-06-25T07:26:32.872315Z DEBUG tcp{name="1"}: snarkos_node_tcp::tcp: Received a connection from 127.0.0.1:57797
2024-06-25T07:26:32.872716Z DEBUG tcp{name="1"}: snarkos_node_tcp::protocols::handshake: shaking hands with 127.0.0.1:57797 as the Responder
2024-06-25T07:26:32.872755Z DEBUG snarkos_node_router::handshake: Received a connection request from '127.0.0.1:57797'
2024-06-25T07:26:32.873158Z TRACE snarkos_node_router::handshake: Received 'ChallengeRequest' from '127.0.0.1:57797'
2024-06-25T07:26:32.873211Z ERROR tcp{name="1"}: snarkos_node_tcp::protocols::handshake: handshake with 127.0.0.1:57797 failed: Dropping connection request from '127.0.0.1:4202' (already connected)
2024-06-25T07:26:32.873252Z ERROR tcp{name="1"}: snarkos_node_tcp::tcp: Failed to connect with 127.0.0.1:57797: Dropping connection request from '127.0.0.1:4202' (already connected)

Expected Behavior

This shouldn't happen

Your Environment

snarkOS Version: 09aa62b

@ljedrz might be of interest to you

Very cool find; I think I know how to tackle this.

Ah yes, it seems that #2690 will need to be reopened; a workaround where we would abort ping tasks on disconnect would also work, but the real solution is to delay a peer's promotion to the connected status.

On testnet beta, this issue occurs frequently, caused by the UnconfirmedTransaction and UnconfirmedSolution messages.

2024-07-04T10:41:42.069597Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '89.116.29.150:4130': not connected
2024-07-04T10:41:42.069659Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 89.116.29.150:58433
2024-07-04T10:41:42.069667Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T13:19:33.127551Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '211.171.42.223:4130': not connected
2024-07-04T13:19:33.127618Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 211.171.42.223:55251
2024-07-04T13:19:33.127629Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T19:04:50.829665Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '157.66.54.66:4130': not connected
2024-07-04T19:04:50.829711Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 157.66.54.66:56847
2024-07-04T19:04:50.829719Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-04T23:09:06.395996Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedSolution' to '118.140.26.162:4130': not connected
2024-07-04T23:09:06.396038Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 118.140.26.162:47970
2024-07-04T23:09:06.396046Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)

2024-07-05T01:29:44.276617Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '104.237.2.224:4130': not connected      
2024-07-05T01:29:44.276679Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 104.237.2.224:4130                
2024-07-05T01:29:44.276691Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)  

2024-07-05T01:36:28.543874Z  WARN snarkos_node_router::outbound: Failed to send 'UnconfirmedTransaction' to '34.17.53.129:4130': not connected
2024-07-05T01:36:28.544005Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Failed to disconnect, was not connected to 34.17.53.129:4130
2024-07-05T01:36:28.544012Z  WARN snarkos_node_router: Disconnecting with fallback safety (report this to @ljedrz)
2024-07-05T01:36:32.919219Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:32.919232Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:37.919752Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:37.919782Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:42.920965Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:42.920990Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists
2024-07-05T01:36:47.921713Z  WARN tcp{name="0"}: snarkos_node_tcp::tcp: Already connected to 34.17.53.129:4130
2024-07-05T01:36:47.921728Z  WARN snarkos_node_router: Unable to connect to '34.17.53.129:4130' - entity already exists

#3332 should fix this.