[Bug] Block synchronization has race condition
elderhammer opened this issue
🐛 Bug Report
- First request covering block 36228
2024-06-16T15:19:44.899310Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 37014, canon_height = 36189
2024-06-16T15:19:44.908326Z TRACE snarkos_node_sync::block_sync: Prepared 250 block requests
2024-06-16T15:19:44.987889Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BlockRequest 36225..36230' to '127.0.0.1:5001'
- Logs from 127.0.0.1:5001
2024-06-16T15:19:46.441947Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BlockResponse 36225..36230' to '127.0.0.1:5003'
- First response containing block 36228 is received
2024-06-16T15:19:55.936687Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BlockResponse 36225..36230' from '127.0.0.1:5001'
- Sync logic and request logic are executed concurrently
2024-06-16T15:19:58.168983Z INFO snarkos_node_bft::sync: Syncing the ledger to block 36228...
2024-06-16T15:19:58.191790Z DEBUG snarkos_node_bft::gateway: Deserializing blocks from 127.0.0.1:5004 takes time: 780 ms
2024-06-16T15:19:58.268123Z TRACE snarkos_node_sync::block_sync: Block request 36198 has timed out: is_time_passed = false, is_request_incomplete = true, is_obsolete = true
2024-06-16T15:19:58.268264Z TRACE snarkos_node_sync::block_sync: Updating is_block_synced: greatest_peer_height = 37014, canon_height = 36227
2024-06-16T15:19:58.268431Z TRACE snarkos_node_sync::block_sync: Prepared 1 block requests
2024-06-16T15:19:58.268476Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BlockRequest 36228' to '127.0.0.1:5001'
2024-06-16T15:19:58.268594Z TRACE tcp{name="0"}: snarkos_node_tcp::protocols::writing: sent 14B to 127.0.0.1:43080
2024-06-16T15:19:58.352788Z INFO snarkos_node_bft_ledger_service::ledger:
Advanced to block 36228 at round 95522 - ab1qqsn38wapav0y4fnwkqyur0gn9vdev54rwanxml3kry8y2g9dvrssnspa5
- Request entry removed due to obsolescence
2024-06-16T15:20:04.360013Z TRACE snarkos_node_sync::block_sync: Block request 36228 has timed out: is_time_passed = false, is_request_incomplete = true, is_obsolete = true
- But then the node received a response from the peer
2024-06-16T15:21:00.660685Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BlockResponse 36228' from '127.0.0.1:5001'
- Because 36228 can no longer be found in the requests, an error is reported and all requests to the peer are cleared
2024-06-16T15:21:00.771918Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 127.0.0.1:5001...
2024-06-16T15:21:00.771927Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36445
2024-06-16T15:21:00.771933Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36446
2024-06-16T15:21:00.771936Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36447
2024-06-16T15:21:00.771940Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36448
2024-06-16T15:21:00.771951Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36449
2024-06-16T15:21:00.771955Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36460
2024-06-16T15:21:00.771959Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36461
2024-06-16T15:21:00.771967Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36462
2024-06-16T15:21:00.771976Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36463
2024-06-16T15:21:00.771980Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36464
2024-06-16T15:21:00.771985Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36505
2024-06-16T15:21:00.771989Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36506
2024-06-16T15:21:00.771993Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36507
2024-06-16T15:21:00.771997Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36508
2024-06-16T15:21:00.772001Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36509
2024-06-16T15:21:00.772004Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36510
2024-06-16T15:21:00.772008Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36511
2024-06-16T15:21:00.772012Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36512
2024-06-16T15:21:00.772016Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36513
2024-06-16T15:21:00.772019Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36514
2024-06-16T15:21:00.772023Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36520
2024-06-16T15:21:00.772027Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36521
2024-06-16T15:21:00.772041Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36522
2024-06-16T15:21:00.772046Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36523
2024-06-16T15:21:00.772049Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36524
2024-06-16T15:21:00.772053Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36525
2024-06-16T15:21:00.772056Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36526
2024-06-16T15:21:00.772060Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36527
2024-06-16T15:21:00.772064Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36528
2024-06-16T15:21:00.772068Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36544
2024-06-16T15:21:00.772072Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36545
2024-06-16T15:21:00.772075Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36546
2024-06-16T15:21:00.772078Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36547
2024-06-16T15:21:00.772082Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36548
2024-06-16T15:21:00.772086Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36554
2024-06-16T15:21:00.772094Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36555
2024-06-16T15:21:00.772098Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36556
2024-06-16T15:21:00.772102Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36557
2024-06-16T15:21:00.772105Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36558
2024-06-16T15:21:00.772135Z WARN snarkos_node_bft::gateway: Unable to process block response from '127.0.0.1:5001' - The sync pool did not request block 36228
- A chain reaction
2024-06-16T15:21:03.526259Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BlockResponse 36520..36525' from '127.0.0.1:5001'
2024-06-16T15:21:04.374873Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 127.0.0.1:5001...
2024-06-16T15:21:04.374883Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36440
2024-06-16T15:21:04.374895Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36441
2024-06-16T15:21:04.374906Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36442
2024-06-16T15:21:04.374912Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36443
2024-06-16T15:21:04.374918Z TRACE snarkos_node_sync::block_sync: Removed block request timestamp for 127.0.0.1:5001 at height 36444
2024-06-16T15:21:04.375118Z WARN snarkos_node_bft::gateway: Unable to process block response from '127.0.0.1:5001' - The sync pool did not request block 36520
Steps to Reproduce
I haven't tried to reproduce the issue, but from the logs it looks like any expiring block request could trigger this.
Expected Behavior
Your Environment
snarkOS Version: cf83035
If a BlockRequest times out because of being obsolete, that means the node has already moved past that block height, and it's correct to remove it from requests and request_timestamps. If a BlockResponse still comes in for that block, indeed all requests to that peer are removed (spam protection). This can happen if the blocks are very big, deserialization takes too long, and the request is re-sent. Not sure what could be done except trying to make deserialization faster, but it's slow because of the heavy crypto involved.
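For reference, a minimal sketch of the two behaviors described above, using simplified stand-in types (the SyncPool struct and its method signatures here are hypothetical, not the actual snarkOS implementation; only the requests/request_timestamps names come from the thread): an obsolete request is silently dropped once the canon height passes it, while an unmatched BlockResponse clears every outstanding request to the same peer.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Simplified stand-in for the sync pool's request tracking.
struct SyncPool {
    /// Outstanding block requests: height -> peer the block was requested from.
    requests: HashMap<u32, SocketAddr>,
    /// When each outstanding request was sent: height -> timestamp (millis).
    request_timestamps: HashMap<u32, u64>,
}

impl SyncPool {
    /// A request is obsolete once the ledger has already advanced past its height.
    fn is_obsolete(&self, height: u32, canon_height: u32) -> bool {
        height <= canon_height
    }

    /// Periodic timeout check: drop requests the node no longer needs.
    /// Correct in isolation, since the block is already in the ledger.
    fn check_block_requests(&mut self, canon_height: u32) {
        let obsolete: Vec<u32> = self
            .requests
            .keys()
            .copied()
            .filter(|&height| self.is_obsolete(height, canon_height))
            .collect();
        for height in obsolete {
            self.requests.remove(&height);
            self.request_timestamps.remove(&height);
        }
    }

    /// Response handling: if the height is no longer tracked for this peer,
    /// treat the peer as misbehaving and clear all of its requests.
    fn on_block_response(&mut self, peer: SocketAddr, height: u32) -> Result<(), String> {
        if self.requests.get(&height) != Some(&peer) {
            // Spam protection: remove every request (and timestamp) for this peer.
            let peer_heights: Vec<u32> = self
                .requests
                .iter()
                .filter(|(_, p)| **p == peer)
                .map(|(h, _)| *h)
                .collect();
            for h in peer_heights {
                self.requests.remove(&h);
                self.request_timestamps.remove(&h);
            }
            return Err(format!("The sync pool did not request block {height}"));
        }
        // ...otherwise deserialize and process the block, then clear the entry.
        self.requests.remove(&height);
        self.request_timestamps.remove(&height);
        Ok(())
    }
}
```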
I will add more logs later.
Yes, it can be confirmed that deserialization took too long and the synchronization logic did not process the BlockResponse that had already arrived in time. And because the synchronization logic (remove_block_response) and the request logic (check_block_request) were executed concurrently, the BlockRequest was re-sent.
This caused a race condition in the synchronization of subsequent blocks: requests were cleared because they could not be found in the sync pool, while at the same time try_advance_next updated the requests and requested blocks from the peer.
This made the node's synchronization process inefficient and nondeterministic.
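To make the interleaving concrete, here is a schematic sketch with hypothetical simplified state (the real sync pool is more involved, and the sleep merely forces the problematic ordering): the timeout path removes the entry first, so the late response finds nothing and triggers the clear-all.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Shared request map (height -> peer), a stand-in for the sync pool state
// that both the request logic and the sync logic touch concurrently.
type Requests = Arc<Mutex<HashMap<u32, &'static str>>>;

fn main() {
    let requests: Requests =
        Arc::new(Mutex::new(HashMap::from([(36228, "127.0.0.1:5001")])));

    // Request logic (cf. check_block_request): the ledger has already advanced
    // to 36228, so the still-incomplete request looks obsolete and is removed
    // (and, in the real node, immediately re-sent to the peer).
    let request_logic = {
        let requests = Arc::clone(&requests);
        thread::spawn(move || {
            let canon_height = 36228;
            requests.lock().unwrap().retain(|&height, _| height > canon_height);
        })
    };

    // Sync logic (cf. remove_block_response): the response for 36228 finishes
    // deserializing late and only then looks up its request entry.
    let sync_logic = {
        let requests = Arc::clone(&requests);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10)); // slow deserialization
            if requests.lock().unwrap().remove(&36228).is_none() {
                // Entry already gone: the peer is treated as misbehaving and
                // all of its requests are cleared, starting the chain reaction.
                eprintln!("The sync pool did not request block 36228");
            }
        })
    };

    request_logic.join().unwrap();
    sync_logic.join().unwrap();
}
```

Run as-is, this prints the same warning seen in the logs above, because the request entry is gone by the time the late response is processed.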
I have two questions:
- Will the chain reaction continue indefinitely even without malicious behavior?
- If the node receives a BlockResponse it never actually requested from a malicious peer during synchronization, will its synchronization logic get stuck?
- We've changed the peer's liveness heartbeat to be reset on any message from the peer (see the sketch below); together with the asynchronous deserialization and handling of the BlockResponses, the node should eventually be able to continue, though too much data could be requested and discarded
- I don't think so, as only the requests to that specific peer will be removed
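A rough sketch of the heartbeat half of that change, assuming liveness is tracked as a per-peer last-seen timestamp (the PeerState type and method names here are hypothetical, not the actual snarkOS gateway types):

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-peer state; the actual snarkOS gateway types differ.
struct PeerState {
    last_seen: Instant,
}

impl PeerState {
    /// Any inbound message refreshes the liveness heartbeat, so a peer that
    /// is slow to answer one BlockRequest is not written off while other
    /// traffic from it is still arriving.
    fn on_any_message(&mut self) {
        self.last_seen = Instant::now();
    }

    /// The peer counts as alive until no message of any kind has been seen
    /// for the full timeout window.
    fn is_alive(&self, timeout: Duration) -> bool {
        self.last_seen.elapsed() < timeout
    }
}
```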
I tested:
- Confirmed that the chain reaction does not last forever
- Confirmed that the synchronization logic does not get stuck even when attacked
Thank you for your patient reply.