hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu

Home Page:https://www.hyperledger.org/projects/besu


Transaction that is added to a node's pool may not be propagated to the pools of its peer nodes

nano-adhara opened this issue · comments

Introduction

In certain situations, a transaction that is added to a node's pool (of the sequenced type) may not be propagated to the pools of its peer nodes, resulting in the transaction being present only in the pool of the node that received it rather than being distributed across all nodes' pools.

Summary

We are using the following setup for a private Besu network running the QBFT protocol:

  • A non-validator node that doesn’t participate in the consensus protocol.
  • One node acting as a validator.

Users connect to the non-validator node to send transactions, and it propagates them to the validator so that they are included in new blocks.
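
For reference, the snippet below is only an illustrative sketch of how a client submits a transaction to the non-validator node: a plain eth_sendRawTransaction JSON-RPC call over HTTP. The raw transaction hex is a placeholder, and the port matches the non-validator config shown further down.

// Illustrative sketch only: submit an already-signed transaction to the
// non-validator node's JSON-RPC endpoint. The raw transaction hex is a
// placeholder, not a value from this issue.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SendToNonValidator {
    public static void main(String[] args) throws Exception {
        String rawTx = "0x..."; // placeholder for a signed transaction
        String body = "{\"jsonrpc\":\"2.0\",\"method\":\"eth_sendRawTransaction\","
                + "\"params\":[\"" + rawTx + "\"],\"id\":1}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8545")) // non-validator rpc-http-port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // transaction hash on success
    }
}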

We start from a scenario where all the nodes have a transaction (let’s call it “tx_a”) in their pools, as shown in the first image.

first image

If the transaction is then dropped from the nodes' pools for any reason, all the pools are empty, as we can see in the second image.

second image

Then, if the transaction is sent again to the non-validator node, it is added to that node's pool, but it is not added to the validator's pool, as we can see in the third image.

third image

Detailed description

We have observed two problematic scenarios related to how the nodes handle their caches:

  1. When a client sends a transaction to the non-validator node, and that node has previously dropped the same transaction, it does not propagate the transaction to its peers in the network. If the transaction is not propagated to a validator, it will never be mined.
  2. When a peer propagates a transaction to the validator node, and that node has already received the transaction before, the transaction is not promoted to the transaction pool. If the transaction is not promoted to the transaction pool, it will never be mined.

Both scenarios are problematic when the transaction is valid but the node dropped it from the pool before it was mined, for any reason, e.g. exceeding the tx-pool-retention-hours limit.

We were able to add that transaction again to the validator’s pool only by restarting all the nodes and sending the transaction again.

It seems to be a bug in which internal caches are not updated properly when transactions are dropped/evicted from the pool, preventing those transactions from ever being added to new blocks: they are not propagated by the non-validator node, and they are rejected by the validator as well.
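
To illustrate the hypothesis, here is a minimal sketch (hypothetical names, not Besu's actual code) of how a "seen transactions" cache that is not cleaned up on eviction would produce exactly this behaviour:

// Minimal sketch (hypothetical names, not Besu's actual code) of how a
// "seen transactions" cache that is not cleaned on eviction can cause
// the behaviour we are reporting.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SeenCachePoolSketch {
    private final Set<String> seenTxHashes = Collections.synchronizedSet(new HashSet<>());
    private final Map<String, String> pool = new HashMap<>(); // tx hash -> raw tx

    void addTransaction(String hash, String rawTx) {
        if (!seenTxHashes.add(hash)) {
            // Already-seen hash: the tx is silently ignored, so it is neither
            // propagated to peers (scenario 1) nor promoted to the pool (scenario 2).
            return;
        }
        pool.put(hash, rawTx);
        propagateToPeers(rawTx);
    }

    void evictAfterRetention(String hash) {
        pool.remove(hash);
        // Suspected bug: the hash stays in seenTxHashes, so resubmitting the
        // same valid transaction later hits the early return above.
    }

    private void propagateToPeers(String rawTx) {
        // placeholder for gossiping the tx to connected peers
    }
}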

Versions

This issue occurs in the latest version of Besu as of the time of writing (v24.3.3). We believe it occurs in previous versions as well.

Steps to reproduce

This is an explanation of how we were able to reproduce the bug:

1. Set up at least one non-validator node and at least one validator node with a QBFT network configuration.
2. Send a new transaction using a higher nonce than the expected one, so a nonce gap is created and the transaction remains in the pools of both the non-validator and validator nodes.
3. Force the transaction to be dropped from the pools (for example, by waiting for tx-pool-retention-hours to expire).
4. Resend the transaction to the non-validator node; you will see that it does not appear in the validator's pool, since it is neither propagated by the non-validator nor accepted by the validator (it is filtered out as an already-seen transaction). A sketch for checking both pools follows this list.
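
The following helper is only a sketch for that final check, assuming the TXPOOL API is enabled as in the configs below; it calls Besu's txpool_besuTransactions RPC method on both nodes (ports as configured in this issue) and prints the two pools for comparison.

// Sketch: compare the transaction pools of both nodes after step 4, using
// Besu's txpool_besuTransactions RPC method. Ports match the configs below.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ComparePools {
    private static final String QUERY =
            "{\"jsonrpc\":\"2.0\",\"method\":\"txpool_besuTransactions\",\"params\":[],\"id\":1}";

    public static void main(String[] args) throws Exception {
        System.out.println("non-validator pool: " + poolOf("http://127.0.0.1:8545"));
        System.out.println("validator pool:     " + poolOf("http://127.0.0.1:8585"));
        // In the failing scenario, the resent tx shows up only in the first response.
    }

    private static String poolOf(String rpcUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(rpcUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(QUERY))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}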

Expected behavior

✅ = already happening
❌ = not happening

  1. The user sends a transaction, which is received by the non-validator node and added to the pool. ✅
  2. The transaction is propagated to the validator nodes. ✅
  3. The validator nodes accept the transaction and add it to their pool. ✅
  4. Every node drops the transaction from its pool. ✅
  5. The user resends the transaction, which is received by the non-validator node and added to the pool. ✅
  6. The same as step 2. ❌
  7. The same as step 3. ❌

Nodes execution arguments

--tx-pool-max-size=5
--tx-pool=sequenced
--tx-pool-limit-by-account-percentage=1

Nodes config

Non-validator node

# Network
p2p-host="127.0.0.1"
p2p-port=1232
max-peers=42

rpc-http-enabled=true
rpc-http-api=["ETH","NET","WEB3","IBFT","QBFT","TXPOOL","ADMIN"]

host-whitelist=["*"]
rpc-http-cors-origins=["all"]

rpc-http-host="0.0.0.0"
rpc-http-port=8545

rpc-ws-enabled=true
rpc-ws-host="0.0.0.0"
rpc-ws-port=30303

# Mining
miner-enabled=true
miner-coinbase="0xfe3b557e8fb62b89f4916b721be55ceb828dbd73"

min-gas-price="0"
revert-reason-enabled=true

metrics-category=[ "ETHEREUM", "BLOCKCHAIN","EXECUTORS","JVM","NETWORK","PEERS","PROCESS","KVSTORE_ROCKSDB","KVSTORE_ROCKSDB_STATS","RPC","SYNCHRONIZER", "TRANSACTION_POOL" ]
metrics-enabled=true
metrics-host="0.0.0.0"
metrics-port=9095

Validator node config

# Network
p2p-host="127.0.0.1"
p2p-port=1234
max-peers=42

rpc-http-enabled=true
rpc-http-api=["ETH","NET","WEB3","IBFT","QBFT","TXPOOL"]

host-whitelist=["*"]
rpc-http-cors-origins=["all"]

rpc-http-host="0.0.0.0"
rpc-http-port=8585

rpc-ws-enabled=true
rpc-ws-host="0.0.0.0"
rpc-ws-port=30305

# Mining
miner-enabled=true
miner-coinbase="0xfe3b557e8fb62b89f4916b721be55ceb828dbd73"

min-gas-price="0"
revert-reason-enabled=true

metrics-category=[ "ETHEREUM", "BLOCKCHAIN","EXECUTORS","JVM","NETWORK","PEERS","PROCESS","KVSTORE_ROCKSDB","KVSTORE_ROCKSDB_STATS","RPC","SYNCHRONIZER", "TRANSACTION_POOL" ]
metrics-enabled=true
metrics-host="0.0.0.0"
metrics-port=9097

genesis.json

{
 "config": {
   "muirGlacierBlock": 0,
   "chainId": 44844,
   "contractSizeLimit": 2147483647,
   "qbft": {
     "blockperiodseconds": 1,
     "epochlength": 30000,
     "requesttimeoutseconds": 10
   }
 },
 "nonce": "0x0",
 "timestamp": "0x58ee40ba",
 "gasLimit": "0x5F5E100",
 "difficulty": "0x1",
 "mixHash": "0x63746963616c2062797a616e74696e65206661756c7420746f6c6572616e6365",
 "coinbase": "0x0000000000000000000000000000000000000000",
 "alloc": {
   "fe3b557e8fb62b89f4916b721be55ceb828dbd73": {
     "privateKey": "8f2a55949038a9610f50fb23b5883af3b4ecb3c3bb792cbcefbd1542c692be63",
     "comment": "private key and this comment are ignored.  In a real chain, the private key should NOT be stored",
     "balance": "0xad78ebc5ac6200000"
   },
   "627306090abaB3A6e1400e9345bC60c78a8BEf57": {
     "privateKey": "c87509a1c067bbde78beb793e6fa76530b6382a4c0241e5e4a9ec0a0f44dc0d3",
     "comment": "private key and this comment are ignored.  In a real chain, the private key should NOT be stored",
     "balance": "90000000000000000000000"
   },
   "f17f52151EbEF6C7334FAD080c5704D77216b732": {
     "privateKey": "ae6ae8e5ccbfb04590405997ee2d52d2b330726137b875053c36d94e974d162f",
     "comment": "private key and this comment are ignored.  In a real chain, the private key should NOT be stored",
     "balance": "90000000000000000000000"
   }
 },
 "extraData": "0xf87aa00000000000000000000000000000000000000000000000000000000000000000f85494792fc5093a85bd8fb52c781aefee7da96d2180cf9414275b2f4cefb4c72f12ca32ce0044578923e1b694fcbc96c1e8a673b7cdc333c6687f07fa2c28befe94b5ec93ab0a6ad8f8c0e404e3225b16cb2ea23a1ec080c0"
}

I added the node execution arguments to the issue. This is using the sequenced transaction pool.

Hi @nano-adhara, I see where the issue is coming from. Besu keeps caches to remember which txs have been exchanged with other peers, basically because we want to avoid re-sending a tx to a peer multiple times (the p2p protocol actually states that peers re-sending txs should be disconnected), and we also want to avoid reprocessing a tx that we have already seen.

Those caches are quite basic at the moment, so they could be improved to better handle scenarios like the one you are reporting. I could take a look at them and try to remove, from these caches, valid txs that are dropped from the txpool; this needs to be done with care to avoid being exposed to attacks.
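
Very roughly, the idea would look something like this (a sketch with hypothetical names, not the real Besu classes): keep the already-seen hashes in a bounded LRU structure, and also forget a hash when a still-valid tx is dropped from the pool, so that a later resubmission is processed and propagated again.

// Rough sketch of the idea (hypothetical names, not the real Besu classes).
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class SeenTransactionsCacheSketch {
    private static final int MAX_ENTRIES = 10_000;

    // LRU set of transaction hashes built on top of an access-ordered LinkedHashMap,
    // so the cache cannot grow without limit.
    private final Set<String> seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            });

    boolean markSeen(String txHash) {
        return seen.add(txHash); // false if this hash was already seen
    }

    // Called when a valid tx is evicted from the pool (e.g. retention timeout),
    // so that a resubmitted copy is treated as new again.
    void forget(String txHash) {
        seen.remove(txHash);
    }
}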

One last question about your network: is it common for a valid tx to stay in the pool for so long that it could be evicted by the timer?

Hello @fab-10,

As an answer to your last question: it does not happen that often, roughly once a month, but the real problem is that whenever it happens, the only solution is to restart the Besu clients.

I'll include Fernando @chookly314 and Coenie @coeniebeyers in the topic to move the conversation forward.