NethermindEth / nethermind

A robust execution client for Ethereum node operators.

Home Page: https://nethermind.io/nethermind-client

1.26 on hash consumed almost all memory and CPU, did not process blocks

yorickdowne opened this issue

Description

Nethermind 1.26 stopped processing blocks, gradually consumed all memory, and had high CPU usage.

A restart fixed it, and the node caught up again. It would not quit cleanly, however; Docker reaped it after the configured 5-minute container stop timeout.
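
As an aside, a minimal sketch of how a 5-minute container stop timeout like that is typically configured; the container name here is illustrative, not taken from this setup:

# In Compose: stop_grace_period: 5m on the service, or with the plain CLI:
docker stop --time 300 nethermind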

CL is Nimbus v24.4.0

Logs don't appear to show a clear root cause.

Nethermind is started with the following pruning parameters. It was not running a full prune at the time:

--Pruning.FullPruningMaxDegreeOfParallelism=3 --Pruning.FullPruningTrigger=VolumeFreeSpace --Pruning.FullPruningThresholdMb=375810 --Pruning.CacheMb=4096 --Pruning.FullPruningMemoryBudgetMb=16384 --Init.StateDbKeyScheme=HalfPath

Full startup parameters as shown by ps auxww:

/nethermind/nethermind --datadir /var/lib/nethermind --Init.WebSocketsEnabled true --Network.DiscoveryPort 30303 --Network.P2PPort 30303 --Network.MaxActivePeers 50 --HealthChecks.Enabled true --HealthChecks.UIEnabled true --JsonRpc.Enabled true --JsonRpc.Host 0.0.0.0 --JsonRpc.Port 8545 --JsonRpc.WebSocketsPort 8546 --JsonRpc.EngineHost 0.0.0.0 --JsonRpc.EnginePort 8551 --JsonRpc.AdditionalRpcUrls=http://127.0.0.1:1337|http|admin --JsonRpc.JwtSecretFile=/var/lib/nethermind/ee-secret/jwtsecret --Metrics.Enabled true --Metrics.PushGatewayUrl  --Metrics.ExposeHost 0.0.0.0 --Metrics.ExposePort 6060 --Pruning.FullPruningCompletionBehavior AlwaysShutdown --log info --config mainnet --JsonRpc.EnabledModules Web3,Eth,Subscribe,Net,Health,Parity,Proof,Trace,TxPool --Pruning.FullPruningMaxDegreeOfParallelism=3 --Pruning.FullPruningTrigger=VolumeFreeSpace --Pruning.FullPruningThresholdMb=375810 --Pruning.CacheMb=4096 --Pruning.FullPruningMemoryBudgetMb=16384 --Init.StateDbKeyScheme=HalfPath

Steps to Reproduce

Unsure

Please provide the following information regarding your setup:

  • Operating System: Linux
  • Version: 1.26.0+0068729c
  • Installation Method: Docker
  • Consensus Client: Nimbus v24.4.0

Logs

Nimbus logs: nimbus.log.gz

Logs of Nethermind in its failure state: nm-hang.log.gz

Logs of Nethermind after restart, when it catches up: nm-catchup.log.gz

This happened again on another machine with a Lodestar CL. It's unlikely to be an interop issue, then. I will remove --Pruning.CacheMb=4096 to see whether that makes a difference.

Hi, do you have logs with the Lodestar CL? Also, can you double-check that the CL is synced?

We have the logs in Loki, let me see what I can pull out. The CL is synced, yes.

Since removing --Pruning.CacheMb=4096 from all our nodes we haven't seen a failure in 2 days, whereas previously we had 2 different nodes fail within 2 days.

May I know why --Init.StateDbKeyScheme=HalfPath is on when running hash? Do you plan to migrate via full pruning?

Yes and yes. The intent was to let the automatic prune do the migration.
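
As an aside, if waiting for the VolumeFreeSpace trigger is not desirable, a full prune can also be kicked off by hand through the admin RPC endpoint this setup already exposes on 127.0.0.1:1337. A hedged sketch; the method name admin_prune is my recollection of the Admin module and should be verified against the Nethermind docs:

# manually trigger a full prune via the locally exposed admin endpoint
curl -s -X POST http://127.0.0.1:1337 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"admin_prune","params":[],"id":1}'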

I have not seen another failure since removing the pruning cache parameter.

It did not reproduce when forward syncing. The problem may come from outside block processing.

I've seen this once more; it just took those 2 weeks.

I've configured one of my servers with the cache parameter and debug logs, to hopefully catch this when it happens.

Are those running hash as well, or HalfPath?

This is hash, with the HalfPath parameter to convert during pruning.

Are these nodes validators?

Yes, they are part of my Lido setup. Nimbus or Lodestar depending on the server; 1,000 validators via Vouch per cluster. That means there is a decent amount of block building going on.

Hmm... simply re-running blocks with the block producer's block processor does not work. This needs actual block production.

I'm having a similar issue on v1.26.0, except in my case the Nethermind process became totally unresponsive, eventually turning into a zombie.

Here is my configuration:

config.cfg

{
  "Init": {
    "ChainSpecPath": "chainspec/sepolia.json",
    "GenesisHash": "0x25a5cc106eea7138acab33231d7160d69cb777ee0c2c553fcddf5138993e6dd9",
    "BaseDbPath": "nethermind_db/sepolia",
    "LogFileName": "sepolia.logs.txt",
    "StaticNodesPath": "Data/static-nodes-sepolia.json",
    "MemoryHint": 1048000000
  },
  "TxPool": {
    "Size": 1024
  },
  "Metrics": {
    "NodeName": "Sepolia"
  },
  "Sync": {
    "FastSync": true,
    "FastBlocks": true,
    "SnapSync": true,
    "UseGethLimitsInFastBlocks": true,
    "PivotNumber": 4445000,
    "PivotHash": "0xed83dc15a64c1fc4f9367b602fc7396f62337f7494e34d1bbca0a8f172d4b404",
    "PivotTotalDifficulty": "17000018015853232",
    "FastSyncCatchUpHeightDelta": "10000000000",
    "NonValidatorNode": true,
    "DownloadBodiesInFastSync": false,
    "DownloadReceiptsInFastSync": false,
    "DbTune": "HeavyWrite"
  },
  "Db": {
    "EnableMetricsUpdater": true
  },
  "Blocks": {
    "TargetBlockGasLimit": 30000000
  },
  "Merge": {
    "Enabled": true
  },
  "Pruning": {
    "Mode": "Hybrid",
    "FullPruningMemoryBudgetMb": 10000,
    "FullPruningThresholdMb": 50000,
    "FullPruningTrigger": "StateDbSize",
    "PersistenceInterval": 16384
  },
  "JsonRpc": {
    "Enabled": true,
    "Timeout": 20000,
    "Host": "0.0.0.0",
    "Port": 8545,
    "EnabledModules": [
      "Eth",
      "Subscribe",
      "Trace",
      "TxPool",
      "Web3",
      "Personal",
      "Proof",
      "Net",
      "Parity",
      "Health",
      "Rpc"
    ],
    "AdditionalRpcUrls": [
      "http://127.0.0.1:8551|http;ws|net;eth;subscribe;engine;web3;client"
    ],
    "JwtSecretFile": "/home/nethermind/.nethermind/jwt_secret",
    "IpcUnixDomainSocketPath": "/home/nethermind/.nethermind/nethermind.ipc"
  },
  "TraceStore": {
    "Enabled": true,
    "BlocksToKeep": 50000,
    "TraceTypes": "Trace,Rewards"
  }
}

Log file generation just stopped, there were many CLOSE_WAIT connections, and strace of the process showed:

[pid 164787] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164743] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164742] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164741] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164713] futex(0x55bcad91fa8c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164712] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164662] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164648] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164639] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 164563] futex(0x55bcad676158, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84762] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84761] futex(0x7ec510158370, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84760] futex(0x7ec510158370, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84759] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84758] futex(0x7ec510158370, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84757] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84756] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84755] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 84754] futex(0x7ec510158374, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
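
For anyone trying to capture the same state, a rough sketch of the commands behind this kind of output; the process name nethermind is an assumption here, adjust to your install:

# count sockets stuck in CLOSE_WAIT that belong to the process
ss -tanp state close-wait | grep -c nethermind

# attach to all threads of the running process and see what they block on
strace -f -p "$(pidof nethermind)"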

I never had such an issue with v1.25.4, so it's something in the new v1.26.0 version.

I've removed the CacheMb parameter from the configuration as well and am now monitoring whether the issue reproduces again.
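
A small sketch of one way to do that monitoring from the shell, assuming the process is named nethermind (adjust to your setup):

# print resident memory (KB) and CPU% of the Nethermind process once a minute
watch -n 60 'ps -o rss=,pcpu= -p "$(pidof nethermind)"'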

NOTE: it reproduced for me on the Sepolia testnet, but I'm sure it would be the same bug on mainnet.

@gituser @yorickdowne
With the new 1.27.0 version released, which addressed some similar issues, can you retry once more and see whether you can still reproduce this?

Sure. I'll reintroduce --Pruning.CacheMb=4096 and see what happens.

All but two of my Nethermind boxen are now on HalfPath, migrated by their last automatic prune.