hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu

Home Page:https://www.hyperledger.org/projects/besu

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Could not confirm best peer had pivot block

macfarla opened this issue · comments

latest besu from this commit 27e5a64
1 of 6 Azure nodes stuck with this recurring error

{"@timestamp":"2024-05-16T03:45:24,313","level":"DEBUG","thread":"EthScheduler-Timer-0","class":"SyncTargetManager","message":"Could not confirm best peer had pivot block","throwable":" java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Timeout after 5000 MILLISECONDS\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1576)\n\tat java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194)\n\tat org.hyperledger.besu.ethereum.eth.manager.EthScheduler.lambda$failAfterTimeout$11(EthScheduler.java:301)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.util.concurrent.TimeoutException: Timeout after 5000 MILLISECONDS\n\tat org.hyperledger.besu.ethereum.eth.manager.EthScheduler.lambda$failAfterTimeout$11(EthScheduler.java:300)\n\t... 5 more\n"}

looks like a chain halt - nothing in INFO logs for 3h after this:

{"@timestamp":"2024-05-15T23:14:08,234","level":"INFO","thread":"EthScheduler-Services-36 (requestCompleteTask)","class":"SnapWorldDownloadState","message":"Pausing world state download while waiting for sync to complete","throwable":""} {"@timestamp":"2024-05-15T23:14:45,654","level":"INFO","thread":"EthScheduler-Services-279 (importBlock)","class":"ImportBlocksStep","message":"Block import progress: 12582224 of 19878555 (63%), Peer count: 25","throwable":""} {"@timestamp":"2024-05-15T23:22:41,709","level":"INFO","thread":"EthScheduler-Services-225 (importBlock)","class":"ImportBlocksStep","message":"Block import progress: 12606638 of 19878555 (63%), Peer count: 24","throwable":""} {"@timestamp":"2024-05-15T23:24:47,420","level":"INFO","thread":"EthScheduler-Services-258 (importBlock)","class":"ImportBlocksStep","message":"Block import progress: 12614838 of 19878555 (63%), Peer count: 25","throwable":""} {"@timestamp":"2024-05-15T23:32:05,457","level":"INFO","thread":"EthScheduler-Services-283","class":"SnapWorldDownloadState","message":"Running world state heal process from peers with pivot block 19878650","throwable":""} {"@timestamp":"2024-05-15T23:34:06,377","level":"INFO","thread":"EthScheduler-Services-283 (importBlock)","class":"ImportBlocksStep","message":"Block import progress: 12639013 of 19878650 (63%), Peer count: 25","throwable":""}

Restart resolved the issue for this node.

with the extra info in the log message,
{"@timestamp":"2024-05-20T21:37:46,523","level":"DEBUG","thread":"EthScheduler-Timer-0","class":"SyncTargetManager","message":"Could not confirm best peer 0xaaa... had pivot block 19913245","throwable":" java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Timeout after 5000 MILLISECONDS\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1576)\n\tat java.base/java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1219)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194)\n\tat org.hyperledger.besu.ethereum.eth.manager.EthScheduler.lambda$failAfterTimeout$11(EthScheduler.java:301)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.util.concurrent.TimeoutException: Timeout after 5000 MILLISECONDS\n\tat org.hyperledger.besu.ethereum.eth.manager.EthScheduler.lambda$failAfterTimeout$11(EthScheduler.java:300)\n\t... 5 more\n"}

could then use admin_removePeer to remove that problem peer, for a different way to recover

the problem is that we're calling this function recursively, and there is a repeated timeout exception - but we're still retrying because of the recursion. And the exceptionally() only returns once the other part completes

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html#exceptionally-java.util.function.Function-

Returns a new CompletableFuture that is completed when this CompletableFuture completes, with the result of the given function of the exception triggering this CompletableFuture's completion when it completes exceptionally; otherwise, if this CompletableFuture completes normally, then the returned CompletableFuture also completes normally with the same value.