hyperledger-archives / sawtooth-pbft

Sawtooth PBFT consensus engine

Home Page: https://wiki.hyperledger.org/display/sawtooth

Possible Bug in Consensus

chaegleyOnclave opened this issue

I've been noticing an issue when using PBFT consensus and hoped I could find some help here.
When two different nodes attempt to write different blocks at very similar times, one node will beat the other, and the other node will fail to write a block. That is expected, but the node that failed to write a block then reports this "failed to cancel block" error:

INFO | pbft_engine::node:47 | Failed to cancel block when becoming secondary: InvalidState("Cannot cancel block in current state")

This leads to the PBFT engine behaving irregularly and eventually crashing.

When the PBFT engine eventually crashes, it reports a zmq error saying that the socket was dropped, with little other context.

Before it crashes, the engine is unable to use consensus properly.

Other nodes may also crash when this failure state is reached, even though they did not fail to write a block.

Stopping and rebuilding the Docker containers tends to fix this issue, but it is concerning that it occurs at all, and that nodes which never entered this error state also fail.

I realized that my nodes are running PBFT engine version 1.0.2, and I am planning to upgrade to the latest version. However, I'm uncertain whether that will prevent this issue from happening again. So far I have not been able to replicate the issue consistently, as it is a timing error that occurs only when two nodes attempt to write a block at very similar times. However, I have seen it occur multiple times and am concerned about PBFT's stability.
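
Since the failure is hard to reproduce on demand, one way to collect occurrences is to scan saved engine logs for the message quoted above. The following is only a minimal sketch in Python, assuming the logs are kept as plain-text lines in the same format as the line shown in this report; the function name `find_cancel_failures` is hypothetical and not part of Sawtooth or the PBFT engine.

```python
import re

# Pattern taken from the log line quoted in this report. Whether other
# engine versions word the message identically is an assumption.
CANCEL_FAILURE = re.compile(
    r"Failed to cancel block when becoming secondary: (?P<reason>.*)$"
)


def find_cancel_failures(lines):
    """Return a list of (line_number, reason) for each matching log line.

    `lines` is any iterable of strings, for example an open log file.
    Line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = CANCEL_FAILURE.search(line.rstrip("\n"))
        if match:
            hits.append((number, match.group("reason")))
    return hits
```

Passing an open log file from each node to `find_cancel_failures` and comparing the matching lines against the other nodes' logs may help narrow down which block publications collided.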