tendermint / tmkms

Key Management service for Tendermint Validator nodes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Validator disconnect on step regression

mdyring opened this issue · comments

In below log, 5 validators are connected to single tmkms process on same chain-id. It appears that during a Proposal, some validators (.240 and .108 below) get disconnected due to step regression.

Is this as expected? It does not seem to happen with validators attempting to double sign with different block ids, so I would think not.

Disconnecting seems undesirable when trying to do HA. ;-)

tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed Proposal:60E46FCB09 at h/r/s 88801/0/3 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://35.156.81.246:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 85CC5AD4DD)
tmkms[21952]: 18:52:22 [error] [gaia-13004:tcp://54.93.169.244:26659] attempted double sign at h/r/s: 88801/0/3 (60E46FCB09 != 9DFE0E1AE9)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://35.158.126.240:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [error] [gaia-13004@tcp://18.196.63.108:26659] attempted double sign: step regression: round regression at height:88801 round:0 last step:6 new step:3
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreVote:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://35.156.81.246:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://18.196.17.223:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:22 [info] [gaia-13004@tcp://54.93.169.244:26659] signed PreCommit:60E46FCB09 at h/r/s 88801/0/6 (0 ms)
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [info] KMS node ID: 20E2B9BE81DD6CBB410B14FB59EBD4DBC9E0228C
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 35.158.126.240:26659: unverified validator peer ID! (ADCA89A67DAA5FFC2D2EFFD8AD33B1DC8DCD9466)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://35.158.126.240:26659] connected to validator successfully
tmkms[21952]: 18:52:23 [warn] [gaia-13004] 18.196.63.108:26659: unverified validator peer ID! (00ACD9BDBE36B2933231545AEBF1D799DD06D83B)
tmkms[21952]: 18:52:23 [info] [gaia-13004@tcp://18.196.63.108:26659] connected to validator successfully

@mdyring I'd suggest opening a corresponding issue on https://github.com/tendermint/tendermint since I do not believe the KMS is closing the connection here, but rather my best guess is Tendermint is closing the connection after receiving a double-signing error response. It'd be good to confirm these are both the case, though.

@tarcieri done, thanks. From looking at Tendermint log it looks like it sees the connection close, I have added log output to tendermint/tendermint#3844.

Here's the relevant KMS codepath. It returns an error response to Tendermint, but otherwise considers the request successfully handled and therefore it shouldn't abort the connection.

You can also turn on debug logging on the KMS by passing the -v or --verbose option to tmkms start.

From looking at Tendermint log it looks like it sees the connection close, I have added log output to tendermint/tendermint#3844.

I looked at the attached log output but I don't see what you're talking about re: Tendermint showing the connection being remotely closed by the KMS.

Sorry, should have been clearer: First line has a "EOF" at the end, indicating the connection was closed.

This seems to indicate the connection closure is unexpected for Tendermint, but it could also just be a case of Tendermint closing the connection earlier and then failing on a read later and logging that.

Here's the relevant KMS codepath.

Isn't this only for the StateErrorKind::DoubleSign ? Martin's error seems to be of StateErrorKind::StepRegression ie. https://github.com/tendermint/kms/blob/3b3aabf8eab64959dad29ee2d5b747dc5fe813ac/src/chain/state.rs#L92-L99 which means sign would return an error, which I think exits the request_loop ?

@ebuchman good catch, that appears to be what's happening.

Perhaps the KMS should send back an error to Tendermint on StateErrorKind::StepRegression as well.