tendermint / tmkms

Key Management service for Tendermint Validator nodes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unexpected log output - appears to double sign

mdyring opened this issue · comments

We're testing an active/active setup of two validators connecting to a single tmkms v0.6.0-rc1 process on the gaia-13004 testnet.

AFAICS in the source, "signed at height" is only logged after a signature is created:https://github.com/tendermint/kms/blob/e5dadf1a7bef351ea78c082e1aefab3db44aedfc/src/session.rs#L221

So it appears tmkms is double signing for the same chain id?

tmkms[9826]: 12:49:39 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height: 55157 (0 ms)
tmkms[9826]: 12:49:39 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height: 55157 (0 ms)
tmkms[9826]: 12:49:39 [info] [gaia-13004@35.156.81.246:26659] signed "PreCommit" at height: 55157 (0 ms)
tmkms[9826]: 12:49:39 [info] [gaia-13004@35.158.126.240:26659] signed "PreCommit" at height: 55157 (0 ms)
tmkms[9826]: 12:49:45 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height: 55158 (0 ms)
tmkms[9826]: 12:49:45 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height: 55158 (0 ms)
tmkms[9826]: 12:49:45 [info] [gaia-13004@35.156.81.246:26659] signed "PreCommit" at height: 55158 (0 ms)
tmkms[9826]: 12:49:45 [info] [gaia-13004@35.158.126.240:26659] signed "PreCommit" at height: 55158 (0 ms)

Our validator is not getting slashed so far.. Maybe because we have yet to propose a block and have been lucky so far (prevoting/commiting on same data on both vals).

I should probably have it print out the entire consensus state here rather than just the height. There's also the round and step to consider, which is my guess of what's happening here.

PR to add height/round/step logging here: https://github.com/tendermint/kms/pull/316/files

@mdyring that's merged to master now... are you capable of building from that to test this out again?

Thanks, I've tested with master and it seems something is fishy:

tmkms[26754]: 20:25:02 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height/round/step: 59807/0/6 (0 ms)
tmkms[26754]: 20:25:02 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height/round/step: 59807/0/6 (0 ms)
tmkms[26754]: 20:25:02 [info] [gaia-13004@35.156.81.246:26659] signed "PreCommit" at height/round/step: 59807/0/6 (0 ms)
tmkms[26754]: 20:25:02 [info] [gaia-13004@35.158.126.240:26659] signed "PreCommit" at height/round/step: 59807/0/6 (0 ms)
tmkms[26754]: 20:25:09 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height/round/step: 59808/0/6 (0 ms)
tmkms[26754]: 20:25:09 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height/round/step: 59808/0/6 (0 ms)
tmkms[26754]: 20:25:09 [info] [gaia-13004@35.156.81.246:26659] signed "PreCommit" at height/round/step: 59808/0/6 (0 ms)
tmkms[26754]: 20:25:09 [info] [gaia-13004@35.158.126.240:26659] signed "PreCommit" at height/round/step: 59808/0/6 (0 ms)
tmkms[26754]: 20:25:15 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height/round/step: 59809/0/6 (0 ms)
tmkms[26754]: 20:25:15 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height/round/step: 59809/0/6 (0 ms)
tmkms[26754]: 20:25:15 [info] [gaia-13004@35.156.81.246:26659] signed "PreCommit" at height/round/step: 59809/0/6 (0 ms)
tmkms[26754]: 20:25:15 [info] [gaia-13004@35.158.126.240:26659] signed "PreCommit" at height/round/step: 59809/0/6 (0 ms)
tmkms[26754]: 20:25:20 [info] [gaia-13004@35.156.81.246:26659] signed "PreVote" at height/round/step: 59810/0/6 (0 ms)
tmkms[26754]: 20:25:20 [info] [gaia-13004@35.158.126.240:26659] signed "PreVote" at height/round/step: 59810/0/6 (0 ms)

I am testing this with softsign, but don't believe it makes a difference. Also happy to share config if you'd like.

Interestingly, on our current production setup, with a single connection from kms -> val, we see two log files for each block - so also not a line for each step as you'd expect:

tmkms[25012]: 20:31:24 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreVote" at height: 1192058 (142 ms)
tmkms[25012]: 20:31:25 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreCommit" at height: 1192058 (142 ms)
tmkms[25012]: 20:31:31 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreVote" at height: 1192059 (142 ms)
tmkms[25012]: 20:31:32 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreCommit" at height: 1192059 (142 ms)
tmkms[25012]: 20:31:38 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreVote" at height: 1192060 (142 ms)
tmkms[25012]: 20:31:38 [info] [cosmoshub-2@x.x.x.x:26659] signed "PreCommit" at height: 1192060 (142 ms)

I'm now guessing both validators are submitting an identical block and requesting it be signed, which is allowed under the current logic (and results in an identical signature, as Ed25519 is deterministic):

https://github.com/tendermint/kms/blob/master/src/chain/state.rs#L97

I can add the block ID being signed to the log output to confirm this.

I think this should be allowed, since it tolerates faults where the KMS signs a block for a validator which goes down. If another validator asks to sign the same block as the original, and the signature algorithm is deterministic, it's really just recomputing the original signature.

Interestingly, on our current production setup, with a single connection from kms -> val, we see two log files for each block

Those are the steps of Tendermint consensus: PreVote -> PreCommit -> Commit (where the first two each require a signature)

PR to add block ID logging and a [dup] flag when the KMS resigns the same block here: #317

@mdyring want to try it out again with the logging changes from #317 and see what the output is? They now look something like this and will add a [dup] flag when the KMS recomputes the same signature as the last block it signed:

22:47:37 [info] [test_chain_id@127.0.0.1:60313] signed PreVote:736F6D65 h/r/s:12345/2/6 (0 ms)

Happy to report it is as guessed (identical blocks) :-)

Minor detail: it shows [dup] for same PreVote/PreCommit. Since block id is now logged, I think we can safely remove the [dup] logging as it doesn't add that much value.

tmkms[18934]: 08:52:28 [info] [gaia-13004@35.156.81.246:26659] signed PreVote:48620B69 h/r/s:67611/0/6 (0 ms)
tmkms[18934]: 08:52:28 [info] [gaia-13004@35.158.126.240:26659] signed PreVote:48620B69 h/r/s:67611/0/6 (0 ms) [dup]
tmkms[18934]: 08:52:28 [info] [gaia-13004@35.156.81.246:26659] signed PreCommit:48620B69 h/r/s:67611/0/6 (0 ms) [dup]
tmkms[18934]: 08:52:28 [info] [gaia-13004@35.158.126.240:26659] signed PreCommit:48620B69 h/r/s:67611/0/6 (0 ms) [dup]
tmkms[18934]: 08:52:34 [info] [gaia-13004@35.156.81.246:26659] signed PreVote:51A5AB0D h/r/s:67612/0/6 (0 ms)
tmkms[18934]: 08:52:34 [info] [gaia-13004@35.158.126.240:26659] signed PreVote:51A5AB0D h/r/s:67612/0/6 (0 ms) [dup]
tmkms[18934]: 08:52:34 [info] [gaia-13004@35.156.81.246:26659] signed PreCommit:51A5AB0D h/r/s:67612/0/6 (0 ms) [dup]
tmkms[18934]: 08:52:34 [info] [gaia-13004@35.158.126.240:26659] signed PreCommit:51A5AB0D h/r/s:67612/0/6 (0 ms) [dup]

Thanks for the work on this and it appears we can safely do active/active now with multiple vals and a single KMS. I will increase number of vals for further testing, but feel free to close this issue.

Btw would be great to add block id logging when there is a problem, such as these:

tmkms[19431]: 11:30:11 [warn] [gaia-13004:35.158.126.240:26659] attempt to double sign at height/round/step: 69251/0/3
tmkms[19431]: 11:30:11 [error] [gaia-13004@tcp://18.196.63.108:26659] attempted double sign: step regression: round regression at height:69251 round:0 last step:6 new step:3

Yeah, the logging isn't correct, and making it correct would require changes to the state file, so I think I'll follow your recommendation and remove it for now, but add more block ID logging.

@mdyring #319 implemented your suggestions if you want to give it one more go before a final KMS release

Looks great, thanks!

Some minor details for consistency:

  1. added tcp:// prefix in the "attempted double sign" lines with step regression:
  2. warn is emitted for trying to double sign different blocks, error given for step regression
  3. "attempt to double sign" != "attempted double sign" (can make monitoring easier if consistent)
tmkms[3137]: 09:29:25 [info] [gaia-13004@35.156.81.246:26659] signed Proposal:DE23ECA605 at h/r/s 82953/0/3 (0 ms)
tmkms[3137]: 09:29:25 [warn] [gaia-13004:18.196.63.108:26659] attempt to double sign at h/r/s: 82953/0/3 (DE23ECA605 != AEFFF4498D)
tmkms[3137]: 09:29:25 [warn] [gaia-13004:35.158.126.240:26659] attempt to double sign at h/r/s: 82953/0/3 (DE23ECA605 != 1B867E0652)
tmkms[3137]: 09:29:25 [info] [gaia-13004@35.156.81.246:26659] signed PreCommit:(none) at h/r/s 82953/0/6 (0 ms)
tmkms[3137]: 09:29:25 [info] [gaia-13004@18.196.63.108:26659] signed PreCommit:(none) at h/r/s 82953/0/6 (0 ms)
tmkms[3137]: 09:29:25 [info] [gaia-13004@35.158.126.240:26659] signed PreCommit:(none) at h/r/s 82953/0/6 (0 ms)
tmkms[3137]: 09:29:25 [error] [gaia-13004@tcp://54.93.169.244:26659] attempted double sign: step regression: round regression at height:82953 round:0 last step:6 new step:3
tmkms[3137]: 09:29:25 [error] [gaia-13004@tcp://18.196.17.223:26659] attempted double sign: step regression: round regression at height:82953 round:0 last step:6 new step:3

@mdyring those should all be taken care of in #322