tendermint / tmkms

Key Management service for Tendermint Validator nodes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Avoiding duplicate signatures

mdyring opened this issue · comments

Related to #314, we're testing multiple (5) validators connecting to a single KMS process.

This turns up in the validator logs, so it seems the multiple PreVote/PreCommits are broadcast as expected:

gaiad[1486]: E[2019-07-27|09:20:43.397] Error attempting to add vote                 module=consensus err="Existing vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A A5AC3E828136 @ 2019-07-27T09:20:43.057849935Z}; New vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A 3F7B5DCF5942 @ 2019-07-27T09:20:43.061941252Z}: Non-deterministic signature"
gaiad[1486]: E[2019-07-27|09:20:43.405] Error attempting to add vote                 module=consensus err="Existing vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A A5AC3E828136 @ 2019-07-27T09:20:43.057849935Z}; New vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A 3BFFA669DF1C @ 2019-07-27T09:20:43.09168945Z}: Non-deterministic signature"
gaiad[1486]: E[2019-07-27|09:20:43.406] Error attempting to add vote                 module=consensus err="Existing vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A A5AC3E828136 @ 2019-07-27T09:20:43.057849935Z}; New vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A 26B42A52E8DA @ 2019-07-27T09:20:43.071819843Z}: Non-deterministic signature"
gaiad[1486]: E[2019-07-27|09:20:43.423] Error attempting to add vote                 module=consensus err="Existing vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A A5AC3E828136 @ 2019-07-27T09:20:43.057849935Z}; New vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A 9580C1CF6F62 @ 2019-07-27T09:20:43.13180719Z}: Non-deterministic signature"
gaiad[1486]: E[2019-07-27|09:20:43.511] Error attempting to add vote                 module=consensus err="Existing vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A A5AC3E828136 @ 2019-07-27T09:20:43.057849935Z}; New vote: Vote{8:32363E5BAA6B 67906/00/2(Precommit) 2031A4FCA22A 26B42A52E8DA @ 2019-07-27T09:20:43.071819843Z}: Non-deterministic signature"

According to this log output they are not duplicate when diving into the tendermint source:
https://github.com/tendermint/tendermint/blob/b73cfe878682e0b73d9a21ec1d8dc456ddd90215/types/vote_set.go#L181

Log from the KMS side for above block:

tmkms[19431]: 09:20:38 [info] [gaia-13004@35.156.81.246:26659] signed PreVote:2031A4FC h/r/s:67906/0/6 (0 ms)
tmkms[19431]: 09:20:38 [info] [gaia-13004@35.158.126.240:26659] signed PreVote:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@18.196.63.108:26659] signed PreVote:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@18.196.17.223:26659] signed PreVote:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@54.93.169.244:26659] signed PreVote:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@54.93.169.244:26659] signed PreCommit:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@35.158.126.240:26659] signed PreCommit:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@18.196.63.108:26659] signed PreCommit:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@35.156.81.246:26659] signed PreCommit:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]
tmkms[19431]: 09:20:38 [info] [gaia-13004@18.196.17.223:26659] signed PreCommit:2031A4FC h/r/s:67906/0/6 (0 ms) [dup]

I am not sure if this is as expected - perhaps it would be better to not provide a signature for duplicate block ids?

Just for additional info, this is how things look when proposing (5 vals, one if which makes the race):

tmkms[19431]: 12:00:24 [warn] [gaia-13004:35.156.81.246:26659] attempt to double sign at height/round/step: 69564/0/3
tmkms[19431]: 12:00:24 [info] [gaia-13004@18.196.17.223:26659] signed Proposal:B38091FB h/r/s:69564/0/3 (0 ms)
tmkms[19431]: 12:00:24 [warn] [gaia-13004:35.158.126.240:26659] attempt to double sign at height/round/step: 69564/0/3
tmkms[19431]: 12:00:24 [warn] [gaia-13004:18.196.63.108:26659] attempt to double sign at height/round/step: 69564/0/3
tmkms[19431]: 12:00:24 [info] [gaia-13004@18.196.17.223:26659] signed PreVote:B38091FB h/r/s:69564/0/6 (0 ms)
tmkms[19431]: 12:00:24 [info] [gaia-13004@35.156.81.246:26659] signed PreVote:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [error] [gaia-13004@tcp://54.93.169.244:26659] attempted double sign: step regression: round regression at height:69564 round:0 last step:6 new step:3
tmkms[19431]: 12:00:24 [info] [gaia-13004@18.196.63.108:26659] signed PreVote:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [info] [gaia-13004@35.158.126.240:26659] signed PreVote:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [info] [gaia-13004@35.158.126.240:26659] signed PreCommit:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [info] [gaia-13004@18.196.17.223:26659] signed PreCommit:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [info] [gaia-13004@35.156.81.246:26659] signed PreCommit:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:24 [info] [gaia-13004@18.196.63.108:26659] signed PreCommit:B38091FB h/r/s:69564/0/6 (0 ms) [dup]
tmkms[19431]: 12:00:25 [info] [gaia-13004@54.93.169.244:26659] signed PreCommit:B38091FB h/r/s:69564/0/6 (0 ms) [dup]

Those are definitely duplicates (the block IDs and h/r/s confirm it) and the "Non-deterministic signature" messages occur somewhat routinely during Tendermint consensus.

If you aren't getting permanently jailed you haven't experienced a double signing event.

Note: I will have someone from Tendermint take a look at this to double check before closing this issue

Also note that if I were simply to change the logic to disallow signing blocks with the same h/r/s and block ID, it wouldn't sign precommits after signing a prevote, which would break the KMS entirely.

So Tendermint signatures are currently non deterministic because the message includes a timestamp.

This error message will be seen when Tendermint sees different signatures for the same block id at the same height.. In a setup where multiple full nodes are talking to the same signing oracle, this expected behavior and should be the most fault tolerant behavior.

Thanks for testing this.

Thanks for clarifying @zmanian.

We're planning to run with this setup on cosmoshub-3, and I think others might want to do the same down the line, but I think the log output of Tendermint with errors severity will be a distraction.

Ideally I think this setup should not log consensus errors, so thinking about two options:

  1. KMS side: store what action was signed (prevote/precommit/proposal) and only provide a single signature per <action, height,round,step>.
  2. Simply request tendermint team to log these different signatures with info or debug severity instead of as an error.

Not sure if 1) is even possible, can KMS return an error/empty signature to Tendermint yet?

I think 1 is bad from an availability perspective, and allowing it has no disadvantages other than surfacing to the user what is happening in a non-terrifying way.

The real fix for 2 is to remove the timestamp from signed blocks in which case the signatures will be fully deterministic. I think this is going to happen soon?

can KMS return an error/empty signature to Tendermint yet?

It already returns errors in event a validator requests signing a different block ID at the same height/round/step.

Thanks @tarcieri, see #323 related to last point.

I'll nudge the Tendermint devs to decrease severity level for this.

I think 1 is bad from an availability perspective, and allowing it has no disadvantages other than surfacing to the user what is happening in a non-terrifying way.

On second thought:

I agree availability might be slightly better when signing everything as there is a better chance of getting the signature propagated throughout the network.

However I think signing "aggressively" with the YubiHSM and Ledger is undesirable due to the limited capacity of these devices. On a YubiHSM, signing for a single chain with two validators will be getting close to capacity.

Regarding doing anything about it prior to the release: even if we were to do something, doing it now would be a big invasive change right before a release, where all of the other things I've been doing have been tweaking loglines. I don't think it's worth delaying the release over, and I think it's a risky thing to change at the last minute.

When I did tweak the loglines though, I made all double signing attempts errors, because I do consider them errors.

I think having several validator instances haphazardly hammering the KMS in a completely uncoordinated manner to produce duplicate signatures is not a good steady state of affairs and also makes the KMS the one vital lynchpin preventing double signing.

It would be better to use some external coordination service to elect one validator as active, and handle failover when that validator goes inactive. This is the approach commonly used by e.g. Certificate Transparency log signers. This also provides a belt-and-suspenders approach to preventing double signing.

Until there's good solutions for that, however, I'd agree this is the best we can do. That said we only plan on using this configuration for failover and don't plan on running it in perpetuity.

Agree on wrapping the 0.6.0 release without major changes and thanks for your patience with all these last minute issues.

Since we are already using YubiHSM and thus relying on the tmkms to prevent double signing in software, I am not (very) worried about adding multiple validators. It seems the main risk would be a race condition bypassing the double signing prevention.

Once we are confident this is working well by stress-testing with 5 validators, we'll probably run a setup with two validators and a single tmkms instance to allow for regular maintenance on the validators without missing blocks. Would love to see a specialised solution for this, but I think tmkms can serve this specific purpose well.

Yeah this is something we can definitely do better on.

But it requires some careful thinking on how.

Everything gets way better once we remove timestamps from votes cause then we can just have a warm cache for signatures