tendermint / tmkms

Key Management service for Tendermint Validator nodes

Errors when connecting to multiple validators with same chain-id

mdyring opened this issue

When two validator entries are configured for the same network, tmkms logs an error when (I assume) both validators attempt to propose a block. It also disconnects the validator whose request arrives second, which is undesirable for HA.

May 01 15:56:43 kms2 tmkms[18390]: 13:56:43 [ERROR] [gaia-13003@tcp://3.121.125.121:26659] attempted double sign: double sign detected: Attempting to sign a second proposal at height:354800 round:0 step:3 old block id:B91381655C74D670310D7735E8478272E55163A246D294DDDFA1C2752171D66B new block 0A48BC1A7AF2E8423288E579A49F578C29DD2DDB3D0DA3A64D5BD1BA3991FA39
May 01 15:56:44 kms2 tmkms[18390]: 13:56:44 [INFO] KMS node ID: CB3AFD8C5DFA775F16E2CF752C2E10D3F1784289
May 01 15:56:45 kms2 tmkms[18390]: 13:56:45 [WARN] [gaia-13003] 3.121.125.121:26659: unverified validator peer ID! (2AC5127435FF8CE16B70690A7430D570AE7F25D0)

Related to tendermint/tendermint#3583, the Tendermint side is fairly vocal about non-deterministic signatures:

May  1 14:31:57 i-016ed17cbf9fe0bec gaiad[13908]: E[2019-05-01|14:31:57.848] Error attempting to add vote                 module=consensus err="Existing vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 E76B2DEE7AE3 @ 2019-05-01T14:31:57.451041654Z}; New vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 83927177EB9A @ 2019-05-01T14:31:57.352078209Z}: Non-deterministic signature"
May  1 14:31:57 i-016ed17cbf9fe0bec gaiad[13908]: E[2019-05-01|14:31:57.861] Error attempting to add vote                 module=consensus err="Existing vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 E76B2DEE7AE3 @ 2019-05-01T14:31:57.451041654Z}; New vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 83927177EB9A @ 2019-05-01T14:31:57.352078209Z}: Non-deterministic signature"
May  1 14:31:57 i-016ed17cbf9fe0bec gaiad[13908]: E[2019-05-01|14:31:57.878] Error attempting to add vote                 module=consensus err="Existing vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 E76B2DEE7AE3 @ 2019-05-01T14:31:57.451041654Z}; New vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 83927177EB9A @ 2019-05-01T14:31:57.352078209Z}: Non-deterministic signature"
May  1 14:31:57 i-016ed17cbf9fe0bec gaiad[13908]: E[2019-05-01|14:31:57.968] Error attempting to add vote                 module=consensus err="Existing vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 E76B2DEE7AE3 @ 2019-05-01T14:31:57.451041654Z}; New vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 83927177EB9A @ 2019-05-01T14:31:57.352078209Z}: Non-deterministic signature"
May  1 14:31:57 i-016ed17cbf9fe0bec gaiad[13908]: E[2019-05-01|14:31:57.973] Error attempting to add vote                 module=consensus err="Existing vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 E76B2DEE7AE3 @ 2019-05-01T14:31:57.451041654Z}; New vote: Vote{14:5B2ECC280D35 355179/00/2(Precommit) 73FB26ED5F34 83927177EB9A @ 2019-05-01T14:31:57.352078209Z}: Non-deterministic signature"

Assuming multiple active validators are to be supported, I believe the behavior should be modified to not disconnect.

It would probably also be beneficial to have a mechanism whereby KMS can tell Tendermint "I won't sign this block, but please don't panic" (or something to that effect).
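To make the idea concrete, here is a minimal sketch of how a signer could track the last signed height/round/step and answer a conflicting request with a non-fatal refusal instead of treating it as a connection-ending error. This is not tmkms's actual code: `Hrs`, `SignOutcome`, `SignState`, and `request` are hypothetical names invented for illustration, block IDs are plain strings, and vote timestamps (the cause of the "Non-deterministic signature" errors above) are ignored.

```rust
/// Consensus position of a signing request (hypothetical type).
/// Deriving `Ord` compares fields in declaration order: height, round, step.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hrs {
    pub height: u64,
    pub round: u32,
    pub step: u8,
}

/// Hypothetical reply type: a refusal is an ordinary answer, not a fatal error.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SignOutcome {
    /// First request seen at this position; the signer would sign it.
    Signed,
    /// Same position and same block as the last signed request.
    Repeat,
    /// Request refused; the connection would stay open.
    Declined { reason: String },
}

/// Remembers the last position and block ID that was signed.
#[derive(Debug, Default)]
pub struct SignState {
    last: Option<(Hrs, String)>,
}

impl SignState {
    /// Decide whether a request at `hrs` for `block_id` may be signed.
    pub fn request(&mut self, hrs: Hrs, block_id: &str) -> SignOutcome {
        if let Some((last_hrs, last_block)) = &self.last {
            if hrs < *last_hrs {
                return SignOutcome::Declined {
                    reason: format!("request at {:?} is behind last signed {:?}", hrs, last_hrs),
                };
            }
            if hrs == *last_hrs {
                if block_id == last_block.as_str() {
                    return SignOutcome::Repeat;
                }
                return SignOutcome::Declined {
                    reason: format!("already signed block {} at {:?}", last_block, hrs),
                };
            }
        }
        // Strictly newer position (or nothing signed yet): record it and sign.
        self.last = Some((hrs, block_id.to_string()));
        SignOutcome::Signed
    }
}
```

With two validators sharing one state, the first proposal at a given height/round/step would get `Signed` and the second, for a different block, would get `Declined`. How Tendermint should react to such a reply is exactly the open question here.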

Some other related issues:

I'd agree this is the simplest first step to support an HA validator setup.

Before we start going down any particular path in this regard, I think it'd be good to have a rough high-level plan from the Tendermint team regarding how HA setups like this should work in general. Notably, this approach leaves little margin for error if there's ever a bug in the KMS's double signing detection.

Another thing to consider is having two KMS instances connecting to both validators. Right now the KMS processes are uncoordinated, so if multiple validator instances deliver signing requests simultaneously, I'd be a bit worried about the potential for double signing, particularly if the validator and KMS hosts are uncoordinated as well. Something needs to be in charge of determining which validator and/or KMS instances are active and signing at a given time.

Some precedent for this sort of thing is Google's Certificate Transparency logs. Google's approach is to run 5 instances of each log in a georeplicated manner, and use their internal Chubby locking service (similar to Zookeeper/etcd) to elect which one is active at a given time. That sort of approach seems safer to me. CT faces a similar "double signing" risk (see this example of where things went wrong).
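As a rough illustration of the "elect one active instance" idea, here is a toy lease in Rust. It is only an in-memory sketch under stated assumptions: `Lease` and `try_acquire` are hypothetical names, the instance IDs are made up, and a real deployment would delegate this to an external lock service such as etcd or ZooKeeper and would also have to deal with clock skew and stale holders, which this ignores.

```rust
use std::time::{Duration, Instant};

/// Toy lease: at most one instance holds it until it expires (hypothetical type).
#[derive(Debug)]
pub struct Lease {
    /// Current holder's instance ID and the time its lease expires.
    holder: Option<(String, Instant)>,
    /// How long each successful acquire or renewal lasts.
    ttl: Duration,
}

impl Lease {
    pub fn new(ttl: Duration) -> Self {
        Lease { holder: None, ttl }
    }

    /// Try to acquire or renew the lease for `id` at time `now`.
    /// Returns true if `id` is the active signer afterwards.
    pub fn try_acquire(&mut self, id: &str, now: Instant) -> bool {
        let granted = match &self.holder {
            // Nobody holds the lease yet.
            None => true,
            // The current holder may renew; anyone may take over after expiry.
            Some((current, expiry)) => current.as_str() == id || now >= *expiry,
        };
        if granted {
            self.holder = Some((id.to_string(), now + self.ttl));
        }
        granted
    }
}
```

Only the instance for which `try_acquire` returns true would answer signing requests; the others would stay connected but idle until the lease lapses.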

Our use-case is an active/passive KMS setup, with the active node connecting to two+ validators.
Failing over the KMS would require manual intervention.

Specifically, we are not looking to run multiple active KMS instances, nor to connect multiple KMS processes to a single Tendermint instance.

Since I assume the KMS codebase will stabilize, it should be fairly straightforward to keep the KMS running continuously. The big advantage of Tendermint HA would be the ability to update validators without downtime, which I'd expect to be a much more frequent need.

I think this setup is a very good starting point for Tendermint HA and would greatly improve the validator operation experience. It seems it is almost supported already, just ideally without the KMS disconnect ;-).