tendermint / tmkms

Key Management service for Tendermint Validator nodes

Restarting tmkms leads to Tendermint "remote signer timed out"

mdyring opened this issue

Hi guys,

Sorry, I'm not sure whether the right place to report this is the Tendermint repo or the KMS repo. It seems very KMS-specific, so I'm trying here first.

I'm doing some testing with tmkms running as a systemd service alongside gaia. When I restart tmkms, Tendermint does not recover gracefully and needs to be restarted as well.

Nov 24 22:50:04 val2 gaiad[86004]: I[24116-11-24|21:50:04.406] Starting BlockPool                           module=blockchain impl=BlockPool
Nov 24 22:50:04 val2 gaiad[86004]: I[24116-11-24|21:50:04.406] Starting IndexerService                      module=txindex impl=IndexerService
Nov 24 22:52:08 val2 systemd[1]: Stopping Tendermint KMS Service...
Nov 24 22:52:08 val2 systemd[1]: Stopped Tendermint KMS Service.
Nov 24 22:52:08 val2 systemd[1]: Started Tendermint KMS Service.
Nov 24 22:52:08 val2 gaiad[86004]: E[24116-11-24|21:52:08.402] Ping                                         module=privval err=EOF
Nov 24 22:52:08 val2 kernel: usb 1-3: reset full-speed USB device number 23 using xhci_hcd
Nov 24 22:52:08 val2 kernel: usb 1-10: reset full-speed USB device number 5 using xhci_hcd
Nov 24 22:52:09 val2 kernel: usb 1-10: reset full-speed USB device number 5 using xhci_hcd
Nov 24 22:52:10 val2 gaiad[86004]: E[24116-11-24|21:52:10.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:12 val2 gaiad[86004]: E[24116-11-24|21:52:12.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:14 val2 gaiad[86004]: E[24116-11-24|21:52:14.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:16 val2 gaiad[86004]: E[24116-11-24|21:52:16.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:18 val2 gaiad[86004]: E[24116-11-24|21:52:18.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:20 val2 gaiad[86004]: E[24116-11-24|21:52:20.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:22 val2 gaiad[86004]: E[24116-11-24|21:52:22.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:24 val2 gaiad[86004]: E[24116-11-24|21:52:24.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:26 val2 gaiad[86004]: E[24116-11-24|21:52:26.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:28 val2 gaiad[86004]: E[24116-11-24|21:52:28.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:30 val2 gaiad[86004]: E[24116-11-24|21:52:30.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:32 val2 gaiad[86004]: E[24116-11-24|21:52:32.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:34 val2 gaiad[86004]: E[24116-11-24|21:52:34.401] Ping                                         module=privval err="remote signer timed out"
Nov 24 22:52:36 val2 gaiad[86004]: E[24116-11-24|21:52:36.401] Ping                                         module=privval err="remote signer timed out"

When restarting gaia after the above errors, this also shows up:

Nov 24 22:57:26 val2 systemd[1]: Started Gaia Service.
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.468] Starting ABCI with Tendermint                module=main
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting multiAppConn                        module=proxy impl=multiAppConn
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient                         module=abci-client connection=query impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient                         module=abci-client connection=mempool impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient                         module=abci-client connection=consensus impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] ABCI Handshake App Info                      module=consensus height=0 hash= software-version= protocol-version=0
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.544] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.544] Starting TCPVal                              module=privval impl=TCPVal
Nov 24 22:57:29 val2 gaiad[86067]: E[24116-11-24|21:57:29.544] OnStart                                      module=privval err="accept tcp 127.0.0.1:26658: i/o timeout"
Nov 24 22:57:29 val2 gaiad[86067]: ERROR: Error starting private validator client: accept tcp 127.0.0.1:26658: i/o timeout
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Unit entered failed state.
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Failed with result 'exit-code'.
Nov 24 22:57:32 val2 systemd[1]: gaia.service: Service hold-off time over, scheduling restart.
Nov 24 22:57:32 val2 systemd[1]: Stopped Gaia Service.
Nov 24 22:57:32 val2 systemd[1]: Started Gaia Service.
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.087] Starting ABCI with Tendermint                module=main
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting multiAppConn                        module=proxy impl=multiAppConn
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient                         module=abci-client connection=query impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient                         module=abci-client connection=mempool impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient                         module=abci-client connection=consensus impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] ABCI Handshake App Info                      module=consensus height=0 hash= software-version= protocol-version=0
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] ABCI Replay Blocks                           module=consensus appHeight=0 storeHeight=0 stateHeight=0
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.163] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.163] Starting TCPVal                              module=privval impl=TCPVal
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.124] This node is not a validator                 module=consensus addr=2369786F94AECAABEE11A1242A395EC9C6303BF9 pubKey=PubKeyEd25519{6C0B225542087B267B312F09424CF9E58C23519F9EC7B85181E036BB8E20E720}
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] P2P Node ID                                  module=p2p ID=06430257c53430df262d5010a26175db590b4154 file=/config/node_key.json
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting Node                                module=node impl=Node
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting EventBus                            module=events impl=EventBus
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting PubSub                              module=pubsub impl=PubSub

I notice that Tendermint often times out rather quickly while waiting for tmkms, as the log above also shows.

Thanks for the great work so far.

Known issue, but thanks for reporting it! Here's the relevant Tendermint issue:

tendermint/tendermint#2876

I hope it will be fixed before the main network launch, so we can use KMS to build validators with high availability.

This was fixed upstream about a month ago, and shouldn't be a problem in e.g. cosmos-sdk v0.29. Are you still experiencing it?

Edit: never mind, I see the discussion on tendermint/tendermint#2923 now.

Note that the particular issue in this ticket is unrelated: this ticket was specifically about having Tendermint detect that the KMS socket had closed and automatically reconnect. The issue in tendermint/tendermint#2923 is about error handling.
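To make the "detect the closed socket and reconnect" behaviour concrete, here is a minimal Python sketch of a listener that treats EOF from the signer as the end of a session and goes back to accepting, rather than staying stuck on a dead connection. It is not Tendermint's or tmkms's actual code; the names `accept_signer_sessions` and `demo` are hypothetical, and the "signer" is just a thread that connects, sends one message, and disconnects twice to imitate a restart.

```python
import socket
import threading


def accept_signer_sessions(listener, on_message, max_sessions):
    """Serve signer connections one at a time, re-accepting after each EOF."""
    completed = 0
    while completed < max_sessions:
        conn, _addr = listener.accept()  # block until a signer dials in
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:  # empty read means the peer closed the socket
                    break
                on_message(data)
        completed += 1  # session over: loop back and accept again
    return completed


def demo():
    """Simulate a signer that connects, exits, restarts, and connects again."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    listener.settimeout(5)  # fail instead of hanging if nobody connects
    address = listener.getsockname()

    def signer():
        # Two short-lived connections stand in for "signer restarted".
        for payload in (b"ping-1", b"ping-2"):
            with socket.create_connection(address) as s:
                s.sendall(payload)

    received = []
    worker = threading.Thread(target=signer)
    worker.start()
    sessions = accept_signer_sessions(listener, received.append, 2)
    worker.join()
    listener.close()
    return sessions, received
```

Calling `demo()` returns the number of completed sessions and the messages received across both connections. The point of the sketch is only the control flow: an empty read ends the current session and the loop returns to `accept()`, so a signer restart does not require restarting the listening side.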

Yes. To achieve HA and prevent double signing, I deployed three gaiad instances with the same pubkey, and deployed a corresponding KMS for each gaiad. The three KMS instances share the same "height_current", so any signing request for a height no greater than "height_current" will be rejected by the KMS.

As I imagine it, the rejected validator will then sync from the main network. I need to test whether this is feasible.

But gaiad cannot reconnect after being rejected by the KMS (or after a KMS restart, which I think is the same problem?), so I think this is an important issue. I hope it gets fixed before the main network launch. :)
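The shared "height_current" check described in the comment above could be sketched roughly as follows. This is only an illustration of the idea, not how tmkms implements it: the class `HeightGuard` and its method `try_sign` are hypothetical names, the lock only protects callers within one process (three separate KMS processes would need some shared, atomically updated store), and a real signer would likely also need to track round and step, not just height.

```python
import threading


class HeightGuard:
    """Shared high-water mark: refuse to sign at or below the last signed height."""

    def __init__(self, height_current=0):
        self._lock = threading.Lock()
        self.height_current = height_current

    def try_sign(self, height):
        """Return True and advance the mark if `height` is new, else False."""
        with self._lock:
            if height <= self.height_current:
                return False  # already signed at this height or later: reject
            self.height_current = height
            return True


guard = HeightGuard()
first = guard.try_sign(10)   # accepted: 10 is above the initial mark of 0
second = guard.try_sign(10)  # rejected: another signer already covered height 10
```

Doing the check and the update under one lock is the important part: if two signers could both read the old mark before either writes the new one, both would sign the same height.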