Restarting tmkms leads to Tendermint "remote signer timed out"
mdyring opened this issue · comments
Hi guys,
Sorry, I'm not sure whether the right place to report this is the Tendermint or the KMS repo. It seems very KMS-specific, so I'm trying here first.
I'm doing some testing and have tmkms running as a systemd service alongside gaia. When I restart tmkms, Tendermint does not recover gracefully and needs to be restarted as well.
Nov 24 22:50:04 val2 gaiad[86004]: I[24116-11-24|21:50:04.406] Starting BlockPool module=blockchain impl=BlockPool
Nov 24 22:50:04 val2 gaiad[86004]: I[24116-11-24|21:50:04.406] Starting IndexerService module=txindex impl=IndexerService
Nov 24 22:52:08 val2 systemd[1]: Stopping Tendermint KMS Service...
Nov 24 22:52:08 val2 systemd[1]: Stopped Tendermint KMS Service.
Nov 24 22:52:08 val2 systemd[1]: Started Tendermint KMS Service.
Nov 24 22:52:08 val2 gaiad[86004]: E[24116-11-24|21:52:08.402] Ping module=privval err=EOF
Nov 24 22:52:08 val2 kernel: usb 1-3: reset full-speed USB device number 23 using xhci_hcd
Nov 24 22:52:08 val2 kernel: usb 1-10: reset full-speed USB device number 5 using xhci_hcd
Nov 24 22:52:09 val2 kernel: usb 1-10: reset full-speed USB device number 5 using xhci_hcd
Nov 24 22:52:10 val2 gaiad[86004]: E[24116-11-24|21:52:10.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:12 val2 gaiad[86004]: E[24116-11-24|21:52:12.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:14 val2 gaiad[86004]: E[24116-11-24|21:52:14.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:16 val2 gaiad[86004]: E[24116-11-24|21:52:16.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:18 val2 gaiad[86004]: E[24116-11-24|21:52:18.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:20 val2 gaiad[86004]: E[24116-11-24|21:52:20.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:22 val2 gaiad[86004]: E[24116-11-24|21:52:22.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:24 val2 gaiad[86004]: E[24116-11-24|21:52:24.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:26 val2 gaiad[86004]: E[24116-11-24|21:52:26.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:28 val2 gaiad[86004]: E[24116-11-24|21:52:28.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:30 val2 gaiad[86004]: E[24116-11-24|21:52:30.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:32 val2 gaiad[86004]: E[24116-11-24|21:52:32.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:34 val2 gaiad[86004]: E[24116-11-24|21:52:34.401] Ping module=privval err="remote signer timed out"
Nov 24 22:52:36 val2 gaiad[86004]: E[24116-11-24|21:52:36.401] Ping module=privval err="remote signer timed out"
When I restart gaiad after the above errors, the following also shows up:
Nov 24 22:57:26 val2 systemd[1]: Started Gaia Service.
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.468] Starting ABCI with Tendermint module=main
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting multiAppConn module=proxy impl=multiAppConn
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient module=abci-client connection=query impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient module=abci-client connection=mempool impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] Starting localClient module=abci-client connection=consensus impl=localClient
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] ABCI Handshake App Info module=consensus height=0 hash= software-version= protocol-version=0
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.480] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=0 stateHeight=0
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.544] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=
Nov 24 22:57:26 val2 gaiad[86067]: I[24116-11-24|21:57:26.544] Starting TCPVal module=privval impl=TCPVal
Nov 24 22:57:29 val2 gaiad[86067]: E[24116-11-24|21:57:29.544] OnStart module=privval err="accept tcp 127.0.0.1:26658: i/o timeout"
Nov 24 22:57:29 val2 gaiad[86067]: ERROR: Error starting private validator client: accept tcp 127.0.0.1:26658: i/o timeout
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Main process exited, code=exited, status=1/FAILURE
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Unit entered failed state.
Nov 24 22:57:29 val2 systemd[1]: gaia.service: Failed with result 'exit-code'.
Nov 24 22:57:32 val2 systemd[1]: gaia.service: Service hold-off time over, scheduling restart.
Nov 24 22:57:32 val2 systemd[1]: Stopped Gaia Service.
Nov 24 22:57:32 val2 systemd[1]: Started Gaia Service.
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.087] Starting ABCI with Tendermint module=main
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting multiAppConn module=proxy impl=multiAppConn
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient module=abci-client connection=query impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient module=abci-client connection=mempool impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] Starting localClient module=abci-client connection=consensus impl=localClient
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] ABCI Handshake App Info module=consensus height=0 hash= software-version= protocol-version=0
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.100] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=0 stateHeight=0
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.163] Completed ABCI Handshake - Tendermint and App are synced module=consensus appHeight=0 appHash=
Nov 24 22:57:33 val2 gaiad[86096]: I[24116-11-24|21:57:33.163] Starting TCPVal module=privval impl=TCPVal
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.124] This node is not a validator module=consensus addr=2369786F94AECAABEE11A1242A395EC9C6303BF9 pubKey=PubKeyEd25519{6C0B225542087B267B312F09424CF9E58C23519F9EC7B85181E036BB8E20E720}
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] P2P Node ID module=p2p ID=06430257c53430df262d5010a26175db590b4154 file=/config/node_key.json
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting Node module=node impl=Node
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting EventBus module=events impl=EventBus
Nov 24 22:57:36 val2 gaiad[86096]: I[24116-11-24|21:57:36.127] Starting PubSub module=pubsub impl=PubSub
I've also noticed that Tendermint often times out rather quickly while waiting for tmkms, as the log above shows: the accept on 127.0.0.1:26658 fails about 3 seconds after TCPVal starts.
Thanks for the great work so far.
Known issue, but thanks for reporting it! Here's the relevant Tendermint issue:
I hope this gets fixed before the mainnet launch, so that we can use KMS to build a highly available validator.
This was fixed upstream about a month ago, and shouldn't be a problem in e.g. cosmos-sdk v0.29. Are you still experiencing it?
Edit: never mind, I see the discussion on tendermint/tendermint#2923 now.
Note that the issue in this ticket is unrelated: this one was specifically about having Tendermint detect that the KMS socket had closed and automatically reconnect. The issue on tendermint/tendermint#2923 is about error handling.
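To illustrate the behaviour described here (notice that the signer's socket has closed, then wait for a new connection instead of pinging a dead one), here is a minimal sketch in Python. It is not Tendermint's actual privval code: `accept_sessions`, `dial_and_send`, and the 3-second accept timeout are assumptions made for illustration, loosely modelled on the `accept tcp 127.0.0.1:26658` lines in the logs above.

```python
import socket
import threading


def accept_sessions(listener, max_sessions, accept_timeout=3.0):
    """Accept signer connections one at a time.

    When a peer closes its socket (recv returns b""), go back to accept()
    rather than giving up. Returns the bytes received in each session.
    """
    listener.settimeout(accept_timeout)
    sessions = []
    while len(sessions) < max_sessions:
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            continue  # no signer yet: keep waiting instead of exiting
        received = 0
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:  # EOF: the signer went away (e.g. restarted)
                    break
                received += len(data)
        sessions.append(received)
    return sessions


def dial_and_send(port, payload):
    """Hypothetical stand-in for a signer that connects, sends, and exits."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(payload)
```

The point of the design is that an EOF on the connection is treated as a normal event that sends the listener back to `accept()`, so a signer restart does not require restarting the node.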
Yes. To achieve HA and prevent double signing, I deployed three gaiad instances with the same pubkey, plus a corresponding KMS for each gaiad. The three KMS instances share the same "height_current", so any signing request at a height no greater than "height_current" is rejected by the KMS.
As I imagine it, the rejected validator will then sync from the main network. I need to test whether this is feasible.
However, gaiad cannot reconnect after being rejected by the KMS (or after a KMS restart, which I think is the same problem?), so I think this is an important issue. I hope it gets fixed before the mainnet launch. :)
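The shared "height_current" check described in this comment could look roughly like the sketch below. This is only an illustration of the idea, not tmkms code: the `SharedHeightGuard` class and its `try_sign` method are hypothetical names, the shared state is modelled as an in-process object guarded by a lock, and a real signer would presumably also need to consider round and step, not just height.

```python
import threading


class SharedHeightGuard:
    """Hypothetical shared high-water mark consulted by every signer."""

    def __init__(self, height_current=0):
        self._height_current = height_current
        self._lock = threading.Lock()

    def try_sign(self, height):
        """Allow a signature only for a height strictly above the mark."""
        with self._lock:
            if height <= self._height_current:
                return False  # already signed at or past this height
            self._height_current = height
            return True
```

With three signers sharing one guard, only the first request for a given height gets `True`; the other two are rejected, which is the point at which the rejected gaiad would need to be able to reconnect.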