tendermint / tmkms

Key Management service for Tendermint Validator nodes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Restarting tmkms leads to Tendermint "remote signer timed out"

ljk662 opened this issue · comments

This is a duplicate of the closed issue 116.

I am still experiencing this issue running gaiad v.2.0.3 with tendermint/kms v0.7.0.

After restarting my remote signer I see the following in my logs.

E[2019-12-29|19:14:16.928] SignerListener: Ping timeout                 module=privval
I[2019-12-29|19:14:17.473] Timed out                                    module=consensus dur=4.96053053s height=133946 round=0 step=RoundStepNewHeight
E[2019-12-29|19:14:19.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:22.929] SignerClient::GetPubKey                      module=privval err="endpoint connection timed out"
E[2019-12-29|19:14:22.929] CONSENSUS FAILURE!!!                         module=consensus err="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 976 [running]:\nruntime/debug.Stack(0xc000eb9358, 0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9d\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine.func2(0xc000074700, 0x138f910)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:602 +0x57\npanic(0x102fa00, 0x1d768e0)\n\t/usr/local/go/src/runtime/panic.go:679 +0x1b2\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).signAddVote(0xc000074700, 0x20b02, 0xc003f97e00, 0x20, 0x20, 0x1, 0xc003f97e20, 0x20, 0x20, 0xa)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1745 +0x585\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).enterPrecommit(0xc000074700, 0x20b3a, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1155 +0x12ec\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0xc000eb9a38, 0xbff162, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1663 +0xdfa\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc000074700, 0xc000d8b400, 0xc000a574d0, 0x28, 0x4922ea91d59af208, 0x109, 0x101)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:1527 +0x59\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc000074700, 0x1545be0, 0xc004be4160, 0xc000a574d0, 0x28)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:691 +0x252\ngithub.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc000074700, 0x0)\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:633 +0x6eb\ncreated by github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart\n\t/go/pkg/mod/github.com/tendermint/tendermint@v0.32.7/consensus/state.go:334 +0x13a\n"
I[2019-12-29|19:14:22.929] Stopping baseWAL                             module=consensus wal=/root/.gaiad/data/cs.wal/wal impl=baseWAL
I[2019-12-29|19:14:22.931] Stopping Group                               module=consensus wal=/root/.gaiad/data/cs.wal/wal impl=Group
E[2019-12-29|19:14:25.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:28.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:31.929] SignerListener: Ping timeout                 module=privval
E[2019-12-29|19:14:34.929] SignerListener: Ping timeout                 module=privval

The node has never recovered from this unless I restart gaiad process.

This is a Tendermint-side issue. The CONSENSUS FAILURE!!! and runtime error: invalid memory address or nil pointer dereference looks particularly bad.

I'd suggest filing an issue at https://github.com/tendermint/tendermint/issues

Looking at it again, perhaps I may have closed this too soon. Can you post the KMS-side logs and we can see if there's any errors/misconfiguration there?

Thanks, Wasn't sure where to report. I will make an issue on the tendermint project as well.

Below are the logs during the same time I received a consensus failure on the node:

tmkms[1]: 11:22:33 [debug] [cosmoshub-3:tcp://url.com] received request: ShowPublicKey(PubKeyRequest)
tmkms[1]: 11:22:33 [debug] [cosmoshub-3:tcp://url.com] sending response: PublicKey(PubKeyResponse { pub_key_ed25519: [45, 185, 58, ...
tmkms[1]: 11:22:33 [debug] cosmoshub-3: Connecting to tcp://url.com
tmkms[1]: 11:22:33 [info] KMS node ID: XXXXX
tmkms[1]: 11:22:33 [debug] cosmoshub-3: Connecting to tcp://url.com
tmkms[1]: 11:22:33 [info] KMS node ID: XXXXX

And my signer config

[[chain]]
id = "cosmoshub-3"
key_format = { type = "bech32", account_key_prefix = "cosmospub", consensus_key_prefix = "cosmosvalconspub" }

[[validator]]
addr = "tcp://signer.cosmos:12345"
chain_id = "cosmoshub-3"
reconnect = true
secret_key = "./secret_connection.key"

[[providers.yubihsm]]
adapter = { type = "XXX" }
auth = { key = X, password = "XXXXX" }
keys = [{ chain_ids = ["cosmoshub-3"], key = X }]

No obvious culprits there, so yeah, I'd suggest filing a Tendermint issue