tendermint / tmkms

Key Management service for Tendermint Validator nodes

tmkms 0.6.3 hang

mdyring opened this issue

On 0.6.3 we just experienced a "hung" tmkms process.

It was resolved by restarting tmkms through systemd; tmkms responded to the restart immediately.

Both irishub and cosmoshub-2 validators were affected at the same time.

https://twitter.com/validator_net/status/1173769574661201921?s=20

Any ideas appreciated.

Log from tmkms side:

Sep 17 00:58:37 tmkms[19761]: 00:58:37 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreVote:9721DF8490 at h/r/s 1840247/0/6 (157 ms)
Sep 17 00:58:38 tmkms[19761]: 00:58:38 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:5F52EAAB36 at h/r/s 2524094/0/6 (142 ms)
Sep 17 00:58:38 tmkms[19761]: 00:58:38 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:9721DF8490 at h/r/s 1840247/0/6 (142 ms)
Sep 17 00:58:44 tmkms[19761]: 00:58:44 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreVote:477A62C06A at h/r/s 1840248/0/6 (142 ms)
Sep 17 00:58:44 tmkms[19761]: 00:58:44 [info] [irishub@tcp://10.x.x.x:27659] signed PreVote:B5A0E3A798 at h/r/s 2524095/0/6 (247 ms)
Sep 17 00:58:44 tmkms[19761]: 00:58:44 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:B5A0E3A798 at h/r/s 2524095/0/6 (142 ms)
Sep 17 01:05:43 systemd[1]: Stopping tmkms...
Sep 17 01:05:43 systemd[1]: Stopped tmkms.
Sep 17 01:05:43 systemd[1]: Started tmkms.
Sep 17 01:05:43 tmkms[8495]: 01:05:43 [info] tmkms 0.6.3 starting up...
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [keyring:yubihsm] added consensus key cosmosvalconspub1zcjduepqjnnwe2jsywv0kfc97pz04zkm7tc9k2437cde2my3y5js9t7cw9mstfg3sa
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [keyring:yubihsm] added consensus key icp1zcjduepq5herc33r92drzgwjfhjtpxpsp5c2m6n7uj6edemmv8mmh33zs3wqzz9yrn
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [keyring:yubihsm] added consensus key cosmosvalconspub1zcjduepqnjl9z7key970s3am2ehd7ve7xd43k983hl8ga8wg5jkw45star7q86908u
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] KMS node ID: D75EB75242CB4D496B8E984C1D14EA670C762410
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] KMS node ID: DD7036834704E2CFF8C7B35C68F8933D18ECA2E8
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [cosmoshub-2@tcp://10.x.x.x:26659] connected to validator successfully
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [warn] [cosmoshub-2] tcp://10.x.x.x:26659: unverified validator peer ID! (BE094C1DAD534C94BF3A0101A470FC6CB39AFF74)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [irishub@tcp://10.x.x.x:27659] connected to validator successfully
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [warn] [irishub] tcp://10.x.x.x:27659: unverified validator peer ID! (3E323B8E5237E0F5498493AD4BC6A357AA98CD3F)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524143/0/6 (112 ms)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:<nil> at h/r/s 1840288/0/6 (224 ms)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524144/1/6 (112 ms)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:<nil> at h/r/s 1840289/0/6 (112 ms)
Sep 17 01:05:44 tmkms[8495]: 01:05:44 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524145/0/6 (113 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:<nil> at h/r/s 1840290/0/6 (112 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524146/0/6 (112 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:<nil> at h/r/s 1840291/0/6 (112 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524147/0/6 (152 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [cosmoshub-2@tcp://10.x.x.x:26659] signed PreCommit:<nil> at h/r/s 1840292/0/6 (112 ms)
Sep 17 01:05:45 tmkms[8495]: 01:05:45 [info] [irishub@tcp://10.x.x.x:27659] signed PreCommit:<nil> at h/r/s 2524148/0/6 (174 ms)

Validator side:

Sep 17 00:58:38 iris[136308]: I[2019-09-17|02:58:38.587] Executed block                               module=state height=2524094 validTxs=0 invalidTxs=0
Sep 17 00:58:38 iris[136308]: I[2019-09-17|02:58:38.601] Committed state                              module=state height=2524094 txs=0 appHash=D9D5336E7C36CB61751632B94A7BC13D7E5A547FF9A95ACFF8AAFF9745C282A5
Sep 17 00:58:38 gaiad[158813]: I[2019-09-17|02:58:38.964] Executed block                               module=state height=1840247 validTxs=0 invalidTxs=0
Sep 17 00:58:38 gaiad[158813]: I[2019-09-17|02:58:38.985] Committed state                              module=state height=1840247 txs=0 appHash=5CDC2853C1F460A19973AD06F56D4895227EAAFED98EBEDFB2C67304C1C7326A
Sep 17 00:58:45 iris[136308]: I[2019-09-17|02:58:45.547] Executed block                               module=state height=2524095 validTxs=0 invalidTxs=0
Sep 17 00:58:45 iris[136308]: I[2019-09-17|02:58:45.562] Committed state                              module=state height=2524095 txs=0 appHash=C1045F35A8B7180BAA21BAB36C54C940C5C9FF921ACF64DC3B3363705E55E6BC
Sep 17 00:58:48 gaiad[158813]: E[2019-09-17|02:58:48.149] Error signing vote                           module=consensus height=1840248 round=0 vote="Vote{94:EE73A19751D5 1840248/00/2(Precommit) 477A62C06A1E 000000000000 @ 2019-09-17T00:58:45.149701192Z}" err="remote sig
Sep 17 00:58:48 gaiad[158813]: I[2019-09-17|02:58:48.268] Executed block                               module=state height=1840248 validTxs=0 invalidTxs=0
Sep 17 00:58:48 gaiad[158813]: I[2019-09-17|02:58:48.288] Committed state                              module=state height=1840248 txs=0 appHash=7EE1EF6FDC193C4598762A7305BF3BA1C2CE8719A682DDE1CC9FB6DCE5626C81
Sep 17 00:58:51 gaiad[158813]: E[2019-09-17|02:58:51.150] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:58:53 iris[136308]: E[2019-09-17|02:58:53.686] Error signing vote                           module=consensus height=2524096 round=0 vote="Vote{54:A2C16A6BDF92 2524096/00/1(Prevote) A6CC1012B493 000000000000 @ 2019-09-17T00:58:50.633711684Z}" err="remote signer
Sep 17 00:58:54 gaiad[158813]: E[2019-09-17|02:58:54.150] Reconnecting to remote signer failed         module=privval err="accept tcp [::]:26659: i/o timeout"
Sep 17 00:58:54 gaiad[158813]: E[2019-09-17|02:58:54.150] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:58:54 gaiad[158813]: E[2019-09-17|02:58:54.150] error closing socket val connection during reset module=privval err="close tcp 10.x.x.x:26659->10.y.y.y:53510: use of closed network connection"
Sep 17 00:58:56 iris[136308]: E[2019-09-17|02:58:56.686] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:58:57 gaiad[158813]: E[2019-09-17|02:58:57.151] Reconnecting to remote signer failed         module=privval err="accept tcp [::]:26659: i/o timeout"
Sep 17 00:58:57 gaiad[158813]: E[2019-09-17|02:58:57.151] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:58:57 gaiad[158813]: E[2019-09-17|02:58:57.151] error closing socket val connection during reset module=privval err="close tcp 10.x.x.x:26659->10.y.y.y:53510: use of closed network connection"
Sep 17 00:58:59 iris[136308]: E[2019-09-17|02:58:59.686] Error signing vote                           module=consensus height=2524096 round=0 vote="Vote{54:A2C16A6BDF92 2524096/00/2(Precommit) A6CC1012B493 000000000000 @ 2019-09-17T00:58:56.68665161Z}" err="remote signe
Sep 17 00:58:59 iris[136308]: I[2019-09-17|02:58:59.795] Executed block                               module=state height=2524096 validTxs=0 invalidTxs=0
Sep 17 00:58:59 iris[136308]: I[2019-09-17|02:58:59.810] Committed state                              module=state height=2524096 txs=0 appHash=3644ACF700B91C02A1D2B7CCA6617D11457A0A25D092235F449249060C891BA0
Sep 17 00:59:00 gaiad[158813]: E[2019-09-17|02:59:00.151] Reconnecting to remote signer failed         module=privval err="accept tcp [::]:26659: i/o timeout"
Sep 17 00:59:00 gaiad[158813]: E[2019-09-17|02:59:00.151] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:59:00 gaiad[158813]: E[2019-09-17|02:59:00.151] Error signing vote                           module=consensus height=1840249 round=0 vote="Vote{94:EE73A19751D5 1840249/00/1(Prevote) 3E839F2D8CB6 000000000000 @ 2019-09-17T00:58:57.151358768Z}" err="remote signe
Sep 17 00:59:00 gaiad[158813]: E[2019-09-17|02:59:00.151] error closing socket val connection during reset module=privval err="close tcp 10.x.x.x:26659->10.y.y.y:53510: use of closed network connection"
Sep 17 00:59:02 iris[136308]: E[2019-09-17|02:59:02.687] Reconnecting to remote signer failed         module=privval err="accept tcp [::]:27659: i/o timeout"
Sep 17 00:59:02 iris[136308]: E[2019-09-17|02:59:02.687] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:59:02 iris[136308]: E[2019-09-17|02:59:02.687] error closing socket val connection during reset module=privval err="close tcp 10.x.x.x:27659->10.y.y.y:40606: use of closed network connection"
Sep 17 00:59:03 gaiad[158813]: E[2019-09-17|02:59:03.152] Reconnecting to remote signer failed         module=privval err="accept tcp [::]:26659: i/o timeout"
Sep 17 00:59:03 gaiad[158813]: E[2019-09-17|02:59:03.152] Ping                                         module=privval err="remote signer timed out"
Sep 17 00:59:03 gaiad[158813]: E[2019-09-17|02:59:03.152] Error signing vote                           module=consensus height=1840249 round=0 vote="Vote{94:EE73A19751D5 1840249/00/2(Precommit) 3E839F2D8CB6 000000000000 @ 2019-09-17T00:59:03.152323142Z}" err="remote sig
Sep 17 00:59:03 gaiad[158813]: E[2019-09-17|02:59:03.152] error closing socket val connection during reset module=privval err="close tcp 10.x.x.x:26659->10.y.y.y:53510: use of closed network connection"
Sep 17 00:59:03 gaiad[158813]: I[2019-09-17|02:59:03.253] Executed block                               module=state height=1840249 validTxs=1 invalidTxs=0
Sep 17 00:59:03 gaiad[158813]: I[2019-09-17|02:59:03.275] Committed state                              module=state height=1840249 txs=1 appHash=19BE6F13A2A089B2450D85F53FBC2DA8F2B95ECC4F352114EFB97FB7F4C418FA

@mdyring that's the most likely explanation. After some discussion on interchainio/tendermint-rs, we'll be moving the Rust implementation of Secret Connection back into the tendermint/kms repo, at which point we can start playing with an implementation that supports Rust's async/await functionality.

That would be greatly appreciated. :-)

While I love kms, I am worried this missing piece makes it fragile.

I experienced the same situation today.
tmkms hung without any error log.
I looked at the system log, but it contains nothing that helps. Sorry.
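Since the hang leaves no error in the log, one way to notice it is the absence of signing lines: in the tmkms log from the original report, nothing is logged between 00:58:44 and 01:05:43. The following is a minimal sketch of finding such gaps in syslog-style lines. The helper names `seconds_of_day` and `largest_gap` are hypothetical, the sample lines are shortened from the log above, and the sketch assumes all lines fall on the same day.

```rust
/// Parses the syslog-style `HH:MM:SS` field, assumed to be the third
/// whitespace-separated token (as in "Sep 17 00:58:44 tmkms[19761]: ...").
fn seconds_of_day(line: &str) -> Option<u32> {
    let ts = line.split_whitespace().nth(2)?;
    let mut parts = ts.split(':').map(|p| p.parse::<u32>().ok());
    let (h, m, s) = (parts.next()??, parts.next()??, parts.next()??);
    Some(h * 3600 + m * 60 + s)
}

/// Returns the largest gap in seconds between consecutive parseable lines,
/// or `None` if fewer than two lines carry a timestamp.
/// Assumption: no midnight rollover within the input.
fn largest_gap(lines: &[&str]) -> Option<u32> {
    let times: Vec<u32> = lines.iter().filter_map(|l| seconds_of_day(l)).collect();
    times.windows(2).map(|w| w[1].saturating_sub(w[0])).max()
}

fn main() {
    // Shortened versions of lines from the tmkms log in the report.
    let lines = [
        "Sep 17 00:58:44 tmkms[19761]: 00:58:44 [info] signed PreVote",
        "Sep 17 00:58:44 tmkms[19761]: 00:58:44 [info] signed PreCommit",
        "Sep 17 01:05:43 systemd[1]: Stopping tmkms...",
    ];
    // Prints "largest gap: Some(419) s", i.e. just under seven minutes.
    println!("largest gap: {:?} s", largest_gap(&lines));
}
```

A monitoring job could alert when this gap exceeds a few block intervals, which would flag the hang before a manual check does.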

Note that Rust's async / await features will be stabilizing in the next major release after the current one (1.39) and, as such, I'll soon be looking into migrating the Secret Connection implementation used by the KMS to leverage them.

Upstream in https://github.com/interchainio/tendermint-rs/ we've decided to move the Secret Connection implementation used by KMS back into this repository, which should make it much easier to start playing with those features.

First steps towards a proper async timeout implementation on #365
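To illustrate why a timeout matters here: a blocking read on a TCP connection whose peer has gone silent can wait indefinitely, which would look like the hang in the logs above. The sketch below uses only the standard library's `TcpStream::set_read_timeout`. It is not the KMS's Secret Connection code and not the async approach planned in this thread; `read_with_timeout`, `is_timeout`, and `silent_peer_times_out` are hypothetical names chosen for illustration.

```rust
use std::io::{self, Read};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

/// Returns true if `err` is how the standard library reports a read timeout.
/// The kind is platform-dependent: Unix typically yields `WouldBlock`,
/// Windows `TimedOut`.
fn is_timeout(err: &io::Error) -> bool {
    matches!(err.kind(), io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut)
}

/// Hypothetical helper: perform one read, giving up after `timeout`.
/// `timeout` must be non-zero, otherwise `set_read_timeout` returns an error.
fn read_with_timeout(
    stream: &mut TcpStream,
    buf: &mut [u8],
    timeout: Duration,
) -> io::Result<usize> {
    stream.set_read_timeout(Some(timeout))?;
    stream.read(buf)
}

/// Connects to a local peer that accepts but never writes, and reports
/// whether the read gave up with a timeout instead of blocking forever.
fn silent_peer_times_out(timeout: Duration) -> io::Result<bool> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let (_peer, _addr) = listener.accept()?; // kept open, never written to
    let mut buf = [0u8; 16];
    match read_with_timeout(&mut client, &mut buf, timeout) {
        Err(e) if is_timeout(&e) => Ok(true),
        Err(e) => Err(e),
        Ok(_) => Ok(false),
    }
}

fn main() -> io::Result<()> {
    let timed_out = silent_peer_times_out(Duration::from_millis(200))?;
    println!("read timed out: {timed_out}");
    Ok(())
}
```

With a timeout in place, the caller gets an error it can log and react to, for example by dropping the connection and reconnecting, rather than waiting on a peer that will never answer.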

Looking forward to that async implementation, as we've just experienced this issue again today.

https://twitter.com/validator_net/status/1192247910035083264?s=20

What would be the best way to confirm that this is the root cause? I am not seeing any networking-related events in our monitoring that would explain why a TCP connection between the KMS and the validator would suddenly die. If anything, a retransmit should fix it.

Would a core dump be useful to get some stack traces of the hung state next time? If so, let me know, along with the best way to capture one for a Rust program.

@mdyring you can try the latest master and see if it helps.

Separately, I've been meaning to cut a prerelease of what's on master before we start the async work. There are a number of unrelated changes there, and it would be nice to be sure they did not cause regressions first.

(BTW: stable async/await support in Rust shipped a few days ago, so we're ready to go on that front)

I believe this is a dup of #310. Please reopen if you still experience these problems with tmkms v0.7.0.