tendermint / tmkms

Key Management service for Tendermint Validator nodes

Remote signing instability with cross-region usage

tarcieri opened this issue

Quoting @liangping's issues from #204

For various reasons, I have to deploy my validator server (in Tokyo) and my HSM server (in Shanghai) in separate data centers for production. My test results are not stable, as you can see on Hubble, and I guess that "ATEAM" has the same test results as us.

I could not find what causes gaiad to stop signing blocks; both gaiad and tmkms stay alive, and "netstat -nat" shows the connection is still established (see the sketch after this quote). After restarting gaiad, it works again, so I think it is a problem on the gaiad side. I remember that @mdyring reported a similar issue.

But when I put both of them on one server located in Hangzhou, it works very well. Here is the result.
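
A side note on the "netstat -nat" observation above: an idle TCP connection can sit in ESTABLISHED indefinitely even when the peer has silently died, unless keepalives are enabled. Below is a minimal Go sketch of enabling them on a privval-style connection; the address and keepalive period are placeholders, not values from this setup.

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Placeholder endpoint; substitute the actual KMS listen address.
	conn, err := net.DialTimeout("tcp", "10.0.0.1:26658", 5*time.Second)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	tcp := conn.(*net.TCPConn)
	// With keepalives enabled, the kernel probes an idle peer and tears
	// the connection down if it stops answering, so a dead link shows up
	// as a read error rather than a socket stuck in ESTABLISHED.
	if err := tcp.SetKeepAlive(true); err != nil {
		log.Fatal(err)
	}
	if err := tcp.SetKeepAlivePeriod(30 * time.Second); err != nil {
		log.Fatal(err)
	}
	log.Println("connected with TCP keepalives enabled")
}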

@liangping quoting your latency numbers:

[root@localhost ~]# ping 47.245.57.97
PING 47.245.57.97 (47.245.57.97) 56(84) bytes of data.
64 bytes from 47.245.57.97: icmp_seq=1 ttl=50 time=88.1 ms
64 bytes from 47.245.57.97: icmp_seq=2 ttl=50 time=116 ms
64 bytes from 47.245.57.97: icmp_seq=3 ttl=50 time=111 ms
64 bytes from 47.245.57.97: icmp_seq=4 ttl=50 time=99.0 ms
64 bytes from 47.245.57.97: icmp_seq=5 ttl=50 time=109 ms
64 bytes from 47.245.57.97: icmp_seq=6 ttl=50 time=103 ms
64 bytes from 47.245.57.97: icmp_seq=7 ttl=50 time=104 ms
64 bytes from 47.245.57.97: icmp_seq=8 ttl=50 time=106 ms
64 bytes from 47.245.57.97: icmp_seq=9 ttl=50 time=96.0 ms
64 bytes from 47.245.57.97: icmp_seq=10 ttl=50 time=112 ms
64 bytes from 47.245.57.97: icmp_seq=11 ttl=50 time=104 ms
64 bytes from 47.245.57.97: icmp_seq=12 ttl=50 time=105 ms

These look ok albeit somewhat variable. I'm curious if there were latency spikes which correlate to the instability you experienced. If you can reproduce the issue (even operating a testnet) it'd be good to determine if the instability is network-related.
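
If you want to catch such spikes, one rough approach (sketched in Go below, with the target address and the 300 ms threshold as assumed placeholders) is to measure TCP connect time to the KMS endpoint once per second, log anything unusual, and line the timestamps up against the missed blocks:

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Placeholder target and spike threshold; adjust for your link.
	const target = "10.0.0.1:26658"
	const spike = 300 * time.Millisecond

	for {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", target, 3*time.Second)
		rtt := time.Since(start)
		switch {
		case err != nil:
			log.Printf("UNREACHABLE after %v: %v", rtt, err)
		case rtt > spike:
			conn.Close()
			log.Printf("SPIKE: connect took %v", rtt)
		default:
			conn.Close()
			log.Printf("ok: %v", rtt)
		}
		time.Sleep(time.Second)
	}
}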

Quoted from #204:

@mdyring are you suggesting having Ping use public cloud infrastructure (like AWS) to create private links between regions, and then hooking them into our dedicated servers? That's a bit of a hop, but it ensures consistency and less noisy networks. Dedicated private links between DCs are quite pricey; I'm not sure the stability-to-rewards ratio is worth it.

@jacksteroo Yeah, we use AWS Direct Connect to get a private link between the sentries in the cloud and the co-located validator. It is not crazy expensive for our use (Frankfurt/Copenhagen connectivity), but I have no idea how it looks in Asia.

We use this as the primary link, as it is far more deterministic latency- and packet-loss-wise, and then use the Internet as backup.

Thanks @mdyring. I was referring to VPC Peering vs. Direct Connect; I think that has the same effect of blocking out public networks as much as possible. AWS DX is a given, highly recommended. 👍

@tarcieri anyway, I will do more tests next week.
I don't use AWS because AWS does not provide DDoS protection by default, and we would have to buy expensive protection services. Thank you guys for sharing your experiences.

Ok, will close this out for now, but let me know if you can reproduce the issue and I will reopen.

Hi @tarcieri, I ran a single-node blockchain and got the same error as before, again with the validator located in Tokyo and the HSM in Shanghai.

The logs show "remote signer timed out" on the gaiad side.
Gaiad (Tendermint) stops sending signing requests after the timeout.

I[2019-06-17|23:12:26.688] Executed block                               module=state height=3357 validTxs=0 invalidTxs=0
I[2019-06-17|23:12:26.691] Committed state                              module=state height=3357 txs=0 appHash=352AF2FB01C9A4A190F92CF3E89537251B46B37FBF6E4BC187818B412531F396
I[2019-06-17|23:12:32.663] Executed block                               module=state height=3358 validTxs=0 invalidTxs=0
I[2019-06-17|23:12:32.666] Committed state                              module=state height=3358 txs=0 appHash=7F80BE79B2CDAC06213BCE45CBA04C3513230BF0BC1594FF9AB459D451155431
E[2019-06-17|23:12:41.422] Error signing vote                           module=consensus height=3359 round=0 vote="Vote{0:8E78ECF44814 3359/00/1(Prevote) 23F99E4A240F 000000000000 @ 2019-06-17T15:12:38.422657502Z}" err="remote signer timed out"
E[2019-06-17|23:12:44.423] Ping                                         module=privval err="remote signer timed out"
E[2019-06-17|23:12:47.600] Couldn't connect to any seeds                module=p2p 
E[2019-06-17|23:13:17.598] Couldn't connect to any seeds                module=p2p 
E[2019-06-17|23:13:47.598] Couldn't connect to any seeds                module=p2p 
E[2019-06-17|23:14:17.598] Couldn't connect to any seeds                module=p2p 
E[2019-06-17|23:14:47.600] Couldn't connect to any seeds                module=p2p 

It seems the HSM is still waiting for the block commit from gaiad but never receives it, unless gaiad is restarted.

00:35:21 [DEBUG] tmkms::session: started handling request ... 
00:35:23 [DEBUG] tmkms::session: replying with PingResponse
00:35:23 [DEBUG] tmkms::session: ... success handling request
00:35:23 [DEBUG] tmkms::session: started handling request ... 
00:35:25 [DEBUG] tmkms::session: replying with PingResponse
00:35:25 [DEBUG] tmkms::session: ... success handling request
00:35:25 [DEBUG] tmkms::session: started handling request ... 
00:35:27 [DEBUG] tmkms::session: replying with PingResponse
00:35:27 [DEBUG] tmkms::session: ... success handling request 

@liangping it seems like gaiad stops sending signing requests if any of them time out, which to me seems pretty brittle.

You might ask @liamsi about this, or follow up on some of these Tendermint issues, which I thought were supposed to address this (perhaps they have, but the fix has not shipped in a gaiad release, or your gaiad is too old):

tendermint/tendermint#2923
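
For illustration, here is a hedged sketch of the more resilient behavior those issues argue for. This is not gaiad's actual privval code: signWithRetry and the raw Write/Read framing are stand-ins for the real wire protocol. The point is that a deadline error is treated as a broken connection to redial, not a signal to stop sending requests while the socket sits in ESTABLISHED:

package main

import (
	"errors"
	"log"
	"net"
	"os"
	"time"
)

// signWithRetry sends one request and reads one response under a 3s
// deadline. On timeout it drops the connection and redials rather than
// going quiet; any other error is surfaced to the caller.
func signWithRetry(addr string, payload []byte) ([]byte, error) {
	for attempt := 1; attempt <= 3; attempt++ {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			log.Printf("dial failed (attempt %d): %v", attempt, err)
			continue
		}
		conn.SetDeadline(time.Now().Add(3 * time.Second))
		if _, err = conn.Write(payload); err == nil {
			buf := make([]byte, 4096)
			var n int
			if n, err = conn.Read(buf); err == nil {
				conn.Close()
				return buf[:n], nil
			}
		}
		conn.Close()
		if !errors.Is(err, os.ErrDeadlineExceeded) {
			return nil, err // a real protocol error, not a slow link
		}
		log.Printf("remote signer timed out (attempt %d), redialing", attempt)
	}
	return nil, errors.New("remote signer unreachable after retries")
}

func main() {
	// Placeholder endpoint and payload, for demonstration only.
	if _, err := signWithRetry("10.0.0.1:26658", []byte("sign-request")); err != nil {
		log.Fatal(err)
	}
}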

Regardless, you probably don't want signing operations to time out. It sounds like you may be experiencing periods of elevated latency between your datacenters. I'd personally suggest either colocating the KMS closer to your validators, or ensuring you have a high-speed private network (or cloud interconnect) between the datacenter facilities where your validators are running.

Thanks for helping!
I will keep reproducing this issue to find out exactly what is happening.