tendermint / tmkms

Key Management service for Tendermint Validator nodes

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Network timeouts

tarcieri opened this issue · comments

The Secret Connection implementation in tendermint-rs presently has no notion of timeouts for things like connecting/"dialing" validators or reading from/writing to sockets.

Without timeouts, there are many cases in which the KMS could potentially deadlock indefinitely inside of blocking I/O operations, and at least one user has experienced this.

Here are a few recommendations:

  • futures/tokio: this is the up-and-coming ecosystem solution to this general problem, and the one I'd recommend. It's a full asynchronous event model which solves, among other things, timeouts. When async/await support ships in Rust 1.38 (scheduled to be released in August), migrating from blocking I/O should be fairly straightforward.
  • libc crate + poll(2) system call: if we wanted to stick with blocking I/O, the poll(2) system call, as invoked through the libc crate, can be used to determine I/O readiness prior to performing a blocking I/O call, and also takes a timeout as an argument. This would probably be the lowest impact way to implement timeouts as it wouldn't involve switching away from blocking I/O.
  • Watchdog: the existing threads run in a loop, which can periodically send a heartbeat (via e.g. a channel) to a watchdog thread. The watchdog can kill and restart threads which appear to be deadlocked. I think this might be a good idea in general, but it seems like the nuclear option and it would probably be good only as a last resort, with one of the more graceful approaches above used as the first line of defense. I'm a bit unclear on what the Rust semantics are when e.g. a thread dies while holding a mutex - it may cause a PoisonError which would corrupt the state of the whole program and cause it to crash.

We've seen same behavior in production so would appreciate work on this. :-)

Moving this issue to informalsystems/tendermint-rs#2 since the crate containing the relevant code has been relocated there

reopening here as the relevant code was moved back to the kms

Fixed in v0.7.0 (see #364)