Network timeouts
tarcieri opened this issue · comments
The Secret Connection implementation in tendermint-rs presently has no notion of timeouts for things like connecting/"dialing" validators or reading from/writing to sockets.
Without timeouts, there are many cases in which the KMS could potentially deadlock indefinitely inside of blocking I/O operations, and at least one user has experienced this.
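To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not one of the options listed below, and it assumes a plain `std::net::TcpStream` rather than the actual Secret Connection type. `read_with_timeout` and `silent_peer` are hypothetical names used only for illustration.

```rust
use std::io::{self, Read};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::time::Duration;

/// Connect to `addr` and read up to `buf.len()` bytes, giving up after
/// `timeout` on either step instead of blocking forever.
/// (Hypothetical helper, not part of the KMS or tendermint-rs.)
fn read_with_timeout(addr: &SocketAddr, timeout: Duration, buf: &mut [u8]) -> io::Result<usize> {
    // Bounded "dial": fails if the connection is not established in time.
    let mut stream = TcpStream::connect_timeout(addr, timeout)?;
    // Bounded I/O: later reads and writes return an error instead of hanging.
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;
    stream.read(buf)
}

/// Hypothetical stand-in for an unresponsive validator: it listens on a
/// local port, so connections succeed, but it never writes anything.
fn silent_peer() -> io::Result<(TcpListener, SocketAddr)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}
```

Against a peer that never sends data, the read returns an error once the timeout elapses (the error kind is `WouldBlock` on Unix and `TimedOut` on Windows). Without the `set_read_timeout` call, the same read would block indefinitely.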
Here are a few recommendations:
- `futures`/`tokio`: this is the up-and-coming ecosystem solution to this general problem, and the one I'd recommend. It's a full asynchronous event model which solves, among other things, timeouts. When `async`/`await` support ships in Rust 1.38 (scheduled to be released in August), migrating from blocking I/O should be fairly straightforward.
- `libc` crate + `poll(2)` system call: if we wanted to stick with blocking I/O, the `poll(2)` system call, as invoked through the `libc` crate, can be used to determine I/O readiness prior to performing a blocking I/O call, and also takes a timeout as an argument. This would probably be the lowest impact way to implement timeouts as it wouldn't involve switching away from blocking I/O.
- Watchdog: the existing threads run in a loop, which can periodically send a heartbeat (via e.g. a channel) to a watchdog thread. The watchdog can kill and restart threads which appear to be deadlocked. I think this might be a good idea in general, but it seems like the nuclear option and it would probably be good only as a last resort, with one of the more graceful approaches above used as the first line of defense. I'm a bit unclear on what the Rust semantics are when e.g. a thread dies while holding a mutex - it may cause a `PoisonError` which would corrupt the state of the whole program and cause it to crash.
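The watchdog idea and the mutex question can be sketched together with the standard library alone. This is only an illustration under stated assumptions: `Verdict`, `watch`, and `spawn_worker` are hypothetical names, the heartbeat is a unit value sent over `std::sync::mpsc`, and the worker panics on purpose while holding a lock so the poisoning behaviour can be observed.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// What the watchdog saw before it stopped waiting (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    /// The worker went quiet for longer than the allowed interval.
    Stalled { beats: u32 },
    /// The worker exited or panicked and dropped its end of the channel.
    Gone { beats: u32 },
}

/// Count heartbeats until one is late or the sender disappears.
fn watch(heartbeats: mpsc::Receiver<()>, max_silence: Duration) -> Verdict {
    let mut beats = 0;
    loop {
        match heartbeats.recv_timeout(max_silence) {
            Ok(()) => beats += 1,
            Err(RecvTimeoutError::Timeout) => return Verdict::Stalled { beats },
            Err(RecvTimeoutError::Disconnected) => return Verdict::Gone { beats },
        }
    }
}

/// Hypothetical worker: does `beats` units of work, sending a heartbeat
/// after each one, then panics while holding the lock on `state`.
fn spawn_worker(
    state: Arc<Mutex<u64>>,
    beats: u32,
) -> (mpsc::Receiver<()>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for _ in 0..beats {
            *state.lock().unwrap() += 1;
            tx.send(()).unwrap();
        }
        // The guard is released during unwinding, which poisons the mutex.
        let _guard = state.lock().unwrap();
        panic!("simulated failure while holding the lock");
    });
    (rx, handle)
}
```

Calling `watch` on the receiver returned by `spawn_worker(state, 3)` yields `Verdict::Gone { beats: 3 }`. Afterwards `state.is_poisoned()` is true and `state.lock()` returns `Err(PoisonError)`. Poisoning does not by itself corrupt memory or crash the program: the process only panics if the caller unwraps that error, and the data is still reachable through `PoisonError::into_inner`. Note also that `std::thread` has no API for forcibly killing a thread, so a watchdog like this can detect a stall but would need some other mechanism (for example restarting the process) to recover from one.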
We've seen the same behavior in production, so we would appreciate work on this. :-)
Moving this issue to informalsystems/tendermint-rs#2, since the crate containing the relevant code has been relocated there.
Reopening here, as the relevant code was moved back to the KMS.
Fixed in v0.7.0 (see #364)