ZK connectivity failure with multiple watchers leads to permanent failure

Question

minkovich opened this issue 7 years ago · comments

Setup:
6 nerve service watchers on the same instance connected to the same ZK pool

How to reproduce:

The instance has a problem connecting to ZK.
Nerve then logs:
Nerve::Nerve: nerve: watcher service1 not alive; reaping and relaunching
Nerve::ServiceWatcher: nerve: stopping service watch service1
Nerve::Nerve: nerve: could not reap service1, got #<Zookeeper::Exceptions::NotConnected: Zookeeper::Exceptions::NotConnected>
This continues in a loop for each service watcher until nerve is restarted.

Actual problem:
The problem is that start() in zookeeper.rb does not check whether the ZK connection is alive before re-using it.
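
The missing liveness check described above could look roughly like the sketch below. This is not Nerve's actual code: `CachedConnectionPool`, its factory block, and `FakeConnection` are hypothetical names made up for illustration, and the stand-in connection only assumes a client that answers `connected?` and `close!`. The idea is simply that a shared, cached handle is tested before it is handed to another watcher, and replaced if it has gone stale.

```ruby
# Hypothetical sketch: a shared connection cache that checks liveness before reuse.
# CachedConnectionPool and the factory block are illustrative, not Nerve's code.
class CachedConnectionPool
  def initialize(&factory)
    @factory = factory   # builds a new connection for a host string
    @pool = {}           # host string => cached connection
    @lock = Mutex.new    # several watchers may ask for a connection at once
  end

  # Return a live connection for +hosts+, replacing a cached one that has gone stale.
  def fetch(hosts)
    @lock.synchronize do
      conn = @pool[hosts]
      if conn && !conn.connected?
        # Best-effort close; the handle is already broken, so ignore errors.
        begin
          conn.close!
        rescue StandardError
          nil
        end
        conn = nil
      end
      @pool[hosts] = conn || @factory.call(hosts)
    end
  end
end

# Stand-in for a ZooKeeper client so the sketch runs without a cluster.
FakeConnection = Struct.new(:hosts, :alive) do
  def connected?
    alive
  end

  def close!
    self.alive = false
  end
end

pool = CachedConnectionPool.new { |hosts| FakeConnection.new(hosts, true) }
first = pool.fetch("zk1:2181")
first.alive = false                 # simulate losing the ZK session
second = pool.fetch("zk1:2181")     # stale handle is discarded and replaced
puts second.equal?(first)           # false
puts second.connected?              # true
```

With a check like this, all six watchers on the instance would pick up the replacement connection on their next relaunch instead of repeatedly reusing the dead one.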

Joseph Lynch · Answer 1 · Thu Apr 20 2017 06:24:16 GMT+0800 (China Standard Time)

@minkovich what is your desired behavior here? I suppose that we would like it if Nerve threw out the bad cached connection and tried again?

If the cluster is just not reachable this would lead to a similar infinite retry loop, but perhaps crash-recover is sufficient here?

minkovich · Answer 2 · Thu Apr 27 2017 21:48:54 GMT+0800 (China Standard Time)

@jolynch The efficient solution would be for Nerve to throw away the bad connection, but honestly a crash-and-recover would be equivalent in this situation, since connectivity was already lost.
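
The crash-recover option discussed above can be sketched as a bounded retry that gives up loudly rather than looping forever. Everything here is an assumption for illustration: `connect_with_retries` and `ConnectionUnavailable` are hypothetical names, and the block passed in stands for whatever actually opens the ZK connection. Once the attempts are used up, the error propagates so the process can exit and be restarted by its supervisor.

```ruby
# Hypothetical sketch: bounded reconnect attempts, then fail loudly.
class ConnectionUnavailable < StandardError; end

# Yields the attempt number to the block; retries on ConnectionUnavailable
# until max_attempts is reached, then re-raises the last error.
def connect_with_retries(max_attempts:, delay: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue ConnectionUnavailable
    raise if attempts >= max_attempts   # give up: let the caller crash
    sleep(delay)
    retry
  end
end

# Transient failure: succeeds on the third attempt.
calls = 0
result = connect_with_retries(max_attempts: 3) do |n|
  calls += 1
  raise ConnectionUnavailable, "no quorum" if n < 3
  :connected
end

# Unreachable cluster: the error surfaces after two attempts.
gave_up = nil
begin
  connect_with_retries(max_attempts: 2) { raise ConnectionUnavailable, "unreachable" }
rescue ConnectionUnavailable => e
  gave_up = e.message
end
```

The bound is what separates this from the loop in the report: a transient outage is absorbed, while an unreachable cluster ends in a crash instead of an endless reap-and-relaunch cycle.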

Rushy R. Panchal · Answer 3 · Thu Feb 13 2020 10:07:59 GMT+0800 (China Standard Time)

Closed because this was fixed in #113.