airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse

ZK connectivity failure with multiple watchers leads to permanent failure

minkovich opened this issue · comments

Setup:
6 nerve service watchers on the same instance connected to the same ZK pool

How to reproduce:

  1. The instance has a problem connecting to ZK
  2. Nerve logs the following:
    Nerve::Nerve: nerve: watcher service1 not alive; reaping and relaunching
    Nerve::ServiceWatcher: nerve: stopping service watch service1
    Nerve::Nerve: nerve: could not reap service1, got #<Zookeeper::Exceptions::NotConnected: Zookeeper::Exceptions::NotConnected>
  3. This continues in a loop for each service watcher until nerve is restarted.

Actual problem:
The problem is that start() in zookeeper.rb does not check whether the ZK connection is alive before re-using it.

@minkovich what is your desired behavior here? I suppose that we would like it if Nerve threw out the bad cached connection and tried again?

If the cluster is just not reachable this would lead to a similar infinite retry loop, but perhaps crash-recover is sufficient here?

@jolynch The efficient solution would be for nerve to throw away the bad connection, but honestly a crash recovery would be equivalent in this situation, since connectivity was already lost.
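
For illustration, here is a minimal sketch of the "throw away the bad connection" behavior discussed above. It assumes connections are cached per ZK host string and shared between watchers. `FakeConnection` and `ConnectionPool` are hypothetical names invented for this example; they are not Nerve's classes, not the zk gem's API, and not necessarily how #113 fixed the issue.

```ruby
# Hypothetical stand-in for a ZooKeeper client (NOT the real zk gem API).
class FakeConnection
  attr_reader :id

  def initialize(id)
    @id = id
    @connected = true
  end

  def connected?
    @connected
  end

  # Simulate losing connectivity to the ZK pool.
  def drop!
    @connected = false
  end

  def close
    @connected = false
  end
end

# Hypothetical cache of connections shared by all watchers, keyed by host string.
class ConnectionPool
  def initialize(&factory)
    @factory = factory
    @pool = {}
    @lock = Mutex.new
  end

  # Return a live connection for `hosts`. A cached connection is re-used only
  # if it still reports itself as connected; otherwise it is closed,
  # discarded, and replaced with a fresh one from the factory.
  def fetch(hosts)
    @lock.synchronize do
      conn = @pool[hosts]
      unless conn && conn.connected?
        begin
          conn.close if conn
        rescue StandardError
          # Closing a dead connection may itself fail; we are discarding it anyway.
        end
        conn = @pool[hosts] = @factory.call(hosts)
      end
      conn
    end
  end
end

# Usage: six watchers would all call pool.fetch with the same host string.
counter = 0
pool = ConnectionPool.new { |_hosts| FakeConnection.new(counter += 1) }
first = pool.fetch("zk1:2181")
first.drop!                      # connectivity is lost
second = pool.fetch("zk1:2181")  # dead connection is replaced, not re-used
```

If the cluster is genuinely unreachable, the factory call would keep failing, so this check alone does not remove the retry loop mentioned above; it only prevents a permanently dead cached connection from being handed back after connectivity returns.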

Closed because this was fixed in #113.