airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse

nerve can silently stop watching a service!

memory opened this issue

Discovered while doing a general browse of some logfiles:

I, [2014-01-03T20:29:35.954755 #2900]  INFO -- Nerve::Nerve: nerve: starting up!
I, [2014-01-03T20:29:35.954903 #2900]  INFO -- Nerve::Nerve: nerve: starting run
I, [2014-01-03T20:29:35.955280 #2900]  INFO -- Nerve::ServiceWatcher: nerve: starting service watch ministry-worker
I, [2014-01-03T20:29:35.955376 #2900]  INFO -- Nerve::Reporter: nerve: waiting to connect to zookeeper at 1.zookeeper.qa1.climate.net:2181,0.zookeeper.qa1.climate.net:2181,2.zookeeper.qa1.climate.net:2181/nerve/services/qa1/ministry-worker/default
I, [2014-01-03T20:29:36.015104 #2900]  INFO -- Nerve::Reporter: nerve: successfully created zk connection to 1.zookeeper.qa1.climate.net:2181,0.zookeeper.qa1.climate.net:2181,2.zookeeper.qa1.climate.net:2181/nerve/services/qa1/ministry-worker/default
I, [2014-01-03T20:29:36.020927 #2900]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-i-f76498d9.us-east-1.qa1.climate.net:13050/ initial check returned false
I, [2014-01-03T20:30:08.431404 #2900]  INFO -- Nerve::ServiceCheck::HttpServiceCheck: nerve: service check http-i-f76498d9.us-east-1.qa1.climate.net:13050/ transitions to up after 3 successes
I, [2014-01-03T20:30:08.439403 #2900]  INFO -- Nerve::ServiceWatcher: nerve: service ministry-worker is now up
E, [2014-01-03T20:44:13.056572 #2900] ERROR -- Nerve::ServiceWatcher: nerve: error in service watcher ministry-worker: response for meth: :exists, args: [425, "/", nil, nil], not received within 30 seconds
I, [2014-01-03T20:44:13.056680 #2900]  INFO -- Nerve::ServiceWatcher: nerve: ending service watch ministry-worker

...and then nerve apparently stops probing the service state for as long as the nerve process remains up. Digging into this, the issue seems to be some sort of hard-coded timeout in the zookeeper gem, in continuation.rb:

      @mutex.synchronize do
        while true
          now = Time.now
          break if @rval or @error or (now > time_to_stop)

          deadline = time_to_stop.to_f - now.to_f
          @cond.wait(deadline)
        end

        if (now > time_to_stop) and !@rval and !@error
          raise Exceptions::ContinuationTimeoutError, "response for meth: #{meth.inspect}, args: #{@args.inspect}, not received within #{OPERATION_TIMEOUT} seconds"
        end

        case @error
        when nil
          # ok, nothing to see here, carry on
        when :shutdown
          raise Exceptions::NotConnected, "the connection is shutting down"
        when ZOO_EXPIRED_SESSION_STATE
          raise Exceptions::SessionExpired, "connection has expired"
        else
          raise Exceptions::NotConnected, "connection state is #{STATE_NAMES[@error]}"
        end

        case @rval.length
        when 1
          return @rval.first
        else
          return @rval
        end
      end
    end

The good news, such as it is, is that if this happens while the node is healthy, nerve will "fail open" -- the node will continue to be registered in ZK, and synapse clients will still see it and route traffic to it if it passes /ping. The bad news is that if this happens while the node is unhealthy, it will stay that way forever from synapse's perspective, until the nerve process is manually restarted.

I am playing with a code change that simply has the nerve process abort if any of the service watchers die in this fashion (we run nerve under supervise here, so it would be restarted automatically in that case), but I'd like some feedback on what you think the proper way to handle this is before I submit a pull request.
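For concreteness, here is a minimal sketch of that idea. It is only illustrative: the class name, thread handling, and logger wiring are placeholders of mine, not nerve's actual internals.

    # Hypothetical sketch only: WatcherSupervisorLoop and its wiring are
    # placeholders, not nerve's real classes.
    require 'logger'

    class WatcherSupervisorLoop
      def initialize(watchers, logger: Logger.new($stderr))
        @watchers = watchers   # objects responding to #run, one per service
        @log = logger
      end

      def run
        # run each service watcher in its own thread
        threads = @watchers.map { |w| Thread.new { w.run } }

        loop do
          # If any watcher thread has exited (for example after an unrescued
          # ContinuationTimeoutError), abort the whole process so an external
          # supervisor (supervise / runit) restarts nerve from scratch.
          unless threads.all?(&:alive?)
            @log.error('nerve: a service watcher died; exiting so the supervisor restarts us')
            exit!(1)
          end
          sleep 5
        end
      end
    end

Exiting hard and leaning on the supervisor keeps the failure visible, rather than leaving a silently dead watcher behind.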

See #37

This can be closed now, yeah?

Correct, this appears fixed.