airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse

Nerve fails to restart on watcher failure

Jaykah opened this issue

I saw a similar issue reported elsewhere, but since its fix has apparently been merged already, I decided to open a new one.

I am using a simple MySQL check to register the members of a Galera Cluster.

I, [2014-08-30T09:47:26.757807 #40790]  INFO -- Nerve::Reporter::Zookeeper: nerve: successfully created zk connection to x.example.com:2181,x2.example.com:2181,x3.example.com:2181/services/database
I, [2014-08-30T09:47:26.776437 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 initial check returned true
I, [2014-08-30T09:47:26.803240 #40790]  INFO -- Nerve::ServiceWatcher: nerve: service db is now up
I, [2014-08-30T13:58:51.491719 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>
I, [2014-08-30T14:00:08.381207 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>
I, [2014-08-30T17:22:00.684380 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>

After this, the checks stop, and although the node has since recovered, it is not re-registered in ZooKeeper.

Hi Guys,

I have a similar problem.

Nerve works very well at unregistering an instance that has problems (based on health/ping checks), but when the same instance comes back up, Nerve doesn't re-register it in ZK.

If I force a restart of Nerve, everything works perfectly, but that is not an elegant way to fix the problem.

This should be fixed by 86aa804 and ab1388a. With those changes, the nerve process watches the reporters and forces them to start again if they exit with an error, and after a ZooKeeper session expiry we recreate the ephemeral nodes as soon as we can re-establish the connection.

Please let me know if you are still seeing this issue, and we can re-open and dive into it more.
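
For readers who want a picture of the "watch and restart" half of that fix, here is a minimal sketch of the general pattern in Python. It is not Nerve's actual Ruby code: `Watcher`, `supervise`, and the `factories` mapping are hypothetical names made up for illustration, and the ZooKeeper session-expiry handling is not covered at all.

```python
import threading


class Watcher(threading.Thread):
    """Hypothetical stand-in for a watcher thread; not Nerve's real class."""

    def __init__(self, name, work):
        super().__init__(name=name, daemon=True)
        self.work = work
        self.error = None

    def run(self):
        try:
            self.work()
        except Exception as exc:
            # Remember why the thread died so the supervisor can react.
            self.error = exc


def supervise(factories, rounds, interval=0.1):
    """Start one watcher per factory and restart any that died with an error.

    `factories` maps a name to a zero-argument callable returning a new,
    unstarted Watcher. Returns the number of restarts per name.
    """
    watchers = {name: make() for name, make in factories.items()}
    restarts = {name: 0 for name in factories}
    for watcher in watchers.values():
        watcher.start()

    for _ in range(rounds):
        for name, watcher in list(watchers.items()):
            # Wait up to `interval` seconds for this watcher to finish.
            watcher.join(timeout=interval)
            if not watcher.is_alive() and watcher.error is not None:
                # A finished thread cannot be started again, so build a new one.
                watchers[name] = factories[name]()
                watchers[name].start()
                restarts[name] += 1
    return restarts
```

In this sketch a watcher that returns cleanly is left alone, while one that raised is replaced with a fresh thread. That matches the behaviour described above only in outline; a real supervisor would loop indefinitely and log each restart.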