soabase / exhibitor

ZooKeeper co-process for instance monitoring, backup/recovery, cleanup and visualization.

Home Page: https://groups.google.com/forum/#!topic/exhibitor-users/PVkcd88mk8c

Hanged Zookeeper process not restarted/killed

erikflinck opened this issue · comments

Hi, it seems that ZooKeeper processes that are hung and consuming a lot of resources do not get handled by Exhibitor. For some reason jps does not find the process ID. I tried running jps

I'm running version 1.5.6

Exhibitor just keeps going through this cycle and never succeeds:

Mar 29 12:25:09 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog Cleanup task completed [pool-3-thread-6346]
Mar 29 12:25:27 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper down/not-serving waiting 30102 of 40000 ms before restarting [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog Restarting down/not-serving ZooKeeper after 60307 ms pause [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to stop instance [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog Attempting to start/restart ZooKeeper [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog jps didn't find instance - assuming ZK is not running [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog Process started via: /opt/zookeeper/bin/zkServer.sh [ActivityQueue-0]
Mar 29 12:25:57 ip-172-26-108-86 sh: ERROR com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: JMX enabled by default [pool-3-thread-6346]
Mar 29 12:25:57 ip-172-26-108-86 sh: ERROR com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: Using config: /opt/zookeeper/bin/../conf/zoo.cfg [pool-3-thread-6346]
Mar 29 12:25:57 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper Server: Starting zookeeper ... already running as process 24274. [pool-3-thread-6347]
Mar 29 12:26:27 ip-172-26-108-86 sh: INFO com.netflix.exhibitor.core.activity.ActivityLog ZooKeeper down/not-serving waiting 30022 of 40000 ms before restarting [ActivityQueue-0]

I'm going to close the issue for now, but if you still have this problem with the newest version, please comment or re-open.

If the problem persists, I need the following info:

  1. What kind of setup do you have for ZooKeeper?
  2. Can you show the jps output when you run it manually?

For the record, Exhibitor tries to find the pid by scanning the output of jps line by line for QuorumPeerMain. If no line matches, it assumes ZK is down.
Relevant code from dev:
https://github.com/soabase/exhibitor/blob/master/exhibitor-core/src/main/java/com/netflix/exhibitor/core/processes/StandardProcessOperations.java#L178
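
To make that lookup concrete, here is a rough sketch of the idea in Python. It is not Exhibitor's actual code (that is the Java file linked above); the function name `find_zookeeper_pid` and the sample jps output are made up for illustration, and it assumes the default jps format of one `<pid> <main class>` pair per line.

```python
from typing import Optional


def find_zookeeper_pid(jps_output: str, main_class: str = "QuorumPeerMain") -> Optional[str]:
    """Return the pid of the first jps line mentioning main_class, else None.

    Hypothetical helper for illustration only; it mirrors the
    "scan jps output line by line" behaviour described above.
    """
    for line in jps_output.splitlines():
        parts = line.split()
        # A usable line has at least a pid and a class name.
        if len(parts) >= 2 and main_class in line:
            return parts[0]
    # No matching line: the caller would treat ZooKeeper as not running.
    return None


# Made-up sample output: ZooKeeper is visible to jps.
visible = "24274 QuorumPeerMain\n31001 Jps\n"
# Made-up sample output: jps lists only itself.
hidden = "31001 Jps\n"

print(find_zookeeper_pid(visible))
print(find_zookeeper_pid(hidden))
```

In the second case the lookup returns nothing, which would correspond to the "jps didn't find instance - assuming ZK is not running" line in the log, even though zkServer.sh still reports the process as running.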

I'm facing the same behavior with exhibitor-1.7.1 and zookeeper-3.4.12. Any clue?