mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules

Home Page: http://mesos.github.io/chronos/

zk auth / framework registration race condition

vixns opened this issue

We are using Chronos (master branch) on several clusters. On one of them, we run ZooKeeper 3.5 with authentication, and Mesos 1.6.2 with SSL and authentication.
We hit a race condition at startup: if the ZooKeeper auth handshake has not finished, Chronos does not even try to register with Mesos.
We first noticed this while testing through a VPN with ~30ms latency between ZooKeeper, Mesos and Chronos, but we also hit it quite often at lower latencies.
A workaround while testing through the VPN was to set a breakpoint at https://github.com/mesos/chronos/blob/master/src/main/scala/org/apache/mesos/chronos/scheduler/jobs/JobScheduler.scala#L522, wait a few seconds at startup for the ZooKeeper authentication to complete, then resume. This works 100% of the time.

How should this be fixed "the right way"? Should Chronos wait for the ZooKeeper handshake to complete before starting the election, restart the election on ZooKeeper (re)authentication, or something else?
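
To make the first option concrete, here is a minimal sketch of gating the election on auth completion. It uses only the JVM standard library. `ZkAuthGate`, `markAuthenticated`, `awaitAuthenticated` and `ElectionStarter.startWhenReady` are hypothetical names that do not exist in Chronos, and the sketch assumes some callback (for example a ZooKeeper watcher or a connection state listener) can tell us when the handshake is done.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical gate, not part of Chronos: it stays closed until something
// reports that the ZooKeeper auth handshake has completed.
final class ZkAuthGate {
  private val authenticated = new CountDownLatch(1)

  // Assumption: called from whatever callback signals a finished handshake.
  def markAuthenticated(): Unit = authenticated.countDown()

  // Blocks up to timeoutMs; returns true if the handshake completed in time.
  def awaitAuthenticated(timeoutMs: Long): Boolean =
    authenticated.await(timeoutMs, TimeUnit.MILLISECONDS)
}

object ElectionStarter {
  // Runs startElection only if the gate opens within the timeout.
  // Returns whether the election was actually started.
  def startWhenReady(gate: ZkAuthGate, timeoutMs: Long)(startElection: () => Unit): Boolean = {
    val ready = gate.awaitAuthenticated(timeoutMs)
    if (ready) startElection()
    ready
  }
}
```

This replaces the manual "breakpoint and wait" with an explicit wait, but it does not cover the second option: if the session re-authenticates later, the election would still need to be restarted separately.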