zk auth / framework registration race condition
vixns opened this issue · comments
We are using Chronos (master branch) on several clusters. On one of them, we run ZooKeeper 3.5 with authentication and Mesos 1.6.2 with SSL and authentication.
We hit a race condition at startup: if the ZooKeeper auth handshake has not completed yet, Chronos does not even try to register with Mesos.
We first noticed this while testing through a VPN with ~30ms latency between ZooKeeper, Mesos and Chronos, but we also hit it quite often at lower latencies.
A workaround while testing through the VPN was to add a breakpoint at https://github.com/mesos/chronos/blob/master/src/main/scala/org/apache/mesos/chronos/scheduler/jobs/JobScheduler.scala#L522, wait a few seconds on start for the ZooKeeper auth to complete, then resume. This works 100% of the time.
What would be the right way to fix this? (Wait for the ZooKeeper handshake to complete before starting the election? Restart the election on ZooKeeper (re)auth? Something else?)
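To make the first option concrete, here is a minimal sketch of gating the election on the handshake, using only the Java standard library. It is not Chronos code: `HandshakeGate`, `markReady` and `startWhenReady` are hypothetical names, and the assumption is that some ZooKeeper connection-state callback (for example one fired on a connected or authenticated event) can call `markReady()` once the handshake is done.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: delays leader election until the ZooKeeper
// session is reported as connected and authenticated.
public class HandshakeGate {
    private final CountDownLatch ready = new CountDownLatch(1);

    // Assumed to be called from a ZooKeeper connection-state callback
    // once the auth handshake has completed. Safe to call repeatedly.
    public void markReady() {
        ready.countDown();
    }

    public boolean isReady() {
        return ready.getCount() == 0;
    }

    // Waits up to the given timeout for the handshake, then runs the
    // election start action. Returns false (and does not run the action)
    // if the handshake did not complete in time or the wait was interrupted.
    public boolean startWhenReady(long timeout, TimeUnit unit, Runnable startElection) {
        try {
            if (!ready.await(timeout, unit)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
        startElection.run();
        return true;
    }
}
```

This mirrors what the breakpoint workaround does by hand: the election start is simply held back until the handshake is over. It does not cover the second option (restarting the election after a re-auth), which would need the callback to trigger a new election rather than release a one-shot latch.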