kurtome / python-consul-lock

Simple client for distributed locking built on top of python-consul.

High volume / low latency locking is unstable

kurtome opened this issue

Let me preface this issue by saying it could be completely dependent on our particular installation and usage pattern. We had 3 Consul masters and a few dozen Consul agents, all running Consul v0.5.2, and were using this library to acquire dozens of locks a minute, each held for under 2 seconds.

The main issue we were seeing was leadership handoff between the master nodes about every 2 hours, sometimes more frequently. For about 1 second during each handoff there was no leader, so any attempt to acquire a lock failed instantly, causing a short loss of functionality for our application.
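For illustration, here is a minimal sketch of how a caller might ride out a leaderless window of about a second by retrying with backoff. It is not part of python-consul-lock: `LockUnavailable`, `acquire_with_retry`, and the `try_acquire` callable are hypothetical names standing in for whatever the application uses to attempt a lock.

```python
import time


class LockUnavailable(Exception):
    """Hypothetical error for a lock attempt that fails outright,
    e.g. because the cluster currently has no leader."""


def acquire_with_retry(try_acquire, attempts=5, initial_delay=0.2,
                       backoff=2.0, sleep=time.sleep):
    """Call try_acquire() until it succeeds or attempts run out.

    try_acquire is any zero-argument callable (an assumption of this
    sketch) that returns a lock handle or raises LockUnavailable.
    The delay between attempts grows by the backoff factor each time.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return try_acquire()
        except LockUnavailable:
            if attempt == attempts:
                raise  # out of attempts; let the caller decide
            sleep(delay)
            delay *= backoff
```

With these defaults the caller waits up to about 3 seconds in total before giving up, which would cover a 1 second gap, at the cost of added latency on every failed attempt.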

Reading more details from the Raft paper and the Chubby lock service paper, it seems to me that short-lived locks are not the intended use case of Consul's locking API. Instead, it seems much more useful for leader-election-style locks, where one master controls the resource that is being locked.

For simplicity we wound up using our existing Zookeeper cluster and the Kazoo library for locking, since this seemed to be more stable and required no additional maintenance on our part.

@kurtome Consul locking really isn't intended for short-lived locks like this. That said, you shouldn't be experiencing such a high level of leadership churn either, and even a few dozen requests per second shouldn't have been an issue. It would be great if you could provide more details on the deployment and issues as a ticket against Consul so that we can investigate.

Also worth noting: other Consul users have had issues with frequent Consul leader elections, and it seems many performance improvements are on the roadmap for 0.6.0: https://groups.google.com/forum/#!topic/consul-tool/Yp-j7bZYkmI

@armon thanks for the confirmation!

We're not planning to dig into our Consul issues any further right now, since it's working well enough for our other needs (it's also possible the increased load from our usage of this library contributed to the problem), but if we need more I'll open a ticket in the Consul project.