Is this good for locks for mutual exclusion?

Question

lovebes opened this issue 5 years ago

I was wondering if you could share your input on how you would recommend using this solution. Would you recommend it for a scaled, multi-replica service architecture?
Here's how we use it, in Kubernetes terms:
multiple pods => Redis-based locking via this repo to access a resource. However, each endpoint-triggered call might take close to 10 seconds. We use the repo's default retry delay (100 ms).

The calls use a lot of resources, and sometimes they trigger a long garbage collection. We use the standard Go GC behavior. We also use the default clock behavior in a Linux environment, so there's no guarantee that the clock never jumps or is always accurate. These endpoints receive an average of 30 calls/sec.

We are seeing that only some of the calls successfully obtain the lock. That number doesn't change, no matter the load we apply during load testing.

I researched around, and found the following posts around Redis-based locking mechanisms.

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
http://antirez.com/news/77

It seems the Redlock algorithm was built to make Redis-based locking more reliable, after observing that solutions like redis-lock (this repo) were the norm. However, Martin Kleppmann flat-out disagrees and states that Redis-based locking is fundamentally flawed.

I'd like to ask your opinion on this, and perhaps some advice on how to better use your repo.

Thank you!

Dimitrij Denissenko · Answer 1 · Fri Mar 01 2019 19:14:07 GMT+0800 (China Standard Time)

@lovebes thanks for the detailed post but, unfortunately, I have no strong opinions here. We use redis-lock in a scenario where we need to elect masters among our worker processes for jobs that must run exclusively. We have never experienced any problems with it, and our monitors suggest that, over the past few years, all exclusive jobs have actually run exclusively; we cannot find a job that was run twice by accident. That said, a duplicate run wouldn't be the end of the world for us. If it were a critical requirement, I would probably spend more time researching, which is what you seem to be doing.

Salvatore has replied to Martin Kleppmann's criticism in his own blog post: http://antirez.com/news/101, but as I said, I haven't been following it too closely.

One thing you may also want to look into is https://www.consul.io/docs/guides/semaphore.html