Review Locking in SimpleCache
sschepens opened this issue · comments
SimpleCache
uses a ReentrantReadWriteLock
for guarding Snapshot map and StatusInfo access, in a scenario where a server has thousands of clients and updates being made, this proves a bottleneck.
We measured our setSnapshot time to sometimes take up to 300ms, this could be, because other createWatches
are locking and preventing this to happen, or this could be real time taken because we do have about 500k streams per control plane server. If the latter is true, no watches can be created for 300ms because the write lock is taken, this is definitely not good.
I imagine the most needed synchronisation need is to prevent watches from being created while a new snapshot is being set and being left with an old version. Though the current locking scheme is much more restrictive, one cannot even create two watches concurrently.
We should take a look at this and try to handle it a better way.
Some ideas on how to improve this:
-
CacheStatusInfo
is internally guarded by a lock that doesn't seem very necessary, we should drop it and use a ConcurrentMap. -
SimpleCache
is the most tricky, but we could:- Use a
readLock
to guardcreateWatch
since we don't really mind watches being created concurrently, this is kind of counter intuitive since we're using a read lock but actually writing data, but i've seen this kind of use before. Then usewriteLock
only insetSnapshot
. This strategy would improve things a bit, but it doesn't really solve our problem. - We could probably readLock entire
createWatch
butwriteLock
onlysnapshots.put
onsetSnapshot
. I have to think about the implications of this a little more, but it would seem that if we do that, there would be no conflicts sincecreateWatch
would always have a consistent view ofsnapshots
. Currently creating watches would see and old version of snapshot but manage to create it's watches beforesetSnapshot
can lock and insert the new snapshot, after that, all new watches actually see that version and should have no problemas, whilesetSnapshot
continues to iterate through watches and responding accordingly. - We could almost remove all locks and provide another way to ensure that watches are being responded to correctly when new snapshots are being created concurrently. Maybe make
createWatch
double check if a new snapshot has been created.
- Use a
More ideas are welcome, but it would seem this locking mechanism should be reworked to allow better performance, we're not even stressing our cpus right now.
Edit: Another idea would be to not drop watches everytime we respond to them, this would decrease the lock contention a lot, has anyone thought about this before?
PR is merged so I'm closing this, please reopen if I'm missing something.