envoyproxy / java-control-plane

Java implementation of an Envoy gRPC control plane

Review Locking in SimpleCache

sschepens opened this issue · comments

SimpleCache uses a ReentrantReadWriteLock to guard access to the snapshot map and StatusInfo. In a scenario where a server has thousands of clients and frequent updates, this proves to be a bottleneck.

We measured our setSnapshot time to sometimes take up to 300ms. This could be because concurrent createWatch calls hold the lock and block it, or it could be real time spent in setSnapshot itself, since we have about 500k streams per control-plane server. If the latter is true, no watches can be created for 300ms while the write lock is held, which is definitely not good.

I imagine the main synchronization requirement is to prevent watches from being created against an old version while a new snapshot is being set. The current locking scheme is much more restrictive than that, though: one cannot even create two watches concurrently.

We should take a look at this and try to handle it a better way.

Some ideas on how to improve this:

  • CacheStatusInfo is internally guarded by a lock that doesn't seem necessary; we should drop it and use a ConcurrentMap instead.

  • SimpleCache is the most tricky, but we could:

    • Use a read lock to guard createWatch, since we don't really mind watches being created concurrently. This is somewhat counterintuitive, since we'd be using a read lock while actually writing data, but I've seen this kind of use before. Then take the write lock only in setSnapshot. This strategy would improve things a bit, but it doesn't really solve our problem.
    • We could probably hold the read lock for all of createWatch but take the write lock only around snapshots.put in setSnapshot. I have to think about the implications of this a little more, but it seems there would be no conflicts, since createWatch would always have a consistent view of snapshots. A watch created concurrently would either see the old version of the snapshot and manage to create its watches before setSnapshot can lock and insert the new one, or, after the swap, see the new version and have no problems, while setSnapshot continues to iterate through watches and respond accordingly.
    • We could remove almost all locks and provide another way to ensure that watches are responded to correctly when new snapshots are created concurrently, e.g. make createWatch double-check whether a new snapshot has been created.
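
To make the second idea above concrete, here's a minimal sketch of the inverted read/write-lock pattern combined with a ConcurrentMap for statuses. All names here (SketchCache, String snapshots standing in for the real Snapshot and watch types) are hypothetical simplifications, not the actual SimpleCache API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical cache illustrating the idea: many concurrent createWatch
// calls share the read lock, while setSnapshot takes the write lock only
// for the brief moment it swaps in the new snapshot.
class SketchCache {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    // ConcurrentMap instead of a lock-guarded map for per-node status.
    private final Map<String, String> statuses = new ConcurrentHashMap<>();
    private volatile String snapshot = "v0";

    // Watches may be created concurrently; the read lock only excludes
    // the snapshot swap, not other createWatch callers.
    public String createWatch(String node) {
        lock.readLock().lock();
        try {
            String current = snapshot;
            statuses.put(node, current); // record what this node last saw
            return current;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void setSnapshot(String newSnapshot) {
        lock.writeLock().lock();
        try {
            snapshot = newSnapshot; // hold the write lock only for the swap
        } finally {
            lock.writeLock().unlock();
        }
        // Responding to existing watches can then happen outside the
        // write lock, so createWatch is never blocked for 300ms.
    }

    public String snapshot() {
        return snapshot;
    }
}
```

The point of the sketch is that a watch created under the read lock either observes the old snapshot entirely before the swap, or the new one entirely after it; it can never observe a half-installed snapshot.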

More ideas are welcome, but it seems this locking mechanism should be reworked for better performance; we're not even stressing our CPUs right now.

Edit: another idea would be to not drop watches every time we respond to them; this would decrease the lock contention a lot. Has anyone thought about this before?
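
A rough sketch of what long-lived watches could look like. Again, the names (PersistentWatches, String snapshots, Consumer as the watch callback) are hypothetical stand-ins, not the real types:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch: instead of removing a watch after each response
// (which forces every client to re-register under the lock), the watch
// stays registered and receives every subsequent snapshot.
class PersistentWatches {
    private final Set<Consumer<String>> watches = ConcurrentHashMap.newKeySet();

    public void addWatch(Consumer<String> watch) {
        watches.add(watch);
    }

    public void publish(String snapshot) {
        // No removal here: each registered watch sees every update, so
        // there is no churn on the watch set between snapshots and far
        // less contention on whatever guards registration.
        for (Consumer<String> w : watches) {
            w.accept(snapshot);
        }
    }
}
```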

PR is merged so I'm closing this, please reopen if I'm missing something.