Add a mechanism to automatically take Infinispan backup sites offline on split-brain
ryanemerson opened this issue · comments
Description
Problem
If an exception is thrown on xsite replication errors, as introduced by #531, a Keycloak cluster can no longer make progress once a split-brain has occurred, because an exception will be thrown on every Infinispan cache operation until the Infinispan backup site has been explicitly taken offline.
Infinispan provides an automatic take-offline configuration for individual caches, however this is problematic in practice as:
- A site is only taken offline after a configured number of failures and/or min-wait time has passed between exceptions
- The offline status is updated per Infinispan node only when an entry hosted by that node is affected by a failure. There's currently no mechanism to update the status across all nodes for a given cache.
Given that Keycloak uses a number of different caches and each cache is associated with specific Keycloak functionality, a significant period of time could pass after a split-brain occurs before all caches are taken offline.
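To make the per-node, per-cache behaviour concrete, here is a toy model of the take-offline bookkeeping. It is not Infinispan's implementation: the `TakeOfflineTracker` class, the node names and the threshold values are all hypothetical, and the model assumes a site is taken offline only once both the failure count and the minimum wait have been reached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TakeOfflineTracker:
    """Toy stand-in for the take-offline state kept by ONE node for ONE cache."""
    after_failures: int          # failures required before going offline (assumed semantics)
    min_wait: float              # seconds that must elapse since the first failure
    failures: int = 0
    first_failure_at: Optional[float] = None
    offline: bool = False

    def record_failure(self, now: float) -> None:
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failures += 1
        waited = now - self.first_failure_at
        # Assumption: both conditions must hold before the site is marked offline.
        if self.failures >= self.after_failures and waited >= self.min_wait:
            self.offline = True


# Hypothetical two-node cluster with two caches: one tracker per (node, cache) pair.
nodes = ["node-0", "node-1"]
caches = ["sessions", "work"]
trackers = {
    (node, cache): TakeOfflineTracker(after_failures=3, min_wait=10.0)
    for node in nodes
    for cache in caches
}

# Only entries of the "sessions" cache owned by node-0 are written during the outage.
for timestamp in (0.0, 5.0, 12.0):
    trackers[("node-0", "sessions")].record_failure(timestamp)

still_online = sorted(key for key, t in trackers.items() if not t.offline)
```

In this model only the `("node-0", "sessions")` tracker goes offline. The other three pairs in `still_online` keep trying to replicate, and keep failing, until they accumulate enough failures of their own.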
Solution
Instead of relying on Infinispan's take-offline capabilities, we can enhance the STONITH Lambda required by multi-site deployments so that it explicitly takes all caches offline when a split-brain is detected. The advantage of this approach is that it ensures backup sites are never taken offline until after the Global Accelerator has been updated.
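As a rough sketch of what the Lambda's extra step could look like, the snippet below issues one take-offline request per cache against the Infinispan REST API (`POST /rest/v2/caches/{cacheName}/x-site/backups/{siteName}?action=take-offline`). The cache list, the function names and the injected `post` callable are assumptions for illustration. Authentication, TLS settings and the Global Accelerator update that must happen first are deliberately left out.

```python
import urllib.request
from typing import Callable, Dict, Iterable
from urllib.parse import quote

# Hypothetical cache list; the caches replicated cross-site depend on the deployment.
CACHES = ["sessions", "clientSessions", "authenticationSessions", "work"]


def take_offline_url(base_url: str, cache: str, site: str) -> str:
    """Build the REST URL that takes backup site `site` offline for one cache."""
    return (
        f"{base_url.rstrip('/')}/rest/v2/caches/{quote(cache, safe='')}"
        f"/x-site/backups/{quote(site, safe='')}?action=take-offline"
    )


def http_post(url: str) -> int:
    """Send an empty-bodied POST and return the HTTP status (no auth in this sketch)."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def take_site_offline(
    base_url: str,
    site: str,
    caches: Iterable[str] = CACHES,
    post: Callable[[str], int] = http_post,
) -> Dict[str, str]:
    """Try every cache and return the ones that could not be taken offline."""
    failed: Dict[str, str] = {}
    for cache in caches:
        try:
            status = post(take_offline_url(base_url, cache, site))
        except OSError as exc:  # urllib's URLError/HTTPError derive from OSError
            failed[cache] = str(exc)
            continue
        if not 200 <= status < 300:
            failed[cache] = f"HTTP {status}"
    return failed
```

The loop continues past individual failures so that one unreachable cache does not leave the rest online. The caller can retry or raise an alert for whatever is returned in the `failed` mapping.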
Thank you for the write-up. I'd suggest to add it more explicitly that the automatic take-offline is per-node AND per-cache, which shows the gravity of the problem.
Once we implement this and switch to FAIL, the lambda (or something equivalent in other setups) would be a necessary requirement in any setup, as a failover based just on the loadbalancer's capabilities wouldn't work. So the docs would need to be extended as a follow-up with "if you set it to FAIL, you will have to implement the functionality described in the Failover lambda or something equivalent". As not everyone might have those capabilities in their infrastructure (especially if they are not using AWS), this should be recommended, but optional.
cc: @ryanemerson, @kami619, @mhajas
> I'd suggest to add it more explicitly that the automatic take-offline is per-node AND per-cache, which shows the gravity of the problem.
I have updated the second bullet point to try to make this clearer.