Add a mechanism to automatically take Infinispan backup sites offline on split-brain

ryanemerson opened this issue 3 months ago

Description

Problem

If an exception is thrown on cross-site (xsite) replication errors, as introduced by #531, a Keycloak cluster can no longer make progress once a split-brain has occurred: every Infinispan cache operation throws an exception until the Infinispan backup site has been explicitly taken offline.

Infinispan provides an automatic take-offline configuration for individual caches; however, this is problematic in practice because:

- A site is only taken offline after a configured number of failures has occurred and/or a min-wait time has passed between exceptions.
- The offline status is tracked per cache and per Infinispan node, and it is updated on a node only when an entry hosted by that node is affected by a failure. There's currently no mechanism to update the status across all nodes for a given cache.

Given that Keycloak uses a number of different caches, each associated with specific Keycloak functionality, a significant period of time may pass after a split-brain occurs before all caches are taken offline on all nodes.
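
To illustrate why the automatic approach converges slowly, here is a toy model, not Infinispan code. It assumes each (node, cache) pair keeps its own failure counter and requires both the failure threshold and the min-wait to be met, which simplifies the "and/or" behaviour described above. The class name, node names, cache list and thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical cluster layout, for illustration only.
NODES = ["node-0", "node-1", "node-2"]
CACHES = ["sessions", "authenticationSessions", "actionTokens"]


@dataclass
class TakeOfflineTracker:
    """Toy take-offline state for a single (node, cache) pair."""
    after_failures: int                     # failures required before going offline
    min_wait_s: float                       # minimum time since the first failure
    failures: int = 0
    first_failure_at: Optional[float] = None
    offline: bool = False

    def record_failure(self, now: float) -> None:
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failures += 1
        waited = now - self.first_failure_at
        # Assumption: both conditions must hold before the site goes offline.
        if self.failures >= self.after_failures and waited >= self.min_wait_s:
            self.offline = True


def simulate(
    failed_ops: Iterable[Tuple[float, str, str]],
    after_failures: int = 3,
    min_wait_s: float = 10.0,
) -> Dict[Tuple[str, str], TakeOfflineTracker]:
    """Apply (timestamp, node, cache) failures; only the affected pair is updated."""
    trackers = {
        (node, cache): TakeOfflineTracker(after_failures, min_wait_s)
        for node in NODES
        for cache in CACHES
    }
    for now, node, cache in failed_ops:
        trackers[(node, cache)].record_failure(now)
    return trackers


if __name__ == "__main__":
    ops = [(0.0, "node-0", "sessions"), (5.0, "node-0", "sessions"), (12.0, "node-0", "sessions")]
    result = simulate(ops)
    print([pair for pair, tracker in result.items() if tracker.offline])
```

In this sketch, three failed writes to one cache on one node take only that single pair offline. The other eight pairs keep attempting cross-site replication until they accumulate their own failures.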

Solution

Instead of relying on Infinispan's take-offline capabilities, we can enhance the STONITH Lambda required by multi-site deployments so that it explicitly takes all caches offline when a split-brain is detected. The advantage of this approach is that it ensures backup sites are never taken offline until after the Global Accelerator has been updated.
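
A minimal sketch of how such a Lambda step could be ordered, not the actual implementation. It assumes the per-cache Infinispan REST endpoint `POST /rest/v2/caches/{cacheName}/x-site/backups/{siteName}?action=take-offline`, which should be verified against the REST documentation for the Infinispan version in use. The function names, the injected `update_accelerator` and `post` callables, and the host, cache and site names are hypothetical. Authentication and retries are omitted.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import quote


def take_offline_urls(base_url: str, cache_names: Iterable[str], site: str) -> List[str]:
    """Build one take-offline request URL per cache (endpoint path is an assumption)."""
    root = base_url.rstrip("/")
    return [
        f"{root}/rest/v2/caches/{quote(name, safe='')}"
        f"/x-site/backups/{quote(site, safe='')}?action=take-offline"
        for name in cache_names
    ]


def handle_split_brain(
    update_accelerator: Callable[[], None],
    base_url: str,
    cache_names: Iterable[str],
    site: str,
    post: Callable[[str], int],
) -> Dict[str, int]:
    """Update the Global Accelerator first, then take every cache's backup site offline.

    `post` sends an HTTP POST to the given URL and returns the status code;
    it is injected so the ordering can be exercised without a network.
    """
    update_accelerator()  # must complete before any site is taken offline
    statuses = {url: post(url) for url in take_offline_urls(base_url, cache_names, site)}
    failed = [url for url, status in statuses.items() if status >= 300]
    if failed:
        raise RuntimeError(f"take-offline failed for: {failed}")
    return statuses
```

Injecting the two callables keeps the ordering guarantee (accelerator first, caches second) in one place. It also means the step does not depend on any particular HTTP client or AWS SDK call.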

Alexander Schwartz · Answer 1 · Wed Aug 14 2024 23:16:22 GMT+0800 (China Standard Time)

Thank you for the write-up. I'd suggest to add it more explicitly that the automatic take-offline is per-node AND per-cache, which shows the gravity of the problem.

Once we implement this and switch to FAIL, the lambda (or something equivalent in other setups) would become a requirement in any setup, as a failover based only on the load balancer's capabilities wouldn't work. So the docs would need to be extended as a follow-up with "if you set it to FAIL, you will have to implement the functionality described in the Failover lambda or something equivalent". As not everyone might have those capabilities in their infrastructure (especially if they are not using AWS), setting it to FAIL should be recommended, but optional.

cc: @ryanemerson, @kami619, @mhajas

Ryan Emerson · Answer 2 · Thu Aug 15 2024 00:26:05 GMT+0800 (China Standard Time)

> I'd suggest to add it more explicitly that the automatic take-offline is per-node AND per-cache, which shows the gravity of the problem.

I have updated the second bullet point to try to make this clearer.