caio / foca

mirror of https://caio.co/de/foca/

Home Page:https://caio.co/de/foca/

Suspect members from saved state won't change state

jeromegn opened this issue

As an optimization, we apply the last known state of the cluster on startup, so a node learns about the whole cluster a lot faster when there are hundreds of nodes.

I noticed that if the last saved state was Suspect, and it was applied on start, those members would never go back to a non-suspect state.

As an experiment I filtered out all Suspect members when using apply_many on startup and it appeared to fix it. The members were discovered as Alive again.

Is this the way to prevent this behaviour or is it a bug?

commented

Keeping only Alive is cleaner.
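A minimal sketch of that filter, for illustration only. `State` and `SavedMember` here are stand-in types and `alive_only` is a hypothetical helper, not foca's own definitions; the assumption is that the saved state can be reduced to a list of (address, state) records before being handed to apply_many.

```rust
// Stand-in types for illustration; these are NOT foca's actual definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Alive,
    Suspect,
    Down,
}

// Hypothetical record of one member as it was written to the saved state.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SavedMember {
    addr: String,
    state: State,
}

/// Keep only the members whose last saved state was Alive.
/// Suspect and Down entries are dropped, so they get rediscovered
/// through the normal protocol instead of being trusted from disk.
fn alive_only(saved: Vec<SavedMember>) -> Vec<SavedMember> {
    saved
        .into_iter()
        .filter(|m| m.state == State::Alive)
        .collect()
}
```

The result of `alive_only` would then be what gets applied on startup, instead of the full saved list.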

When you load state from an external source the risk is that the state diverges and the updates get disseminated before there's a chance for the new node (with old state) to catch up.

So when you load a Suspect state, there's a higher chance that it has since transitioned to Down and you missed that update. You then end up thinking a Down node is alive until you eventually ping it, which may take a while given your cluster size and the fact that nobody else in the cluster thinks it is active.

It may also happen that, while the Suspect state remains, the node that initiated the suspicion went down (or was restarted), so it never ends up declaring the suspected node down. This is perfectly fine.

From a cluster membership perspective Suspect == Alive and users shouldn't be obsessing about the difference. What's important to know is whether the node is still actively participating (probing and being probed periodically, mostly). Foca could do a better job at exposing the precise cycle (I'm thinking new Notifications), but "are we sending and receiving data" is a good enough proxy.

Thanks, that makes sense. I think I'll add a few things to the logic for starting from a saved state, like: don't use an update older than n seconds.
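A rough sketch of what such an age check could look like. `Snapshot`, `saved_at`, and `usable_members` are hypothetical names, and the sketch assumes the saved state carries a timestamp of when it was written; nothing here is part of foca.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical on-disk snapshot: the member list plus the time it was saved.
struct Snapshot<M> {
    saved_at: SystemTime,
    members: Vec<M>,
}

/// Return the snapshot's members only if the snapshot is at most `max_age` old.
/// A stale snapshot, or one whose timestamp lies in the future (clock skew),
/// yields an empty list, so the node starts from scratch.
fn usable_members<M>(snapshot: Snapshot<M>, now: SystemTime, max_age: Duration) -> Vec<M> {
    match now.duration_since(snapshot.saved_at) {
        Ok(age) if age <= max_age => snapshot.members,
        _ => Vec::new(),
    }
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside the function, keeps the check easy to test.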