Do not change leader when doing reconfiguration
yacovm opened this issue · comments
As the number of nodes increases, f (the number of faulty nodes we tolerate) increases too, so the likelihood that some nodes are unreachable or crashed at any given time increases as well.
The leader ID is currently calculated as the ID of the node at position v % n in the node list, where v is the view ID and n is the node count.
Due to the way we calculate this leader ID, the leader ID might change when we do a reconfiguration that adds or removes nodes.
If the leader ID changes and the new leader happens to be offline or unreachable at that time, the reconfiguration itself can cause downtime (a view change).
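
To make the problem concrete, here is a minimal sketch of the current rule. The function name `leaderByView` and the representation of the node list as a `[]uint64` are assumptions made for illustration, not the actual code.

```go
package main

// leaderByView mirrors the rule described above: the leader is the node
// at position v % n in the node list. The name and signature are
// hypothetical. It assumes nodes is non-empty.
func leaderByView(view uint64, nodes []uint64) uint64 {
	return nodes[view%uint64(len(nodes))]
}
```

With view 5 and nodes `[1 2 3 4]`, the position is 5 % 4 = 1, so the leader is node 2. If a reconfiguration adds node 5 while the view stays at 5, the position becomes 5 % 5 = 0 and the leader silently becomes node 1, even though no view change took place.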
To circumvent this problem, I propose that we persist the leader ID in the view metadata, and:
- Increment this leader ID modulo n when we do a view change.
- Keep this leader ID unchanged when we do a reconfiguration, unless the leader itself is removed, in which case we just pick the next leader ID.
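
The proposal could be sketched roughly as below. Everything here is hypothetical: the `ViewMetadata` struct, the function names, and the node list as a `[]uint64` are placeholders, not the existing types. Two interpretations are also assumed: "increment modulo n" is read as "move to the next node in the list, wrapping around", and "the next leader ID" after a removal is read as "the first node after the old leader, in the old list order, that is still a member".

```go
package main

// ViewMetadata is a hypothetical stand-in for the persisted view metadata,
// extended with the leader ID as proposed.
type ViewMetadata struct {
	ViewID   uint64
	LeaderID uint64
}

// indexOf returns the position of id in nodes, or -1 if it is absent.
func indexOf(nodes []uint64, id uint64) int {
	for i, n := range nodes {
		if n == id {
			return i
		}
	}
	return -1
}

// onViewChange advances the view and moves the leader to the next node
// in the list, wrapping around. Assumes nodes is non-empty.
func onViewChange(md ViewMetadata, nodes []uint64) ViewMetadata {
	i := indexOf(nodes, md.LeaderID) // -1 falls through to position 0
	return ViewMetadata{
		ViewID:   md.ViewID + 1,
		LeaderID: nodes[(i+1)%len(nodes)],
	}
}

// onReconfig keeps the persisted leader if it is still a member.
// Otherwise it picks the first node following the old leader (in the old
// list order) that survived the reconfiguration. Assumes newNodes is
// non-empty. The view ID is left untouched in this sketch.
func onReconfig(md ViewMetadata, oldNodes, newNodes []uint64) ViewMetadata {
	if indexOf(newNodes, md.LeaderID) >= 0 {
		return md // leader survived: nothing changes
	}
	start := indexOf(oldNodes, md.LeaderID)
	for step := 1; step <= len(oldNodes); step++ {
		candidate := oldNodes[(start+step)%len(oldNodes)]
		if indexOf(newNodes, candidate) >= 0 {
			md.LeaderID = candidate
			return md
		}
	}
	// No old node survived: fall back to the first node of the new list.
	md.LeaderID = newNodes[0]
	return md
}
```

Under these assumptions, with leader 2 persisted and nodes `[1 2 3 4]`, adding node 5 leaves the leader at 2, whereas removing node 2 moves the leader to node 3. Whether a removal of the leader should also advance the view ID is left open here.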