etcd-io / raft

Raft library for maintaining a replicated state machine

Lagging follower commit after message drops

pav-kv opened this issue · comments

The test added in #137 demonstrates that a short network blip or message drop can delay a follower learning that entries in its log are committed by up to HeartbeatInterval + 3/2 * RTT. This is suboptimal: the delay can be reduced to HeartbeatInterval + 1/2 * RTT.

The extra roundtrip happens because the leader cuts the commit index sent to the follower at Progress.Match, i.e. it never sends a commit index exceeding the known match size of the follower's log. This ensures that the follower does not crash on receiving a commit index greater than its log size.
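A minimal sketch of the leader-side cut described above, and of what sending the uncut index would look like. This is not the library's code: `heartbeatCommit` and the `capAtMatch` flag are hypothetical names used only to contrast the two behaviours.

```go
package main

// heartbeatCommit is a hypothetical helper that picks the commit index a
// leader sends to one follower.
//
// With capAtMatch=true it mirrors the behaviour described in this issue:
// the index never exceeds the follower's known Match. With capAtMatch=false
// it sends the leader's commit index as is, which is what step (1) of the
// fix amounts to.
func heartbeatCommit(leaderCommit, followerMatch uint64, capAtMatch bool) uint64 {
	if capAtMatch && followerMatch < leaderCommit {
		return followerMatch
	}
	return leaderCommit
}
```

For example, with a leader commit index of 10 and a stale Match of 7, the capped variant sends 7, so the follower cannot learn about 8..10 from this message even if it already has those entries.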

However, the follower can safely process out-of-bounds commit indices by simply (with a caveat: see below) cutting them at the log size on its end. In fact, the follower already does so for the commit indices it receives via MsgApp messages. Correspondingly, the leader does not cut the commit index when sending MsgApp.
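The follower-side cut can be sketched as follows. The `followerLog` type and its fields are hypothetical stand-ins rather than the library's types, and the sketch deliberately ignores the caveat mentioned above: it is only safe when the follower's log is known to be a prefix of the leader's.

```go
package main

// followerLog is a hypothetical, simplified view of a follower's log.
type followerLog struct {
	lastIndex uint64 // index of the last entry in the log
	committed uint64 // highest index known to be committed
}

// advanceCommit cuts the leader's commit index at the local log size and
// moves the local commit index forward. It never moves it backwards.
func (l *followerLog) advanceCommit(leaderCommit uint64) {
	target := leaderCommit
	if target > l.lastIndex {
		target = l.lastIndex // out-of-bounds index: cut at the log size
	}
	if target > l.committed {
		l.committed = target
	}
}
```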

Fix

Fixing this is a matter of:

  1. changing one line at the leader send side, and
  2. at the follower receive side, ensuring that the follower's log is a prefix of the leader's.

(2) is true once the follower has received at least one successful MsgApp from this leader. From that point, the follower's log is guaranteed to be a prefix of the leader's log, and it is safe to assume that any index <= Commit is committed.
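One way to picture check (2) is a per-term flag on the follower. The sketch below is an assumption about how such a check could look, not the code from #139: the `follower` type, the `matchesLeader` field, and the handler names are all hypothetical, and an "accepted append" is taken to mean that the usual MsgApp consistency checks have already passed.

```go
package main

// follower is a hypothetical, simplified follower state.
type follower struct {
	term          uint64
	lastIndex     uint64
	committed     uint64
	matchesLeader bool // true once a MsgApp from the current-term leader was accepted
}

// observeTerm resets the flag when a newer term (hence a new leader) is seen.
func (f *follower) observeTerm(term uint64) {
	if term > f.term {
		f.term = term
		f.matchesLeader = false
	}
}

// advanceCommit cuts the commit index at the log size and never regresses.
func (f *follower) advanceCommit(leaderCommit uint64) {
	target := leaderCommit
	if target > f.lastIndex {
		target = f.lastIndex
	}
	if target > f.committed {
		f.committed = target
	}
}

// handleAcceptedAppend models a MsgApp that passed the consistency checks.
// From here on, the log is assumed to be a prefix of this leader's log.
func (f *follower) handleAcceptedAppend(term, newLastIndex, leaderCommit uint64) {
	f.observeTerm(term)
	if term < f.term {
		return // message from a stale leader
	}
	f.lastIndex = newLastIndex
	f.matchesLeader = true
	f.advanceCommit(leaderCommit)
}

// handleHeartbeat uses the commit index only if the prefix property is known.
func (f *follower) handleHeartbeat(term, leaderCommit uint64) {
	f.observeTerm(term)
	if term < f.term || !f.matchesLeader {
		return
	}
	f.advanceCommit(leaderCommit)
}
```

In this sketch a heartbeat from a new leader carrying a high commit index is ignored until the first accepted append in that term; after that, the follower can take any commit index from the leader and cut it at its own log size.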

Issue

However, simply doing (1) is unsafe in mixed-version clusters (during a rolling upgrade). If there are followers running old code, they will crash upon seeing a high commit index. To fix the problem completely, there must be a safe migration: it is safe to do (1) only after all the nodes are running the new code.

Action Items

  • Merge the check (2) for the follower-side: #139
  • Make sure the clusters are running the new code. For example, wait a couple of releases, and note in the release notes that upgrades should never jump by 2 or more versions at once.
  • Merge the part (1) and close this issue: #140

Alternative Fix

The same delay reduction can be achieved if the leader sends an empty MsgApp after the HeartbeatInterval. We already have periodic empty MsgApp sends in a throttled state, although they are currently hooked into MsgHeartbeatResp messages (to ensure the MsgApp flow is conditional on some connectivity with the follower), which in this case will result in the same +1 RTT delay.

Isn't this check in place to ensure that followers' logs don't regress? Do we have any other safeguards for that? We don't check that during MsgApp because the follower's log hasn't yet reached the commit index, but once a follower's log has reached the commit index (as recorded via Match) it should not regress because it may be a necessary part of the quorum establishing that commit index.

Isn't this check in place to ensure that followers' logs don't regress?

Yes. Currently, we rely on the fact that the follower's log is in sync with the leader's up to Match, so we can commit up to that. If we send a Commit index > Match, there is a chance that by the time the message arrives the follower either doesn't have this index, or still has some entries from previous terms that are inconsistent with the leader's log.

My proposal is: teach the follower to understand that it matches the leader's log. Then this safety check can be done at the follower end on receiving the Commit index, rather than on the leader when sending it. The advantage is that the follower knows that it matches the leader 1/2 RTT earlier than the leader does, so there is potential to make the follower's Commit index progress faster by 1 RTT in some cases.