etcd-io / raft

Raft library for maintaining a replicated state machine

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Probe fom index less than match

joshuazh-x opened this issue · comments

I'm learning probe mechanism and notice there is chance for replication starting from an index less than match index in probe state, unlike that in replicate state. For example, a rejected probe response got delayed and is received when the leader falls into another probe state. This is a rare case of cause.

I'm not sure if this is by design. But to my understanding, allowing such a smaller index implies we want to tolerate loss of already replicated log entries.

func (pr *Progress) MaybeDecrTo(rejected, matchHint uint64) bool {
	if pr.State == StateReplicate {
		// The rejection must be stale if the progress has matched and "rejected"
		// is smaller than "match".
		if rejected <= pr.Match {
			return false
		}
		// Directly decrease next to match + 1.
		//
		// TODO(tbg): why not use matchHint if it's larger?
		pr.Next = pr.Match + 1
		return true
	}

	// The rejection must be stale if "rejected" does not match next - 1. This
	// is because non-replicating followers are probed one entry at a time.
	if pr.Next-1 != rejected {
		return false
	}

	pr.Next = max(min(rejected, matchHint+1), 1)
	pr.MsgAppFlowPaused = false
	return true
}

Hi @joshuazh-x. The probe/replicate state distinction are heuristics. Regardless of what they do, it does not impact raft safety properties. Basically, probe/replicate try to guess what's the good MsgApp message to send. But ultimately, the follower gates these messages into the storage and rejects if they are inconsistent with the log state.

But to my understanding, allowing such a smaller index implies we want to tolerate loss of already replicated log entries.

Could you explain a little what you mean by loss? Do you mean we would be sending entries that are already in Match a second time?

Are you saying that this line sets pr.Next to be less than pr.Match?

pr.Next = max(min(rejected, matchHint+1), 1)

If so, that would not be by design, rather a bug. The invariant is that Next > Match.

This clause should probably check with pr.Match instead of pr.Next-1, just like the StateReplicate clause above. At the moment this code assumes there is only one probe in flight (in which case pr.Next-1 == pr.Match, and the two checks are equivalent), but this is generally wrong because messages can be delayed, reordered, and duplicated.

We need a test that demonstrates the bug, and a fix.

Hi @joshuazh-x. The probe/replicate state distinction are heuristics. Regardless of what they do, it does not impact raft safety properties. Basically, probe/replicate try to guess what's the good MsgApp message to send. But ultimately, the follower gates these messages into the storage and rejects if they are inconsistent with the log state.

But to my understanding, allowing such a smaller index implies we want to tolerate loss of already replicated log entries.

Could you explain a little what you mean by loss? Do you mean we would be sending entries that are already in Match a second time?
I meant we don't need to replicate entries preceding match index, The only necessity to do so is when follower would have lost replicated log entries during the same term, which is impossible.