tikv / raft-rs

Raft distributed consensus algorithm implemented in Rust.

Initialization with term=0 shouldn't be allowed

GuyLewin opened this issue

Describe the bug
Nodes often crash (panic) during the first few moments after Raft starts when each node has a distinct priority value and pre_vote = true.
The lowest-priority node never crashes.

When I looked into it, I found the cause: during pre-voting, a node rejects pre-vote requests from lower-priority nodes:
https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L1467C65-L1467C98

We then try to send the rejection back to the proposing node, including our term as the message's term:
https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L1494C21-L1498C58

But during that send, the term is checked to be != 0, which results in a panic instead of the pre-vote rejection being sent:
https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L617C9-L640C14
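The three steps above can be modelled in a few lines. This is a minimal sketch, not the raft-rs code: `Node`, `Message`, `pre_vote_reply`, and `check_outgoing_term` are hypothetical names, the log up-to-date check is left out, and the accept branch echoing the request's term is an assumption. The term check returns an `Err` where the real code panics, so the failing case can be inspected.

```rust
/// Hypothetical, simplified reply to a pre-vote request.
struct Message {
    term: u64,
    reject: bool,
}

/// Hypothetical node model holding only the fields relevant to this issue.
struct Node {
    term: u64,
    priority: i64,
}

impl Node {
    /// Build the reply to a pre-vote request.
    /// A request from a lower-priority candidate is rejected, and the
    /// rejection carries this node's own term (0 on a fresh node).
    /// Assumption: an accepted request echoes the term from the request.
    fn pre_vote_reply(&self, request_term: u64, candidate_priority: i64) -> Message {
        if candidate_priority < self.priority {
            Message { term: self.term, reject: true }
        } else {
            Message { term: request_term, reject: false }
        }
    }
}

/// Model of the check applied when a vote-type message is sent:
/// the message must carry a non-zero term.
fn check_outgoing_term(m: &Message) -> Result<(), String> {
    if m.term == 0 {
        return Err("term should be set when sending a vote-type message".to_string());
    }
    Ok(())
}
```

In this model a fresh node (term 0) that rejects a lower-priority candidate builds a reply with term 0, which fails the check. The lowest-priority node never rejects on priority, so it never builds such a reply, which matches the observation that it never crashes.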

Expected behavior
The pre-vote rejection is sent and the node does not panic.

commented

Then term should be assigned to vote response.

> Then term should be assigned to vote response.

It is assigned, but the term is still 0 when the node initializes, unless I misunderstood what you were referring to.
I can attach the full backtrace if that helps, but the fatal is being called in this scenario pretty consistently.

commented

I see. When the raft group is first initialized, it should not have term 0 (INVALID_TERM).

@BusyJay makes sense, thank you!
I will initialize all our nodes with term=1 in their Raft state.

commented

> it should not have term 0

Perhaps we could check it when the initial state is loaded? I also ran into this issue a few days ago.

commented

There are two cases: 1. constructing the raft node with existing data, so the node is already initialized and should have a valid term; 2. constructing the raft node without any data, so the node is not initialized and can have term 0.

Though adding a check that term must not be 0 if configuration is not empty should be OK.
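A minimal sketch of such a check, under stated assumptions: `InitialState` and `validate_initial_state` are hypothetical names rather than raft-rs types, and the configuration is reduced to a list of voter IDs. Only the `INVALID_TERM` name and the rule itself come from this thread.

```rust
/// Term value that marks an uninitialized node (term 0, as discussed above).
const INVALID_TERM: u64 = 0;

/// Hypothetical, simplified view of the state a node is constructed with.
struct InitialState {
    term: u64,
    /// Voter IDs in the configuration; empty means the node has no data yet.
    voters: Vec<u64>,
}

/// Reject a state that has a non-empty configuration but term 0.
/// An empty configuration with term 0 is the "not initialized" case and is allowed.
fn validate_initial_state(state: &InitialState) -> Result<(), String> {
    if !state.voters.is_empty() && state.term == INVALID_TERM {
        return Err(format!(
            "term must not be {} when the configuration has {} voter(s)",
            INVALID_TERM,
            state.voters.len()
        ));
    }
    Ok(())
}
```

Running the check when the initial state is loaded would surface the misconfiguration at construction time instead of at the first rejected pre-vote.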

> Though adding a check that term must not be 0 if configuration is not empty should be OK.

I can work on a PR for that.
We were constructing a raft node without any data, but with pre-vote and priority enabled. A node therefore rejected a pre-vote request before any data had been sent, and the error was raised while sending the rejection message.
Either way, this check will help ensure this never happens.

commented

Please send it when you are free, thanks!

commented

Reopened

Created a PR - #513