atomix / atomix

A Kubernetes toolkit for building distributed applications using cloud native principles

Home Page: https://atomix.io

Failure to install a Raft snapshot can lead to inconsistent primitive states

npepinpe opened this issue

Expected behavior

When a snapshot fails to install properly (e.g. the snapshot was corrupted during replication), or when a service fails to be restored from a snapshot (i.e. the custom primitive throws an exception in its restore method), processing of the state machine should stop, and no further entries should be applied: applying entries on top of an incompletely restored state would leave the node inconsistent.

Ideally, the bad snapshot should be thrown away and the node should fetch a new snapshot from its leader. If the node is restarting and has no leader, it should wait until a leader is elected and then request a snapshot from it.
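
As a rough sketch of that recovery flow (every type and method name below is an illustrative stand-in, not the Atomix API):

```java
import java.util.concurrent.CompletableFuture;

// Rough sketch of the recovery flow proposed above. All types and method
// names here are illustrative stand-ins, not the Atomix API.
final class SnapshotInstaller {

  interface StateMachine {
    void install(byte[] snapshot) throws Exception; // may throw on corruption
    void pause();                                   // stop applying entries
    void resume();                                  // resume entry application
  }

  private final StateMachine stateMachine;

  SnapshotInstaller(StateMachine stateMachine) {
    this.stateMachine = stateMachine;
  }

  void installOrRecover(byte[] snapshot) {
    try {
      stateMachine.install(snapshot);
    } catch (Exception e) {
      // Instead of logging and continuing, halt entry application so nothing
      // is applied on top of a partially restored state...
      stateMachine.pause();
      // ...discard the bad snapshot, wait for a leader, and retry with a
      // freshly fetched snapshot.
      discardSnapshot(snapshot);
      awaitLeader()
          .thenCompose(this::requestSnapshotFrom)
          .thenAccept(fresh -> {
            installOrRecover(fresh);
            stateMachine.resume();
          });
    }
  }

  // Stubs standing in for the real storage/transport plumbing.
  private void discardSnapshot(byte[] snapshot) { /* delete from disk */ }

  private CompletableFuture<String> awaitLeader() {
    return new CompletableFuture<>(); // completes when a leader is known
  }

  private CompletableFuture<byte[]> requestSnapshotFrom(String leader) {
    return new CompletableFuture<>(); // completes with the leader's snapshot
  }
}
```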

When snapshots are replicated to followers, it would probably also be a good idea to include a checksum, so the follower can verify that the snapshot was written correctly before installing it.
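
For illustration, the check could be as simple as a CRC32 over the snapshot bytes. This is only a sketch of the general shape, not Atomix's actual wire format:

```java
import java.io.IOException;
import java.util.zip.CRC32;

// Sketch only, not Atomix's wire format: the leader computes the checksum
// before sending, and the follower rejects the snapshot on mismatch
// instead of installing it.
final class SnapshotChecksum {

  static long of(byte[] snapshotBytes) {
    CRC32 crc = new CRC32();
    crc.update(snapshotBytes);
    return crc.getValue();
  }

  static void verify(byte[] received, long expected) throws IOException {
    if (of(received) != expected) {
      throw new IOException("Snapshot checksum mismatch; rejecting install");
    }
  }
}
```

A CRC is enough to catch truncation or transport corruption; a cryptographic hash would only be needed if tampering were a concern.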

Actual behavior

At the moment, exceptions thrown while installing snapshots are logged and swallowed. This can lead to primitive services being started without having been restored, and to Raft entries then being applied to them, producing inconsistent states between nodes in the same Raft partition.
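
A simplified sketch of the difference between the current log-and-swallow shape and a fail-fast shape (the service interface below is a stand-in, not the actual Atomix source):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified sketch of the failure mode, not the actual Atomix source; the
// service interface and names are stand-ins.
final class RestoreShapes {

  interface PrimitiveService {
    void restore(byte[] snapshot) throws Exception;
  }

  private static final Logger log = Logger.getLogger(RestoreShapes.class.getName());

  // Current shape: the exception is logged and swallowed, so the service
  // keeps running un-restored and later entries are applied to bad state.
  static void restoreSwallowing(PrimitiveService service, byte[] snapshot) {
    try {
      service.restore(snapshot);
    } catch (Exception e) {
      log.log(Level.SEVERE, "Failed to restore service from snapshot", e);
      // Falls through: entry application continues against a blank service.
    }
  }

  // Safer shape: propagate so the caller can halt entry application until a
  // good snapshot has been installed.
  static void restoreOrFail(PrimitiveService service, byte[] snapshot) {
    try {
      service.restore(snapshot);
    } catch (Exception e) {
      throw new IllegalStateException("Snapshot restore failed; halting", e);
    }
  }
}
```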

Steps to reproduce

We had a hard time pinpointing how we ended up with corrupted snapshots (camunda/camunda#2359), but we are not the only ones (see #1007).

To test locally, start a Raft cluster with 3 nodes; once they have all taken a snapshot, corrupt one snapshot manually (e.g. truncate it with `head -c -10 /path/to/partition/123.snapshot` and copy the result back over the original), or simply throw an exception in a primitive service's restore. A sketch of the latter follows.
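
For the exception-in-restore path, something like the following test service is enough (the signatures are stand-ins; with Atomix you would override the equivalent restore hook on your own primitive service):

```java
import java.io.InputStream;
import java.io.OutputStream;

// Minimal sketch of a service whose restore hook always throws. The method
// signatures are stand-ins for whatever backup/restore hooks your primitive
// service actually overrides.
final class FaultyRestoreService {

  public void backup(OutputStream output) {
    // Normal backup; the contents don't matter for this test.
  }

  public void restore(InputStream input) {
    // Always fail, simulating a corrupt or unreadable snapshot.
    throw new IllegalStateException("simulated snapshot corruption");
  }
}
```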

Environment

  • Atomix: master
  • OS: Linux
  • JVM: java version "1.8.0_181" HotSpot
