Failure to install a Raft snapshot can lead to inconsistent primitive states
npepinpe opened this issue
Expected behavior
When a snapshot fails to be installed properly (e.g. the snapshot was corrupted during replication), or when a service fails to be restored from a snapshot (i.e. the custom primitive throws an exception in its restore method), processing of the state machine should stop, and we should not apply further entries that could lead to an inconsistent state (since we do not have the correctly restored state).
Ideally, the snapshot should be thrown away and the node should attempt to get a new snapshot from its leader. If the node is currently restarting and has no leader, it should wait until a leader is elected and then request a snapshot from it.
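Roughly, the install path could behave like the sketch below; every name in it (Snapshot, SnapshotInstaller, haltStateMachine, requestSnapshotFromLeader) is made up to illustrate the idea and is not an existing Atomix type.

```java
import java.util.concurrent.CompletableFuture;

// Sketch only: all names here are hypothetical and just illustrate the desired behaviour.
interface Snapshot {
  void install() throws Exception; // restores all primitive services from the snapshot
  void delete();                   // removes the snapshot from disk
}

final class SnapshotInstaller {

  /**
   * Installs a snapshot; on failure the snapshot is discarded, the state machine
   * stops applying further entries, and a fresh snapshot is requested from the
   * leader (waiting for one to be elected if necessary) before retrying.
   */
  void install(Snapshot snapshot) {
    try {
      snapshot.install();
    } catch (Exception e) {
      snapshot.delete();              // throw the corrupted snapshot away
      haltStateMachine();             // do not apply any further Raft entries
      requestSnapshotFromLeader()     // completes once a leader is available
          .thenAccept(this::install); // retry with the fresh snapshot
    }
  }

  private void haltStateMachine() {
    // stop the state machine so no entries are applied on top of bad state
  }

  private CompletableFuture<Snapshot> requestSnapshotFromLeader() {
    // ask the current leader (or wait for an election) to replicate its latest snapshot
    return new CompletableFuture<>();
  }
}
```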
When snapshots are replicated to followers, it would probably also be a good idea to use a checksum, so that at the end of replication we can verify that the correct snapshot was written on the follower.
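Something as simple as a CRC32 over the snapshot bytes would do; again, the class and method names below are just a sketch, not existing Atomix code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class SnapshotChecksum {

  /** Computes a CRC32 over the snapshot bytes; the leader would do this before replicating. */
  static long checksumOf(Path snapshotFile) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(Files.readAllBytes(snapshotFile));
    return crc.getValue();
  }

  /** Follower-side check: reject the replicated snapshot if it does not match the leader's checksum. */
  static void verify(Path snapshotFile, long expectedChecksum) throws IOException {
    long actual = checksumOf(snapshotFile);
    if (actual != expectedChecksum) {
      throw new IOException(String.format(
          "Snapshot %s is corrupted: expected CRC32 %d but was %d",
          snapshotFile, expectedChecksum, actual));
    }
  }
}
```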
Actual behavior
At the moment, exceptions thrown while installing a snapshot are logged and swallowed; this can lead to primitive services being started without having been restored, and to Raft entries then being applied to those primitives, producing inconsistent states between nodes in the same Raft cluster.
Steps to reproduce
We had a hard time pinpointing how we ended up with corrupted snapshots (camunda/camunda#2359), but we are not the only ones (see #1007).
To reproduce locally, start a Raft cluster with 3 nodes; once they have all taken a snapshot, corrupt one snapshot manually (e.g. truncate it with head -c -10 /path/to/partition/123.snapshot and copy the result back over the original file), or simply throw an exception in a primitive service's restore method, as in the fragment below.
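For the restore option, a fragment like this in a test-only service is enough (assuming the Atomix 3.x restore(BackupInput) hook; the enclosing service class is not shown):

```java
// Test-only override inside an existing primitive service implementation.
// Assumes the Atomix 3.x restore(BackupInput) hook.
@Override
public void restore(BackupInput reader) {
  // Simulate a corrupted snapshot: restoring must fail instead of silently succeeding.
  throw new IllegalStateException("simulated restore failure");
}
```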
Environment
- Atomix: master
- OS: linux
- JVM: java version "1.8.0_181" HotSpot