Handle False Failure Detections
ayushr2 opened this issue · comments
Ayush Ranjan commented
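In an asynchronous system, a failure detector cannot reliably provide both safety (never declaring a live node dead) and liveness (eventually detecting every dead node), because a slow node is indistinguishable from a crashed one. This can lead to live nodes being misclassified as dead.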
We currently mark a node as dead if it does not send a heartbeat within 20 seconds. However, a machine can hang for longer than that and then resume executing. So in the heartbeat handler, grading job handler, and grading result handler, we should check whether the request is coming from a node that is marked dead. If so, mark it as alive again.
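
A rough sketch of the idea, not the project's actual code: every name here (`NodeRegistry`, `touch`, `sweep`, `is_alive`) is hypothetical, and only the 20-second timeout comes from this issue. Each of the three handlers would call `touch(node_id)` before doing its normal work, so any request from a node marked dead brings it back to alive.

```python
import time

HEARTBEAT_TIMEOUT = 20.0  # seconds; the timeout mentioned in this issue


class NodeRegistry:
    """Hypothetical in-memory tracker of node liveness."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so tests can fake time
        self._last_seen = {}         # node_id -> time of last contact
        self._alive = {}             # node_id -> bool

    def touch(self, node_id):
        """Record contact from a node. Returns True if it was marked dead
        and has just been revived, False otherwise."""
        revived = self._alive.get(node_id) is False
        self._last_seen[node_id] = self._clock()
        self._alive[node_id] = True
        return revived

    def sweep(self):
        """Mark nodes dead if they have been silent for longer than the
        timeout. Returns the list of nodes newly marked dead."""
        now = self._clock()
        newly_dead = []
        for node_id, seen in self._last_seen.items():
            if self._alive[node_id] and now - seen > self._timeout:
                self._alive[node_id] = False
                newly_dead.append(node_id)
        return newly_dead

    def is_alive(self, node_id):
        # Unknown nodes are treated as not alive (an assumption).
        return self._alive.get(node_id, False)
```

One open question this sketch does not answer: jobs that were reassigned while the node was considered dead may now come back with two results, so the grading result handler probably needs to tolerate duplicates.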