Handle False Failure Detections

Question

ayushr2 opened this issue 5 years ago · comments

In an asynchronous system, it is almost impossible for failure detection to guarantee both safety and liveness, because a slow or hung node looks the same as a crashed one. This can lead to live nodes being misclassified as dead.

We currently mark a node as dead if it does not send a heartbeat within 20 seconds. However, a machine can hang for longer than that and then resume executing. So in the heartbeat handler, the grading job handler, and the grading result handler, we should check whether the request is coming from a node that is marked dead. If so, mark it as alive again.
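
A rough sketch of what I mean, in Python. This is not the project's actual code: `NodeRegistry`, `sweep`, `touch`, and `on_request_from_node` are hypothetical names made up for illustration, and only the 20 second timeout comes from the description above. The idea is that each of the three handlers would call one shared function that refreshes the node's timestamp and revives it if it had been marked dead.

```python
import time

HEARTBEAT_TIMEOUT_SECONDS = 20  # timeout mentioned in this issue


class NodeRegistry:
    """Hypothetical in-memory liveness table keyed by node ID."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT_SECONDS, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # node_id -> time of last request
        self._alive = {}             # node_id -> bool

    def sweep(self):
        """Mark every node dead whose last request is older than the timeout."""
        now = self._clock()
        for node_id, seen in self._last_seen.items():
            if now - seen > self._timeout:
                self._alive[node_id] = False

    def touch(self, node_id):
        """Record a request from node_id; return True if it was marked dead."""
        revived = self._alive.get(node_id) is False
        self._last_seen[node_id] = self._clock()
        self._alive[node_id] = True
        return revived

    def is_alive(self, node_id):
        return self._alive.get(node_id, False)


def on_request_from_node(registry, node_id):
    """Assumed to be called first by the heartbeat, job, and result handlers."""
    if registry.touch(node_id):
        print(f"node {node_id} was marked dead but sent a request; marking alive")
```

With something like this, a false detection is corrected by the next request from the node, whichever handler receives it.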