Enhancing fault tolerance for Katalyst Agent

Question

caohe opened this issue a year ago · comments

What would you like to be added?

We propose enhancing the fault tolerance capabilities of the Katalyst Agent to ensure more reliable operation in the presence of failures. This includes the following two main aspects:

Enhanced Health Check Criteria: We aim to incorporate a broader range of factors into the health check endpoint of the Katalyst Agent. Currently, the health check primarily focuses on basic connectivity and liveness. We suggest extending this to consider additional dimensions such as the status of the QRM Plugin. This would provide a more comprehensive assessment of the Agent's operational state and help prevent potential issues before they escalate.
Diversified Failure Handling: Currently, when the Katalyst Agent encounters a failure, it employs a limited set of recovery measures, such as preventing further scheduling and eviction. We believe the system's reliability would benefit greatly from a wider range of actions that can be taken in response to Agent failures, for example dynamically adjusting resource allocations. Diversifying the recovery strategies increases the likelihood of successful recovery across different failure scenarios.

Why is this needed?

Currently, the health checks and failure handling measures for the Katalyst Agent are limited and cannot meet stability requirements.