Useful guides for the SRE team.
To add a runbook for a new alert, follow the OpenShift runbook template.
- AAPDeploymentReplicasMismatch
- AAPMetricEndpointDown
- AAPPodContainerTerminated
- AAPPodFrequentlyRestarting
- AAPPodNotReady
- AAPPodRestartingTooMuch
- AAPStatefulSetReplicasMismatch
- AlertmanagerClusterFailedToSendAlerts
- AlertmanagerFailedReload
- AlertmanagerFailedToSendAlerts
- ArgoCDSyncAlert
- ClusterOperatorDown
- ClusterVersionOperatorDown
- etcdBackendQuotaLowSpace
- etcdGRPCRequestsSlow
- etcdHighFsyncDurations
- etcdInsufficientMembers
- etcdMembersDown
- etcdNoLeader
- ExtremelyHighIndividualControlPlaneCPU
- HAProxyDown
- KubeAPIDown
- KubeAPIErrorBudgetBurn
- KubeControllerManagerDown
- KubeletDown
- KubePersistentVolumeFillingUp
- KubeQuotaAlmostFull
- KubeSchedulerDown
- KubeStateMetricsListErrors
- KubeStateMetricsWatchErrors
- MachineAPIOperatorMetricsCollectionFailing
- MCDRebootError
- MultipleContainersOOMKilled
- NodeFileDescriptorLimit
- NodeFilesystemAlmostOutOfFiles
- NodeFilesystemAlmostOutOfSpace
- NodeFilesystemFilesFillingUp
- NodeFilesystemSpaceFillingUp
- NodeRAIDDegraded
- PodDisruptionBudgetLimit
- PrometheusErrorSendingAlertsToSomeAlertmanagers
- PrometheusTargetSyncFailure
- Watchdog
SRE quickstart, helpful documents and links.