medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT

Document how to remove a CHT instance from being alerted on

mrjones-plip opened this issue

There is an instance that was entered into cht-instances.yml, had some alerts fire, and was then removed from cht-instances.yml. However, the alerts continue to fire for that instance. We should figure out how to stop them from firing and document the process.

See the Slack thread.

Okay, I think I made it to the bottom of this issue!

TL;DR: when a server goes down, Prometheus keeps trying to scrape it and keeps recording the value of the up metric as 0. However, when a server is removed from cht-instances.yml, Prometheus stops trying to scrape it and stops recording values for up. The Grafana alert will still continue to fire for ~10 more minutes, since that is the data window we have configured for the alert query (it was the default). Once all up data for the deleted instance is outside that window, Grafana should stop alerting on that instance.
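The window behavior described above can be sketched in a few lines of Python. This is only an illustration, not how Grafana evaluates the rule internally: it assumes the alert looks at the latest up sample per instance inside a 10-minute look-back window, and the function name, instance name, and timestamps are all made up for the example.

```python
from datetime import datetime, timedelta

# Assumed: the alert looks at the most recent `up` sample per instance
# inside a look-back window (10 minutes, per the discussion above).
WINDOW = timedelta(minutes=10)

def firing_instances(samples, now, window=WINDOW):
    """Return the instances whose latest in-window `up` sample is 0.

    `samples` is a list of (timestamp, instance, up_value) tuples.
    """
    latest = {}
    for ts, instance, value in samples:
        if now - window <= ts <= now:
            if instance not in latest or ts > latest[instance][0]:
                latest[instance] = (ts, value)
    return sorted(name for name, (_, value) in latest.items() if value == 0)

# Hypothetical data: the last failed scrape happened at 12:00, after which
# the target was dropped and no further samples were written.
last_scrape = datetime(2023, 5, 18, 12, 0)
samples = [(last_scrape, "deleted-instance", 0)]

# Five minutes later the 0 sample is still inside the window.
print(firing_instances(samples, last_scrape + timedelta(minutes=5)))
# Eleven minutes later it has aged out, so nothing fires.
print(firing_instances(samples, last_scrape + timedelta(minutes=11)))
```

The point of the sketch is that the alert clears because the data disappears from the window, not because any up value of 1 is ever recorded.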

Here is the scenario:

  • 2023-04-17 The Allies instance is configured to monitor a CHT instance (DELETED)
    • This is the first time we start seeing data collected for the instance.
  • 2023-05-01 DELETED is shut down
    • This is the last time we see data collected from this instance
    • The "API Server Down" alert continues to fire because Prometheus is still trying to scrape data from the instance and the up value for DELETED is recorded as 0.
  • 2023-05-14 @mrjones-plip removes DELETED from cht-instances.yml
    • up data continues to be collected for DELETED (presumably because the Allies instance was not restarted at this time 🤔 )
    • Alert continues to fire
  • 2023-05-18 @jkuester restarts the Allies instance (down/up the docker config)
    • Prometheus stops trying to scrape DELETED and no more up values are recorded for that instance
    • ~10min later the "API Server Down" alert stops firing for DELETED because there is no longer any data at all for DELETED in the query window for the alert.
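The timeline above suggests that the running stack kept using the scrape targets it had loaded at startup. A toy Python model of that observed behavior follows. It is a sketch under that assumption only: the class, method names, and hostnames are hypothetical, and it does not reflect Prometheus internals.

```python
class ToyWatchdog:
    """Toy model: scrape targets are read from the config only at (re)start."""

    def __init__(self, config):
        self.config = config          # stands in for cht-instances.yml on disk
        self.active_targets = []
        self.restart()

    def restart(self):
        # Copy so later edits to the config do not leak into the running state.
        self.active_targets = list(self.config)

    def scrape(self, reachable):
        # One `up` sample per active target: 1 if reachable, else 0.
        return {t: int(t in reachable) for t in self.active_targets}


config = ["healthy.example", "deleted.example"]
watchdog = ToyWatchdog(config)

# Editing the file alone does not change what is being scraped.
config.remove("deleted.example")
before_restart = watchdog.scrape(reachable={"healthy.example"})
print(before_restart)

# After a restart the removed target is gone and no new `up` samples appear.
watchdog.restart()
after_restart = watchdog.scrape(reachable={"healthy.example"})
print(after_restart)
```

In this model the removed target keeps producing up = 0 samples until restart() is called, which mirrors the gap between 2023-05-14 and 2023-05-18 in the scenario.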

So, the alert at the root of this issue is not being shown in Grafana any more. I think now we understand why that is and how we expect Watchdog to behave when removing CHT instances (restarting is required!).

For good measure, I will open a Docs PR to add a reminder to always restart the containers when making config changes.