Document how to remove a CHT instance from being alerted on
mrjones-plip opened this issue · comments
An instance was added to cht-instances.yml, fired some alerts, and was then removed from cht-instances.yml. However, the alerts continue to fire for that instance. We should figure out how to stop them from firing and document the process.
See Slack thread.
Okay, I think I made it to the bottom of this issue!
TLDR is that when a server goes down, Prometheus will keep trying to scrape it and keep recording values for the up metric as 0. However, when a server is removed from cht-instances.yml (and the monitoring containers are restarted), Prometheus will stop trying to scrape it and will stop recording values for up. The Grafana alert will still continue to fire for ~10 more minutes since that is the data window we have configured for the alert query (it was the default). Once all up data for the deleted instance is outside that window, Grafana should stop alerting on that instance.
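To make the window behavior concrete, here is a minimal sketch in Python. It is a toy model, not Grafana's actual evaluation code: the function name, the 60-second scrape interval, and the "latest sample in the window equals 0" rule are all assumptions chosen to match what we observed in this thread.

```python
from typing import List, Tuple

# Assumed alert query window: the ~10 minutes mentioned above.
WINDOW_SECONDS = 10 * 60


def alert_fires(samples: List[Tuple[float, int]], now: float,
                window: float = WINDOW_SECONDS) -> bool:
    """Toy model of the alert: fire if the newest `up` sample in the window is 0.

    `samples` is a list of (timestamp_seconds, up_value) pairs.
    If no samples fall inside the window, the alert does not fire, which is
    the behavior observed in this thread (an assumption about the config).
    """
    in_window = [(ts, value) for ts, value in samples if now - window <= ts <= now]
    if not in_window:
        return False
    _, latest_value = max(in_window, key=lambda sample: sample[0])
    return latest_value == 0


# Hypothetical data: the server is down, so every scrape records up = 0.
# Scrapes happen every 60 s and stop after t = 600 (instance removed + restart).
removed_instance = [(float(t), 0) for t in range(0, 601, 60)]

# Same outage, but the instance is still configured, so scraping never stops.
still_configured = [(float(t), 0) for t in range(0, 1801, 60)]
```

With this model, `alert_fires(removed_instance, 900)` is still true because samples from the last 10 minutes exist, while `alert_fires(removed_instance, 1260)` is false because every sample has aged out of the window. For `still_configured`, the alert keeps firing for as long as the server is down.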
Here is the scenario:
2023-04-17
- Allies instance is configured to monitor CHT instance (DELETED).
- This is the first time we start seeing data collected for the instance.
2023-05-01
- DELETED is shut down.
- This is the last time we see data collected from this instance.
- The "API Server Down" alert continues to fire because Prometheus is still trying to scrape data from the instance and the up value for DELETED is recorded as 0.
2023-05-14
- @mrjones-plip removes DELETED from cht-instances.yml.
- up data continues to be collected for DELETED (presumably because the Allies instance was not restarted at this time 🤔).
- Alert continues to fire.
2023-05-18
- @jkuester restarts the Allies instance (down/up the docker config).
- Prometheus stops trying to scrape DELETED and no more up values are recorded for that instance.
- ~10min later the "API Server Down" alert stops firing for DELETED because there is no longer any data at all for DELETED in the query window for the alert.
So, the alert at the root of this issue is not being shown in Grafana any more. I think now we understand why that is and how we expect Watchdog to behave when removing CHT instances (restarting is required!).
For good measure, I will open a Docs PR to add a reminder to always restart the containers when making config changes.
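One way to confirm that a restart actually took effect is to ask Prometheus which targets it is still scraping. The sketch below uses the Prometheus HTTP API endpoint /api/v1/targets and checks whether a given host still appears among the active targets. The Prometheus URL, the helper names, and the example host are assumptions for illustration. Whether the port is reachable from your machine depends on how the containers are deployed.

```python
import json
import urllib.request
from typing import Any, Dict, List


def active_target_ids(targets_payload: Dict[str, Any]) -> List[str]:
    """Collect the scrape URL and `instance` label of every active target.

    Expects the JSON body returned by Prometheus' /api/v1/targets endpoint.
    """
    active = targets_payload.get("data", {}).get("activeTargets", [])
    ids: List[str] = []
    for target in active:
        ids.append(target.get("scrapeUrl", ""))
        ids.append(target.get("labels", {}).get("instance", ""))
    return ids


def is_still_scraped(targets_payload: Dict[str, Any], instance_host: str) -> bool:
    """Return True if `instance_host` appears in any active target.

    This is a plain substring match, so pass the full host name to avoid
    matching a different instance that shares a prefix.
    """
    return any(instance_host in ident for ident in active_target_ids(targets_payload))


def fetch_targets(prometheus_url: str = "http://localhost:9090") -> Dict[str, Any]:
    """Fetch the target list from a running Prometheus (URL is an assumption)."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets") as resp:
        return json.load(resp)


# Hypothetical response shape, trimmed to the fields used above.
example_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "http://json-exporter:7979/probe?target=https%3A%2F%2Fkept.example.org",
                "labels": {"instance": "https://kept.example.org", "job": "cht"},
                "health": "up",
            }
        ],
        "droppedTargets": [],
    },
}
```

Against a live setup, `is_still_scraped(fetch_targets(), "deleted.example.org")` should return false once the instance has been removed from cht-instances.yml and the containers have been restarted. If it is still true, the old config is probably still loaded.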