UWIT-UE / am2alertapi

Prometheus alertmanager to UW alertAPI

am2alertapi gets confused after alertapi outages

EricHorst opened this issue · comments

There were two large-scale AlertAPI outages. In the most recent one, even after the outage was resolved, watchdog errors persisted, showing up as intermittent status code 500 responses on some MCI clusters. (INC2214230)

We found that restarting the am2alertapi pods restored service. Apparently one pod, or perhaps one worker process within a pod, was in a bad state. Looking at the code, we could see no obvious place where it could get stuck: each incoming request is forwarded as a one-shot request, with theoretically no state carried between requests.

Lacking any other bright ideas, I think a good solution would be to use gunicorn's max-requests setting to restart workers periodically (see https://docs.gunicorn.org/en/stable/settings.html#max-requests). The request count should be chosen based on the frequency of health checks, with the goal of restarting each worker every hour or two. If this addresses the problem described, confused workers would be restarted within an hour or two of a major outage. A sketch of such a config is below.
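A minimal sketch of what that could look like in a gunicorn config file. The request-rate numbers here are hypothetical placeholders; the real values would need to be derived from how often health checks actually hit each worker, and the bind/workers settings are only shown to keep the example self-contained.

```python
# gunicorn.conf.py -- sketch only; numbers below are assumptions, not measured values.

# Suppose each worker sees roughly 4 health-check/watchdog requests per minute
# (one every 15 seconds). Recycling after ~240 requests then restarts each
# worker about once an hour. max_requests_jitter staggers the restarts so all
# workers are never recycling at the same moment.
max_requests = 240
max_requests_jitter = 60

# Existing settings (bind address, worker count, timeouts, etc.) would stay as
# they are in the current deployment; placeholders shown for completeness.
bind = "0.0.0.0:8080"
workers = 2
```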

We also found #22 during diagnosis.

After reviewing and fixing #22, I'm pretty convinced that the confusion am2alertapi showed after the AlertAPI outages was entirely related to the metrics not being counted correctly across all am2alertapi pods and worker processes. I'm going to assume that's the case until events prove otherwise.
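For context on why per-worker counting skews metrics: when a Prometheus client runs under multiple gunicorn workers, each worker keeps its own counters, so a scrape only sees the counts of whichever worker happened to answer. A common fix (assuming prometheus_client is the metrics library here; #22 has the actual change) is its multiprocess mode, roughly:

```python
# Sketch of prometheus_client multiprocess aggregation under gunicorn.
# Assumes PROMETHEUS_MULTIPROC_DIR is set in the pod environment; whether
# am2alertapi exports metrics this way is an assumption, not confirmed here.
from prometheus_client import CollectorRegistry, generate_latest, multiprocess


def metrics_payload() -> bytes:
    """Aggregate counters across all gunicorn worker processes.

    Without the MultiProcessCollector, each worker reports only its own
    counts, so /metrics values appear to jump or reset depending on which
    worker serves the scrape.
    """
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)
```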

If the problem resurfaces, I think the next step is to use the gunicorn worker-recycling option sketched above to restart workers periodically and eliminate any bad state.