am2alertapi: fix metrics reporting for multi-worker configuration
EricHorst opened this issue
While diagnosing a recent outage, we noticed that the am2alertapi counters were not increasing as expected. After some thought it became clear that the problem is the multi-worker configuration: the workers run as separate processes, so each worker keeps its own metrics independently, and a scrape only sees the counters of whichever worker happens to answer it. (The worker count was increased in October 2021 in 4539a9c.)
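To make the failure mode concrete, here is a small stdlib-only simulation. It is not am2alertapi code: the `Worker` class, the four-worker count, and the round-robin dispatch are all assumptions chosen for illustration, and a real server distributes requests less evenly.

```python
import itertools


class Worker:
    """Stands in for one server worker process with its own in-memory counter."""

    def __init__(self):
        self.alerts_total = 0

    def handle_request(self):
        # Each worker only ever increments its own private counter.
        self.alerts_total += 1


# Assumption for illustration: 4 workers, requests dispatched round-robin.
workers = [Worker() for _ in range(4)]
dispatch = itertools.cycle(workers)

for _ in range(10):
    next(dispatch).handle_request()

# What a scrape could report, depending on which worker answers it.
per_worker = [w.alerts_total for w in workers]

# What the service as a whole actually handled.
true_total = sum(per_worker)

print("per-worker counters:", per_worker)
print("true total:", true_total)
```

No single worker's counter matches the true total, and consecutive scrapes that land on different workers can even see the value go down, which Prometheus treats as a counter reset.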
Some research suggests two possible solutions. One is to use prometheus_client multiprocess mode, example here: https://github.com/amitsaha/python-prometheus-demo/tree/master/flask_app_prometheus_multiprocessing
The other is to add the worker number as a metric label and aggregate in Prometheus.
Here's a reference: https://echorand.me/posts/python-prometheus-monitoring-options/
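For the first option, a minimal sketch of prometheus_client multiprocess mode might look like the following. This is not the code from am2alertapi.py: the metric name `example_alerts_received`, the helper names `render_metrics` and `on_worker_exit`, and the temporary directory are hypothetical. In a real deployment `PROMETHEUS_MULTIPROC_DIR` would normally be set in the service environment and point at a directory that is emptied on each server start.

```python
import os
import tempfile

# The directory must be configured before prometheus_client is imported.
# Assumption for this sketch: fall back to a fresh temp dir if it is unset.
os.environ.setdefault(
    "PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp(prefix="prom_multiproc_")
)

from prometheus_client import (
    CollectorRegistry,
    Counter,
    generate_latest,
    multiprocess,
)

# Hypothetical metric; each worker process writes its value to a file
# in PROMETHEUS_MULTIPROC_DIR instead of keeping it only in memory.
ALERTS_RECEIVED = Counter(
    "example_alerts_received", "Alerts received (hypothetical example metric)"
)


def render_metrics() -> bytes:
    """Build the /metrics payload by merging the files from all workers."""
    # Use a fresh registry per scrape; the collector reads every
    # worker's file and sums counters across processes.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)


def on_worker_exit(pid: int) -> None:
    """Call from the server's worker-exit hook so live gauges are cleaned up."""
    multiprocess.mark_process_dead(pid)
```

The metrics endpoint would return `render_metrics()` in place of the default registry output, so any worker can answer a scrape with totals for all of them. The second option (a per-worker label) needs no shared directory but pushes the aggregation into every query and alert rule.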
- Implemented multi-process metrics in am2alertapi.py
- Rolled out new version to all clusters and to prom01/prom02 https://github.com/UWIT-UE/am2alertapi/releases/tag/v1.0.7
- Changed alert rules to properly sum counters https://github.com/UWIT-UE/mci-ops/pull/391
- Added am2alertapi scraping and alert rules to prom01/prom02, which did not have them.