UWIT-UE / am2alertapi

Prometheus alertmanager to UW alertAPI

am2alertapi fix metrics reporting for multi-worker configuration

EricHorst opened this issue

While diagnosing a recent outage, we noticed that the am2alertapi counters were not increasing as expected. After some thought it became clear that the cause is the multi-worker configuration: the workers run as separate processes, so each worker keeps its own metrics independently, and a scrape only sees the counters of whichever worker happens to answer it. (The worker count was increased in October 2021 in 4539a9c.)

Research suggests one solution is to use prometheus_client multiprocess mode; there is an example here: https://github.com/amitsaha/python-prometheus-demo/tree/master/flask_app_prometheus_multiprocessing
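For illustration, here is a minimal sketch of prometheus_client multiprocess mode. It is not the code from am2alertapi.py: the metric name and the helper functions `record_alert` and `render_metrics` are made up for the example, and the temporary directory stands in for a `PROMETHEUS_MULTIPROC_DIR` that would normally be set in the service's environment.

```python
import os
import tempfile

# Multiprocess mode is switched on by this environment variable, and it has to
# be set before prometheus_client is imported. A throwaway temp directory is
# used here only so the sketch runs on its own (assumption, not the real setup).
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp())

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# Hypothetical metric name, not taken from am2alertapi.
ALERTS_RECEIVED = Counter(
    "am2alertapi_alerts_received_total",
    "Alerts received from Alertmanager",
)


def record_alert() -> None:
    """Called by whichever worker handles the request; writes to that worker's file."""
    ALERTS_RECEIVED.inc()


def render_metrics() -> bytes:
    """Build the /metrics payload by merging the files written by all workers."""
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)
```

Each worker process writes its samples to its own file in the shared directory, and the collector sums counters across those files at scrape time, so any worker can serve the full total. The directory should be emptied whenever the service starts, otherwise values left over from a previous run are included.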

Alternatively, add the worker number as a metric label and aggregate in Prometheus.
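To make the worker-label option concrete, the sketch below shows in plain Python what the aggregation step amounts to: series that differ only in a `worker` label are merged by summing their values, which is roughly what `sum without (worker) (...)` does in PromQL. The label name `worker`, the metric name, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def sum_over_workers(samples):
    """Sum samples that differ only in the (hypothetical) "worker" label.

    samples: iterable of (metric_name, labels_dict, value) tuples.
    Returns a dict keyed by (metric_name, sorted tuple of remaining labels).
    """
    totals = defaultdict(float)
    for name, labels, value in samples:
        # Drop the worker label so per-worker series collapse into one key.
        remaining = tuple(sorted((k, v) for k, v in labels.items() if k != "worker"))
        totals[(name, remaining)] += value
    return dict(totals)


# Made-up scrape results from two workers.
samples = [
    ("alerts_total", {"worker": "0", "status": "ok"}, 3.0),
    ("alerts_total", {"worker": "1", "status": "ok"}, 4.0),
    ("alerts_total", {"worker": "1", "status": "error"}, 1.0),
]
totals = sum_over_workers(samples)
```

With this approach every alert rule that reads the counters has to aggregate over the worker label, whereas multiprocess mode does the merge inside the application.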

Here's a reference: https://echorand.me/posts/python-prometheus-monitoring-options/

  1. Implemented multi-process metrics in am2alertapi.py
  2. Rolled out the new version to all clusters and to prom01/prom02: https://github.com/UWIT-UE/am2alertapi/releases/tag/v1.0.7
  3. Changed the alert rules to sum the counters properly: https://github.com/UWIT-UE/mci-ops/pull/391
  4. Added am2alertapi scraping and alert rules to prom01/prom02, which did not have them.