larssb / HealOps

A monitoring and healing framework. It uses Pester tests (TDD) to determine the state of an IT system entity. If the entity is in a faulted state, HealOps tries to repair it. Throughout, HealOps reports metrics to a backend reporting system and sends its status to stakeholders, for example to trigger alarms and get on-call personnel onto an issue that could not be repaired.

Home Page: https://healops.readthedocs.io/en/latest/

Implement an alarm so that it becomes known when state data could not be reported.

larssb opened this issue

  • Find the proper way to do it, since in this situation we cannot report back to the 'x' time series database.
  • Is e-mail good enough? It is not a fail-safe channel (SMTP is not fail-safe).
  • Slack? That would be good.
  • A direct integration with an incident management system 'x' such as OpsGenie, through a webhook or similar.
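
The options above share one idea: when reporting to the time series database fails, raise the alarm over a second, independent channel. A minimal Python sketch of that fallback follows (HealOps itself is PowerShell; Python is used here only for illustration). The names `report_with_fallback` and `post_webhook` are hypothetical and not part of HealOps. The `{"text": ...}` body matches the Slack incoming-webhook format; an incident management system such as OpsGenie would expect a different payload.

```python
import json
import urllib.request


def post_webhook(url, message, timeout=10):
    """POST a JSON message to a webhook (Slack-style {"text": ...} body)."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def report_with_fallback(report, notify, metric):
    """Try to report a metric; if that fails, alert over a second channel.

    `report` and `notify` are caller-supplied callables (hypothetical names).
    Returns True if the metric was reported, False if the fallback was used.
    """
    try:
        report(metric)
        return True
    except Exception as error:  # broad on purpose: any reporting failure counts
        notify(f"HealOps could not report state data for {metric!r}: {error}")
        return False
```

In use, `notify` could be a small function that calls `post_webhook` with the team's webhook URL. Passing both channels in as callables keeps the fallback logic independent of any particular backend, though the fallback channel can of course fail too.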

Dropping this, as the way to go, at least for now, is to have e.g. Grafana alert the on-call personnel if panel 'x' has no data.

  • This is easier.
  • Makes HealOps simpler, both code-wise and architecture-wise.
  • The feature is already in Grafana, so why not use it!
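
Grafana performs the no-data check itself, so HealOps needs no code for it. Purely to illustrate what such a check amounts to, here is a small Python sketch; `is_stale` and the 10-minute threshold are assumptions made for the example, not Grafana or HealOps settings.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_datapoint, now, max_age=timedelta(minutes=10)):
    """Return True when no datapoint has arrived within `max_age`.

    `last_datapoint` is None when the series has never reported.
    """
    if last_datapoint is None:
        return True
    return now - last_datapoint > max_age


# Example: the last datapoint is 30 minutes old, so the series counts as stale.
now = datetime.now(timezone.utc)
print(is_stale(now - timedelta(minutes=30), now))  # True
```

Because the check runs on the monitoring side, it still fires when HealOps cannot report at all, which is the case this issue is about.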