robinhood / airflow-prometheus-exporter

Prometheus Exporter for Airflow

Recommendations for using the metrics of this exporter for alerting

panovst opened this issue

Hello!

I want to monitor Airflow for a problem with DAG execution.
For example:

  1. alert: a DAG has had no successful run within the last day (grouped by dag_id)
  2. alert: a DAG's completion status is anything other than success (grouped by dag_id)

I'm trying to write alerts based on the metrics of this exporter, specifically airflow_dag_status, but I haven't been able to figure out how to do it yet.

In our projects, we do not re-run a failed task or a failed DAG, because it is enough for us that the problem goes away the next time the DAG runs.
Therefore, in our case the airflow_dag_status metric will never decrease, for example for the failed status.

I understand how to write such alerts for a metric that returns either 0 (the task or DAG is not in this status) or 1 (the task or DAG is in this status), but I don't understand how to use a metric that counts, for a specific task or DAG, the total number of times it has been in a given status over its entire history.

I think that in the case of Airflow, the most important thing is not how many times a particular DAG has been in a particular state, but what state that DAG is in now.

Could you give recommendations on how to write alerts from the examples above, based on the metrics of this exporter (if at all possible)?

Related questions in the base project: epoch8/airflow-exporter#79 and epoch8/airflow-exporter#43
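
To make the two alerts above concrete, here is a minimal sketch of one possible approach: since the metric is a running total, look at how much it grew inside a time window instead of at its absolute value. This is only an illustration in Python, not the exporter's API. It assumes that airflow_dag_status yields one series per (dag_id, status) pair, that the status values are "success" and "failed", and that samples arrive sorted by timestamp. The names `window_increase`, `evaluate_alerts`, and both alert names are hypothetical. The second check only approximates alert 2: it reports that a failure was counted inside the window, not the state of the most recent run.

```python
from typing import Dict, List, Tuple

# (unix timestamp in seconds, cumulative count); assumed sorted by timestamp
Sample = Tuple[float, float]

DAY = 24 * 60 * 60


def window_increase(samples: List[Sample], now: float, window: float) -> float:
    """Growth of a cumulative count between the oldest and newest sample in the window."""
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0.0  # not enough data points to measure any growth
    return in_window[-1] - in_window[0]


def evaluate_alerts(
    series: Dict[Tuple[str, str], List[Sample]],
    now: float,
    window: float = DAY,
) -> Dict[str, List[str]]:
    """Return the hypothetical alerts firing per dag_id.

    `series` is keyed by (dag_id, status), which is an assumed label layout.
    """
    alerts: Dict[str, List[str]] = {}
    for dag_id in sorted({dag_id for dag_id, _ in series}):
        success = window_increase(series.get((dag_id, "success"), []), now, window)
        failed = window_increase(series.get((dag_id, "failed"), []), now, window)
        fired = []
        if success <= 0:
            fired.append("NoSuccessfulRunInWindow")  # alert 1
        if failed > 0:
            fired.append("FailedRunInWindow")  # approximation of alert 2
        if fired:
            alerts[dag_id] = fired
    return alerts


# Made-up sample data: "etl" succeeded once in the last day, "report" only failed.
now = 10 * DAY
series = {
    ("etl", "success"): [(now - DAY + 60, 41.0), (now - 60, 42.0)],
    ("etl", "failed"): [(now - DAY + 60, 3.0), (now - 60, 3.0)],
    ("report", "success"): [(now - DAY + 60, 7.0), (now - 60, 7.0)],
    ("report", "failed"): [(now - DAY + 60, 1.0), (now - 60, 2.0)],
}
print(evaluate_alerts(series, now))
# → {'report': ['NoSuccessfulRunInWindow', 'FailedRunInWindow']}
```

In Prometheus itself, the windowed difference computed by `window_increase` corresponds roughly to what the PromQL functions `increase()` or `delta()` return over a range such as `[1d]`. Which of the two fits depends on whether the exporter exposes the metric as a counter or a gauge, which this sketch does not assume either way.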