feiyu563 / PrometheusAlert

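Prometheus Alert is an open-source message forwarding system for operations alert centers. It accepts alert messages from the mainstream monitoring systems Prometheus and Zabbix, the logging system Graylog, and the data visualization system Grafana, and supports DingTalk, WeChat, Huawei Cloud SMS, Tencent Cloud SMS, Tencent Cloud voice calls, Alibaba Cloud SMS, Alibaba Cloud voice calls, and more.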

Home Page:https://feiyu563.gitbook.io

Prometheus alert rule examples: contributions welcome

Zhang21 opened this issue

This site has many sample Prometheus alert rules: https://awesome-prometheus-alerts.grep.to/



# Available memory on CentOS 6 and 7 (falls back to a sum where MemAvailable is not exposed)
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
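
As a sketch of how this expression might be used, it could be wrapped in a recording rule and an alert. The group name node-memory, the rule name instance:node_memory_available:bytes, the 10% threshold, and the label values are assumptions chosen for illustration; they are not from the original post. The hostname label follows the rules in this thread and is assumed to be attached by the scrape configuration.

```yaml
groups:
- name: node-memory   # hypothetical group name
  rules:
  # Available memory in bytes; the right-hand side of "or" is used
  # only for hosts that do not report node_memory_MemAvailable_bytes
  - record: instance:node_memory_available:bytes   # hypothetical rule name
    expr: |
      node_memory_MemAvailable_bytes
      or
      (node_memory_Buffers_bytes + node_memory_Cached_bytes
      + node_memory_MemFree_bytes + node_memory_Slab_bytes)
  # Fire when less than 10% of total memory is available (threshold is an assumption)
  - alert: Available memory below 10%
    expr: instance:node_memory_available:bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "Available memory below 10%"
      description: "Host {{ $labels.hostname }} has {{ $value | humanize }}% of memory available"
```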

An example Prometheus rules file. The level label selects the notification method; level and kind together are used for alert inhibition.


groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
  - alert: CPU load above 2x core count
    expr: node_load1 > (instance:node_cpus:count * 2)
    for: 4m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      alertgroup: ops
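
The inhibition that the level and kind labels are meant for would live in the Alertmanager configuration rather than in the rules file. A minimal sketch, assuming every alert carries a kind label (for example CpuUsage or CpuLoad) in addition to level, and that a level 4 alert should mute the level 3 alert of the same kind on the same instance. The label values and the choice of equal labels are assumptions, not something stated in this thread.

```yaml
inhibit_rules:
  # While a level 4 alert is firing, mute level 3 alerts
  # that share the same kind and instance labels.
  - source_matchers:
      - level = "4"
    target_matchers:
      - level = "3"
    equal: ['kind', 'instance']
```

The source_matchers and target_matchers fields need Alertmanager 0.22 or later; older versions use source_match and target_match instead.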

To fire or suppress alerts only at certain times, see: https://www.robustperception.io/combining-alert-conditions

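groups:
- name: Restrict to a specific time range
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: Prometheus works in UTC. Assuming local time is UTC+8, 00:00-06:00 local is 16:00-22:00 UTC
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)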
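
With a real condition in place of the placeholder, a sketch might look as follows. It reuses the instance:node_cpu_utilization:ratio recording rule from the earlier example and assumes local time is UTC+8, so the quiet window of 00:00 to 06:00 corresponds to 16:00 to 22:00 UTC. The group and alert names are made up for illustration.

```yaml
groups:
- name: cpu-outside-quiet-hours   # hypothetical group name
  rules:
  - alert: CPU usage above 88% outside quiet hours
    # hour() returns the current UTC hour (0-23) as a single sample without labels.
    # "and on()" keeps the left-hand series only while the right-hand side is non-empty,
    # that is, while the UTC hour is before 16 or at/after 22.
    expr: |
      instance:node_cpu_utilization:ratio * 100 > 88
      and on() (hour() < 16 or hour() >= 22)
    for: 5m
    labels:
      severity: critical
      level: 3
```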

Hello! I want to change wxurl here to emailurl. Can I write it as emailurl: url?

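Hello, with custom templates, how can a single alert be sent to several channels at once (for example webhook, email, and SMS)? In my tests only one channel receives the alert. My Alertmanager test configuration is: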

global:
  resolve_timeout: 5m
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:

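Add continue: true below the receiver in each child route. By default Alertmanager stops at the first child route that matches; with continue: true it goes on to evaluate the following sibling routes, so several receivers can get the same alert.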
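
As a rough sketch of that advice applied to the configuration above: the email and sms receiver names are hypothetical, the child routes have no matchers and therefore match every alert, and the actual notification settings of each receiver are left out.

```yaml
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:
    # Without continue: true, matching stops at the first child route that matches.
    - receiver: 'webhook'
      continue: true
    - receiver: 'email'   # hypothetical receiver name
      continue: true
    - receiver: 'sms'     # hypothetical receiver name
receivers:
  # webhook_configs / email_configs for each receiver are omitted here
  - name: 'webhook'
  - name: 'email'
  - name: 'sms'
```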

Hello, are there any alert rules for Oracle?

A question:

This site has many sample Prometheus alert rules: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7 (falls back to a sum where MemAvailable is not exposed)
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)

An example Prometheus rules file. The level label selects the notification method; level and kind together are used for alert inhibition.

groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"

  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
      kind: CpuLoad
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"

  - alert: CPU load above 2x core count
    expr: node_load5 > (instance:node_cpus:count * 2)
    for: 2m
    labels:
      severity: critical
      level: 3
      kind: CpuLoad
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      wxurl: "webhook1, webhook2"

To fire or suppress alerts only at certain times, see: https://www.robustperception.io/combining-alert-conditions

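groups:
- name: Restrict to a specific time range
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: Prometheus works in UTC. Assuming local time is UTC+8, 00:00-06:00 local is 16:00-22:00 UTC
    expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)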

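How is this used? I could not find it in the documentation. Where is the address that webhook1 stands for configured, in app.conf? Or should the full addresses be written directly in here?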

wxurl: "webhook1, webhook2"
mobile: "13xxx, 15xxx"

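@running-db
Write the addresses directly in there; the documentation covers this. A later release added an alert group feature: you can define alert groups in app.conf and then put one or more of those group names in the rules.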
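
Put together, the two approaches from this reply might look like this in a rule's annotations. This is only a sketch: the URLs are placeholders, the group name ops is borrowed from the alertgroup annotation in the first rules example, and the matching alert group definition in app.conf is not shown because its keys are not given in this thread.

```yaml
# Option 1: write the full addresses directly into the annotations
annotations:
  wxurl: "https://example.com/webhook1, https://example.com/webhook2"   # placeholder URLs
  mobile: "13xxx, 15xxx"
---
# Option 2: reference an alert group that is defined in app.conf
annotations:
  alertgroup: ops
```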