Prometheus alerting rule examples: contributions welcome

Question

Zhang21 opened this issue 4 years ago · comments

This site has many Prometheus alerting rule examples: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
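
As a sketch of how this fallback expression might be used in a rule file: the `or` operator keeps node_memory_MemAvailable_bytes where the kernel exports it and falls back to the summed estimate on hosts where it is missing. The group name, the recording rule name, the alert name, and the 10% threshold below are illustrative assumptions, not part of the original post.

```yaml
groups:
- name: node-memory            # hypothetical group name
  rules:
  # Available memory in bytes, with a fallback for kernels without MemAvailable
  - record: instance:node_memory_available:bytes   # hypothetical rule name
    expr: >-
      node_memory_MemAvailable_bytes
      or
      (node_memory_Buffers_bytes + node_memory_Cached_bytes
       + node_memory_MemFree_bytes + node_memory_Slab_bytes)

  # Fires when less than 10% of total memory is available (threshold is an assumption)
  - alert: Available memory below 10%
    expr: instance:node_memory_available:bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "Available memory below 10%"
      description: "Host {{ $labels.instance }} has {{ $value | humanize }}% memory available"
```

Recording the expression once keeps the alert expression short and lets dashboards reuse the same series.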

An example Prometheus rules file. The level label selects the notification method; level and kind together are used for alert inhibition.

groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
  - alert: CPU load above 2x core count
    expr: node_load1 > (instance:node_cpus:count * 2)
    for: 4m
    labels:
      severity: critical
      level: 3
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      alertgroup: ops
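
The inhibition mentioned above is configured in Alertmanager rather than in the rule file. A minimal sketch, assuming Alertmanager 0.22 or later (for source_matchers and target_matchers), and assuming every alert carries a level label, a kind label such as CpuUsage or CpuLoad, and the standard instance label:

```yaml
# alertmanager.yml fragment (a sketch, not the author's actual configuration)
inhibit_rules:
# While a level 4 alert is firing, mute level 3 alerts of the same kind on the same host
- source_matchers:
  - 'level="4"'
  target_matchers:
  - 'level="3"'
  equal: ['kind', 'instance']
# While a level 3 alert is firing, mute level 2 alerts of the same kind on the same host
- source_matchers:
  - 'level="3"'
  target_matchers:
  - 'level="2"'
  equal: ['kind', 'instance']
```

Alertmanager treats a label that is missing on both alerts as equal, so without a kind label a CPU usage alert could also mute a CPU load alert on the same host.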

To fire or suppress alerts during specific hours, see: https://www.robustperception.io/combining-alert-conditions

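groups:
- name: specific time window
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: hour() is evaluated in UTC. 00:00-06:00 in UTC+8 is 16:00-22:00 UTC.
    expr: <your PromQL expression> and on() (hour() < 16 or hour() >= 22)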
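
A concrete version of this pattern, as a sketch: the alert condition (node_load5 > 4), the group name, and the alert name are placeholders chosen for illustration. The quiet window is 00:00-06:00 in UTC+8, which is 16:00-22:00 UTC. A chained comparison such as hour() > 9 < 17 selects hours inside a range; selecting hours outside a range needs two comparisons joined with or.

```yaml
groups:
- name: time-window-example        # hypothetical group name
  rules:
  - alert: High load outside quiet hours
    # The left side is the real alert condition (placeholder here).
    # The right side yields a sample only when the current UTC hour is
    # before 16 or from 22 onward, so nothing fires from 16:00 to 21:59 UTC.
    expr: node_load5 > 4 and on() (hour() < 16 or hour() >= 22)
    for: 5m
    labels:
      severity: warning
      level: 2
```

Because hour() returns a single sample with no labels, on() with an empty label list lets it match every series on the left side.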

TarsCppCIDemo · Answer 1 · Wed Mar 31 2021 13:35:15 GMT+0800 (China Standard Time)

Hello! If I change wxurl here to emailurl, can I write it as emailurl: url?

zhangming · Answer 2 · Sat Nov 13 2021 16:39:56 GMT+0800 (China Standard Time)

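Hello, with custom templates, how can a single alert be sent to several channels at once (for example webhook and email as well as SMS)? In my tests only one channel receives the alert. My Alertmanager test configuration is: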

global:
  resolve_timeout: 5m
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:
  - receiver: 'prometheusalert-email'
  - receiver: 'prometheusalert-dd'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://172.0.0.1:8891/alertmanager/addAlert/'
- name: 'prometheusalert-email'
  webhook_configs:
  - url: 'http://172.0.0.1:8080/prometheusalert?type=email&tpl=prometheus-email&email=t***@126.com'
- name: 'prometheusalert-dd'
  webhook_configs:
  - url: 'http://172.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=******&at=******'

michael-liumh · Answer 3 · Thu Feb 17 2022 15:31:48 GMT+0800 (China Standard Time)

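Add continue: true to each entry under routes, next to its receiver, so that Alertmanager keeps evaluating the following sibling routes after one matches.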
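
Applied to the route section from the question, this might look like the sketch below. One addition here is an assumption beyond the answer: the webhook receiver is also listed as a child route, because the top-level receiver is only used when no child route matches, and child routes without matchers match every alert. The receivers section stays as in the question.

```yaml
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'            # fallback, used only if no child route matches
  routes:
  - receiver: 'webhook'
    continue: true               # keep going to the next sibling route
  - receiver: 'prometheusalert-email'
    continue: true
  - receiver: 'prometheusalert-dd'   # last route, no continue needed
```

With this layout each alert is delivered to all three receivers; dropping continue: true from any entry stops evaluation at that route.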

qianglatiao · Answer 4 · Wed Jul 20 2022 17:58:02 GMT+0800 (China Standard Time)

Hello, are there any alerting rules for Oracle?

TarsCppCIDemo · Answer 5 · Wed Jul 20 2022 18:18:54 GMT+0800 (China Standard Time)

No, I have not used the Oracle plugin.

running-db · Answer 6 · Sun Nov 05 2023 17:14:55 GMT+0800 (China Standard Time)

A question about the following:

This site has many Prometheus alerting rule examples: https://awesome-prometheus-alerts.grep.to/

# Available memory on CentOS 6 and 7
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)

An example Prometheus rules file. The level label selects the notification method; level and kind together are used for alert inhibition.

groups:
- name: node-cpu
  rules:
  # Number of CPU cores
  - record: instance:node_cpus:count
    expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
  # Per-CPU utilization
  - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
    expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
  # Overall CPU utilization
  - record: instance:node_cpu_utilization:ratio
    expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)

  - alert: CPU usage above 88%
    expr: instance:node_cpu_utilization:ratio * 100 > 88
    for: 5m
    labels:
      severity: critical
      level: 3
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 88%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"

  - alert: CPU usage above 93%
    expr: instance:node_cpu_utilization:ratio * 100 > 93
    for: 2m
    labels:
      severity: emergency
      level: 4
      kind: CpuUsage
    annotations:
      summary: "CPU usage above 93%"
      description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}%"
      wxurl: "webhook1, webhook2"
      mobile: "13xxx, 15xxx"

  - alert: CPU load above core count
    expr: node_load5 > instance:node_cpus:count
    for: 5m
    labels:
      severity: warning
      level: 2
      kind: CpuLoad
    annotations:
      summary: "CPU load above core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"

  - alert: CPU load above 2x core count
    expr: node_load5 > (instance:node_cpus:count * 2)
    for: 2m
    labels:
      severity: critical
      level: 3
      kind: CpuLoad
    annotations:
      summary: "CPU load above 2x core count"
      description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      wxurl: "webhook1, webhook2"

To fire or suppress alerts during specific hours, see: https://www.robustperception.io/combining-alert-conditions

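groups:
- name: specific time window
  rules:
  - alert: No alerts between 00:00 and 06:00
    # Note: hour() is evaluated in UTC. 00:00-06:00 in UTC+8 is 16:00-22:00 UTC.
    expr: <your PromQL expression> and on() (hour() < 16 or hour() >= 22)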

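How is this used? I could not find it in the documentation. Where is the address that webhook1 stands for configured: in app.conf? Or should the full addresses be written directly in this field?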

wxurl: "webhook1, webhook2"
mobile: "13xxx, 15xxx"

Leslie Zhang · Answer 7 · Mon Nov 13 2023 10:53:59 GMT+0800 (China Standard Time)

@running-db
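Write the multiple addresses directly in that field; the documentation covers this. A later release also added alert groups: you can define alert groups in app.conf and then list one or more of them in the rules.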
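
Following that answer, the annotations of a rule might look like the sketch below, with each full webhook address written out and separated by commas. The URLs use the WeCom group robot address format with placeholder keys (KEY_ONE and KEY_TWO are hypothetical); whether whitespace after the comma is tolerated is not confirmed here, so the sketch omits it.

```yaml
    annotations:
      summary: "CPU usage above 93%"
      # Full webhook addresses, comma-separated (keys are placeholders)
      wxurl: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=KEY_ONE,https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=KEY_TWO"
      # Phone numbers, comma-separated (elided as in the original post)
      mobile: "13xxx,15xxx"
```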