Prometheus alerting rule examples: contributions welcome
Zhang21 opened this issue
This site has many sample Prometheus alerting rules: https://awesome-prometheus-alerts.grep.to/
# Available memory on CentOS 6 and 7 (falls back to the sum when MemAvailable is not exported)
node_memory_MemAvailable_bytes or (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes + node_memory_Slab_bytes)
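As a rough sketch of how this expression might be used, the fallback can be stored as a recording rule and compared against total memory. The group name, rule names, the 10% threshold, and the 5m duration are illustrative assumptions, not values from this thread; node_memory_MemTotal_bytes is the standard node_exporter metric for total memory.

```yaml
groups:
  - name: node-memory
    rules:
      # Available memory: use MemAvailable when it is exported,
      # otherwise fall back to the sum of the individual counters.
      - record: instance:node_memory_available:bytes
        expr: |
          node_memory_MemAvailable_bytes
          or
          (node_memory_Buffers_bytes + node_memory_Cached_bytes
            + node_memory_MemFree_bytes + node_memory_Slab_bytes)
      # Hypothetical alert: less than 10% of memory available for 5 minutes.
      - alert: Available memory below 10%
        expr: instance:node_memory_available:bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 10%"
          description: "Host {{ $labels.instance }} has {{ $value | humanize }}% of memory available"
```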
An example Prometheus rules file. The level label selects the notification method, and the level and kind labels are used for alert inhibition.
groups:
  - name: node-cpu
    rules:
      # Number of CPU cores
      - record: instance:node_cpus:count
        expr: count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
      # Utilization of each CPU
      - record: instance_cpu:node_cpu_seconds_not_idle:rate1m
        expr: sum without (mode) (1 - rate(node_cpu_seconds_total{mode="idle"}[1m]))
      # Overall CPU utilization
      - record: instance:node_cpu_utilization:ratio
        expr: avg without (cpu) (instance_cpu:node_cpu_seconds_not_idle:rate1m)
      - alert: CPU usage above 88%
        expr: instance:node_cpu_utilization:ratio * 100 > 88
        for: 5m
        labels:
          severity: critical
          level: 3
          kind: CpuUsage
        annotations:
          summary: "CPU usage above 88%"
          description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
      - alert: CPU usage above 93%
        expr: instance:node_cpu_utilization:ratio * 100 > 93
        for: 2m
        labels:
          severity: emergency
          level: 4
          kind: CpuUsage
        annotations:
          summary: "CPU usage above 93%"
          description: "CPU usage on host {{ $labels.hostname }} is {{ $value | humanize }}"
          wxurl: "webhook1, webhook2"
          mobile: "13xxx, 15xxx"
      - alert: CPU load above core count
        expr: node_load5 > instance:node_cpus:count
        for: 5m
        labels:
          severity: warning
          level: 2
          kind: CpuLoad
        annotations:
          summary: "CPU load above core count"
          description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
      - alert: CPU load above 2x core count
        expr: node_load1 > (instance:node_cpus:count * 2)
        for: 4m
        labels:
          severity: critical
          level: 3
          kind: CpuLoad
        annotations:
          summary: "CPU load above 2x core count"
          description: "CPU load on host {{ $labels.hostname }} is {{ $value }}"
          alertgroup: ops
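One way the level and kind labels could drive inhibition is an Alertmanager inhibit_rules entry like the sketch below. It assumes Alertmanager 0.22 or later (for source_matchers and target_matchers) and that every alert carries an instance label. The matchers are an illustration, not the original poster's configuration.

```yaml
inhibit_rules:
  # While a level 4 alert is firing, mute the level 3 alert
  # of the same kind on the same instance.
  - source_matchers:
      - level="4"
    target_matchers:
      - level="3"
    equal: ['instance', 'kind']
```

With the rules above, this would mute "CPU usage above 88%" on a host while "CPU usage above 93%" is firing for it, provided both alerts carry the same kind value.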
To fire or suppress alerts only during specific hours, see: https://www.robustperception.io/combining-alert-conditions
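groups:
  - name: specific time range
    rules:
      - alert: No alerts between 00:00 and 06:00
        # Note: hour() is evaluated in UTC. Assuming local time is UTC+8,
        # 00:00 to 06:00 local is 16:00 to 22:00 UTC.
        expr: <PromQL expression> and on() (hour() < 16 or hour() >= 22)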
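For illustration, the time filter can be attached to a concrete expression. The sketch below reuses node_load5 and the instance:node_cpus:count recording rule from the example earlier in this thread, and assumes the quiet window is 16:00 to 22:00 UTC (00:00 to 06:00 at UTC+8). The group and alert names are made up.

```yaml
groups:
  - name: time-window-examples
    rules:
      # Suppress the alert during the quiet window.
      # "hour() >= 16 < 22" keeps the sample only when 16 <= hour < 22,
      # and "unless on()" drops every alert while that sample exists.
      - alert: CPU load above core count outside quiet hours
        expr: (node_load5 > instance:node_cpus:count) unless on() (hour() >= 16 < 22)
        for: 5m
      # Opposite case: fire only inside the window.
      - alert: CPU load above core count during quiet hours
        expr: (node_load5 > instance:node_cpus:count) and on() (hour() >= 16 < 22)
        for: 5m
```

Chained comparisons act as successive filters, so they work for a "between" test but not for an "outside" test. A condition such as "before 16 or after 22" needs an explicit or between two comparisons.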
Hi! I'd like to change wxurl to emailurl here. Can I write it as emailurl: url?
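Hello, with the custom template approach, how can a single alert be sent to several channels at once (for example webhook, email, and SMS)? In my tests only one channel receives the alert. My Alertmanager test configuration is: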
global:
  resolve_timeout: 5m
route:
  group_by: ['gateway']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'webhook'
  routes:
    - receiver: 'prometheusalert-email'
    - receiver: 'prometheusalert-dd'
receivers:
  - name: 'webhook'
    webhook_configs:
  - name: 'prometheusalert-email'
    webhook_configs:
  - name: 'prometheusalert-dd'
    webhook_configs:
Add continue: true under receiver in each child route.
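A minimal sketch of what that could look like for the configuration in the question. Without continue, Alertmanager stops at the first child route that matches, so only that route's receiver is notified. The webhook URLs are placeholders, not real endpoints.

```yaml
route:
  group_by: ['gateway']
  receiver: 'webhook'          # fallback when no child route matches
  routes:
    - receiver: 'webhook'
      continue: true           # keep evaluating the following sibling routes
    - receiver: 'prometheusalert-email'
      continue: true
    - receiver: 'prometheusalert-dd'
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://example.invalid/webhook'   # placeholder
  - name: 'prometheusalert-email'
    webhook_configs:
      - url: 'http://example.invalid/email'     # placeholder
  - name: 'prometheusalert-dd'
    webhook_configs:
      - url: 'http://example.invalid/dd'        # placeholder
```

A child route without matchers matches every alert, so all three receivers are notified here. The top-level receiver alone is not enough, because it is used only when no child route matches.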
Hello, are there any alerting rules for Oracle?
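How is this used? I couldn't find it in the documentation. Where is the address that webhook1 stands for configured, in app.conf? Or should the full addresses be written directly in this field?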
wxurl: "webhook1, webhook2"
mobile: "13xxx, 15xxx"
@running-db
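Write the multiple addresses directly in the field; the documentation covers this. A later release added alert groups: you can define alert groups in app.conf and then reference one or more of them in your rules.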