rancher / opni

Multi Cluster Observability with AIOps

Home Page:https://opni.io

cortexadmin `GetRule` : context deadline exceeded when evaluating alerting status.

alexandreLamarre opened this issue

Cortexadmin GetRule

2023-09-28T19:51:17Z ERROR plugin.metrics.cortex-admin cortex/admin.go:810 failed with Get "https://cortex-ruler:8080/prometheus/api/v1/rules": context canceled {"request": "https://cortex-ruler:8080/prometheus/api/v1/rules"}

The root cause is likely duplicate metric registration, which makes a loaded rule fail to evaluate:

{"caller":"manager.go:677","err":"found duplicate series for the match group {instance=\"xx.yyy.zz.92:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"xx.yyy.zz.92:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-xx-yyy-zz-92.mydomain.com\", prometheus=\"opni/opni-prometheus-agent\", prometheus_replica=\"prom-agent-opni-prometheus-agent-0\", service=\"rancher-mon-me-cluster-k8s-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"xx.yyy.zz.92:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"ip-xx-yyy-zz-92.mydomain.com\", prometheus=\"opni/opni-prometheus-agent\", prometheus_replica=\"prom-agent-opni-prometheus-agent-0\", service=\"opni-kube-prometheus-stack-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side","file":"/rules/f8985de4-c040-40e9-9df6-9814c5582185/synced","group":"kubelet.rules","index":0,"level":"warn","msg":"Evaluating rule failed","name":"node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile","rule":"record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m]))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"})\nlabels:\n  quantile: \"0.99\"\n","ts":"2023-09-28T17:40:34.900657379Z","user":"f8985de4-c040-40e9-9df6-9814c5582185"}
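
To make the error above easier to read, here is a minimal Python sketch of the uniqueness check that `on (cluster, instance) group_left (node)` implies: each match group may contain at most one series on the right-hand side. The helper names (`find_duplicate_match_groups`, `differing_labels`) are hypothetical and the label sets are trimmed copies of the two `kubelet_node_name` series from the log. This is not the Prometheus or Cortex implementation, only an illustration of why the two series collide.

```python
from collections import defaultdict


def find_duplicate_match_groups(series, on_labels):
    """Group label sets by the `on (...)` labels and keep groups with more than one series.

    A missing label is treated as an empty string, which is why the log
    reports the match group as {instance="..."} with no `cluster` label.
    """
    groups = defaultdict(list)
    for labels in series:
        key = tuple((name, labels.get(name, "")) for name in on_labels)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


def differing_labels(members):
    """Return the label names whose values differ between the colliding series."""
    names = set().union(*(m.keys() for m in members))
    return sorted(n for n in names if len({m.get(n, "") for m in members}) > 1)


# Trimmed copies of the two right-hand-side series from the log line above.
right_hand_side = [
    {
        "__name__": "kubelet_node_name",
        "instance": "xx.yyy.zz.92:10250",
        "job": "kubelet",
        "node": "ip-xx-yyy-zz-92.mydomain.com",
        "service": "rancher-mon-me-cluster-k8s-kubelet",
    },
    {
        "__name__": "kubelet_node_name",
        "instance": "xx.yyy.zz.92:10250",
        "job": "kubelet",
        "node": "ip-xx-yyy-zz-92.mydomain.com",
        "service": "opni-kube-prometheus-stack-kubelet",
    },
]

duplicates = find_duplicate_match_groups(right_hand_side, ["cluster", "instance"])
for key, members in duplicates.items():
    print("match group:", dict(key))
    print("labels that differ:", differing_labels(members))
```

In this sketch the only differing label is `service`, which is consistent with the same kubelet endpoint being scraped through two different Services.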

This failure then propagates to the alerting sync tasks:

2023-09-28T19:51:26Z ERROR plugin.alerting alerting/admin.go:541  ran 3/4 tasks successfully context deadline exceeded {"action": "runSyncTasks"}
2023-09-28T19:51:26Z ERROR plugin.alerting alerting/admin.go:565 failed to successfully run all alerting sync tasks : context deadline exceeded

This in turn could be causing #1719.

This seems to be resolved by #1563, but I'll keep this open in case it pops up again.