[bug] CPU dashboard can report negative values
uhthomas opened this issue · comments
Hi @uhthomas,
Thank you for opening this issue.
The values are quite different. Did you compare them with other system tools or with the metrics from the metrics-server (kubectl top)?
I will need to run some tests before approving your PR.
Hi,
I also read #80 and saw that the values differ between #80 and the main branch. Based on this, I did some research into how other parties calculate CPU usage:
Dashboard to test on your own
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__elements": {},
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "10.2.2"
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "table",
"name": "Table",
"version": ""
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": null,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {
"align": "auto",
"cellOptions": {
"type": "auto"
},
"filterable": true,
"inspect": false
},
"decimals": 3,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percentunit"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Value"
},
"properties": [
{
"id": "unit"
}
]
}
]
},
"gridPos": {
"h": 23,
"w": 24,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"cellHeight": "sm",
"footer": {
"countRows": false,
"fields": "",
"reducer": [
"sum"
],
"show": false
},
"frameIndex": 0,
"showHeader": true
},
"pluginVersion": "10.2.2",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "avg(1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) by (instance) ",
"format": "table",
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "main"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "PR"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "avg(irate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by(instance)",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "node_exporter_full"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])))",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "Prometheus Alerts"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum by (instance) (rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\",mode!=\"steal\"}[$__rate_interval]))",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "kubernetes-mixin"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": " sum by (instance) (\n (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval])))\n / ignoring(cpu) group_left\n count without (cpu, mode) (node_cpu_seconds_total{mode=\"idle\"})\n )",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "node-mixin D"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "1 - avg by (instance) (\n sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval]))\n)",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "node-mixin R"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "sum by (instance) (sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval]))))",
"format": "table",
"hide": false,
"instant": true,
"legendFormat": "__auto",
"range": false,
"refId": "node-mixin A"
}
],
"title": "Panel Title",
"transformations": [
{
"id": "merge",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true
},
"indexByName": {
"Time": 0,
"Value #PR": 2,
"Value #Prometheus Alerts": 4,
"Value #kubernetes-mixin": 9,
"Value #main": 8,
"Value #node-mixin A": 5,
"Value #node-mixin D": 6,
"Value #node-mixin R": 7,
"Value #node_exporter_full": 3,
"instance": 1
},
"renameByName": {}
}
}
],
"type": "table"
}
],
"refresh": "",
"schemaVersion": 38,
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "CPU test",
"uid": "f551a6d1-ff6e-45b1-a7a0-84cf70124b75",
"version": 3,
"weekStart": ""
}
main branch
avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance)
PR
avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)
node_exporter full dashboard is using
avg(irate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by(instance)
Awesome Prometheus alerts:
sum by (cluster, instance) (avg by (mode, cluster, instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))
kubernetes-mixin
Note: 100% = 1 Core
sum by (cluster,instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval]))
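The "100% = 1 Core" note means the kubernetes-mixin sum is expressed in cores rather than normalised by core count. A toy Python sketch with made-up numbers, assuming a hypothetical 4-core node:

```python
# Per-core busy fraction on a hypothetical 4-core node, each core 50% busy.
# Numbers are illustrative only.
per_core_busy = [0.5, 0.5, 0.5, 0.5]

# kubernetes-mixin style: sum(rate(...)) is expressed in cores, so 100% = 1 core.
in_cores = sum(per_core_busy)               # 2.0, i.e. "200%" = 2 cores' worth

# Dividing by the core count instead gives a 0-100% per-node figure.
normalised = in_cores / len(per_core_busy)  # 0.5, i.e. 50% of the node
```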
node-mixin (official node_exporter)
recording rule
1 - avg without (cpu) (
sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))
)
alerting rule
sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[2m])))
sum by (instance) (
(1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval])))
/ ignoring(cpu) group_left
count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
)
I could only run a test with a small subset of nodes:
% k top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-opsstack-21479518-vmss000000 251m 13% 3573Mi 28%
aks-opsstack-21479518-vmss000001 600m 31% 7863Mi 63%
But the values from #80 are different compared to kubectl top.
Thanks for the comprehensive write-up @jkroepke! Given most dashboards use mode!="idle", it looks like this change is probably the right thing to do. It's also what Tigera recommends, as linked in the PR.
With respect to kubectl top, I think it's known that these values can differ. I believe it's just the different time intervals and the way the metrics are collected?
At least the queries from #81 give me incorrect results
PR
vs
# k top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
aks-opsstack-21479518-vmss000000 232m 12% 3619Mi 29%
aks-opsstack-21479518-vmss000001 613m 32% 7920Mi 64%
My earlier post was more of a brain dump of my findings. Maybe #81 is being interpreted wrongly; to solve @uhthomas's issue, I worked out some alternatives.
Since only he is seeing the issue, he would have to test some of the queries.
I am not sure your evaluation is fair. They are taking measurements at different points in time, which does not mean the query in #81 is incorrect. If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.
They are taking measurements at different points in time
Compared to kubectl top, yes.
But the queries in the dashboard are based on the same datapoints. The dashboard on the main branch shows me 30-35% CPU usage, which is far more than the 4.2% reported by #81. The CPU on the system has had a constant usage between 30% and 35% for hours; 4% is not possible. This is why I declare the query incorrect.
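The gap described here (30-35% vs 4.2%) is consistent with averaging over the non-idle mode series instead of summing them. A toy Python sketch with made-up numbers for a single core (node_exporter typically exposes eight modes; values here are illustrative, not measured):

```python
# Illustrative per-mode CPU rates for one core: 30% user, 70% idle, rest 0.
modes = {"user": 0.30, "nice": 0.0, "system": 0.0, "idle": 0.70,
         "iowait": 0.0, "irq": 0.0, "softirq": 0.0, "steal": 0.0}

# main branch: avg(1 - rate(...{mode="idle"})) by (instance)
main_branch = 1 - modes["idle"]             # 0.30 -> 30%

# #81: avg(rate(...{mode!="idle"})) by (instance)
# averages over the 7 non-idle series instead of summing them
non_idle = [v for m, v in modes.items() if m != "idle"]
pr_query = sum(non_idle) / len(non_idle)    # 0.30 / 7, roughly 4.3%
```

Under these assumptions the averaged form lands near the reported ~4%, while the idle-based form reports the full 30%.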
If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.
I would say 5 of the 8 queries are roughly correct. All values are based on the exact same datapoints from different instances.
@jkroepke I do see what you mean. If you read my previous comment, it may be best to calculate average CPU usage by core (avg(sum by (cpu) (...))). I can make this change in the PR and it should be more accurate.
The original query vs the query avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))):
The other option is to change the graph to measure different CPU modes, or even cores? That wasn't really its intent though I guess.
If you read my previous comment, it may be best to calculate average CPU usage by core ...
Yeah, on the "Averaged by all" dashboard, you do avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance).
This means an average over all CPU modes: if user is at 30%, system at 0% and iowait at 0%, you get 30/3 = 10% CPU usage.
The query was used on the node dashboard.
It looks much better now. 👍
FYI: While I do some research for the queries, I found prometheus/node_exporter#2194.
I created a separate issue for this: #86
This is an interesting discussion, I didn't have time to make some benchs, but It looks promising!
I tried the latest version and still see a big difference in the resulting values on my side (~3x).
I will need to deep dive to make sure we get it right (most probably in January).
@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.
Yes it was the latest, CPU usage is 3x higher on the new version.
I'll need to check/compare to find which query is the closest to the reality.
Would you be able to attach a screenshot? The amended query should be more accurate.
As you can see, the values are quite different (at least on my side).
Comparing the results with trusted system tools or software can help I think.
I'm pretty sure I did that a long time ago, and it looked good to me, but maybe it's wrong...
We should definitely take the time to get this right.
PS: I don't think I will have time to look further before January 🎄 🥳
Thanks for your help @dotdc - that is interesting. I would be eager to see the individual values for the different modes on your cores. I wonder if the system is busy in the other idle-like states, iowait and steal?
Yes, exactly, but maybe with distinct colours for the values?
Thanks, you too!
This query could possibly be helpful. If a lot of CPU time is spent on steal or iowait, it would make sense that comparing against just idle produces a wildly different graph.
avg by (mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[$__rate_interval]))
I have a feeling the new query is a more accurate representation of actual CPU usage - but the graphs shown here look very suspect.
The same graph on my cluster also shows discrepancies (as expected), but not to such huge degrees.
(new on bottom)
Please also see these dashboards side-by-side. The first one is the original, the second matches everything but "idle" and the third one is the current query which matches everything but "idle", "iowait" and "steal". The final graph shows CPU usage across the whole cluster by namespace. There is a clear discrepancy, and the third graph seems the most accurate to me.
For context, there are 20 allocatable cpus on the cluster. I do not see how 50% utilisation could ever make sense.
The following is the same as the original image, but with stacked cpu usage to demonstrate that a value of 50% is unrealistic.
Hi @uhthomas,
I made a limited number of additional tests this morning.
Your query is good on the nodes dashboard, but the differences from my previous screenshots remain in the global view.
I've managed to get closer to your values by dividing the result by the number of nodes.
avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval]))) / count(count by (node) (kube_node_info{cluster="$cluster"}))
This should work for clusters that have homogeneous node flavors across node pools, but I have concerns about clusters with heterogeneous node pools/flavors.
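The heterogeneity concern can be illustrated numerically: dividing by the node count implicitly weights every node equally, which only matches the core-weighted cluster utilisation when all nodes have the same number of cores. A toy sketch with made-up node names and numbers:

```python
# Hypothetical heterogeneous cluster: a fully busy 4-core node
# and an idle 16-core node. Numbers are illustrative only.
nodes = {
    "small": {"cores": 4,  "busy_cores": 4.0},   # 100% busy
    "large": {"cores": 16, "busy_cores": 0.0},   # idle
}

# Core-weighted cluster utilisation:
true_util = (sum(n["busy_cores"] for n in nodes.values())
             / sum(n["cores"] for n in nodes.values()))       # 4/20 = 0.20

# Dividing an average per-node figure by the node count weights
# each node equally, regardless of how many cores it has:
naive_util = (sum(n["busy_cores"] / n["cores"] for n in nodes.values())
              / len(nodes))                                    # (1.0 + 0.0)/2 = 0.50
```

With identical node sizes the two figures coincide; with mixed sizes the node-count division over- or under-reports, as suspected above.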
Could you double-check this on your setup?
Also, do you have a cluster with different node flavors to see how this behaves?
@dotdc I am currently running a single-node Kubernetes cluster, so I was not aware of this limitation. I imagine what's happening here is that it should be sum by (node, cpu). I can fix this when I get back later, which should resolve the issue you're seeing 😄
Would you be able to test it for me in the meantime?
🎉 This issue has been resolved in version 1.1.0 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀