dotdc / grafana-dashboards-kubernetes

A set of modern Grafana dashboards for Kubernetes.

[bug] CPU dashboard can report negative values

uhthomas opened this issue · comments

Describe the bug

image

How to reproduce?

I don't know

Expected behavior

The dashboard should not produce negative CPU usage values.

Additional context

I adjusted some resource limits, which caused some pods to restart.

I don't really understand why.

image

Looks normal when switching the query to avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

image

versus

image
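For what it's worth, one possible cause worth ruling out: rate() extrapolation on the idle counter can briefly report slightly more than one idle second per second per core, which would make 1 - rate(...) dip below zero. A diagnostic query along these lines (just a sketch, assuming a standard node_exporter setup) should show whether that ever happens:

max by (instance, cpu) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) > 1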

Hi @uhthomas,
Thank you for opening this issue.
The values are quite different; did you compare them with other system tools or with the metrics from the metrics-server (kubectl top)?

I will need to run some tests before approving your PR.
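For a rough cross-check, something like the following can be compared side by side; the PromQL returns CPU cores in use per node, which is roughly what the CPU(cores) column of kubectl top reports (a sketch using a fixed 5m window, since $__rate_interval only resolves inside Grafana):

kubectl top node

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))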

Hi,

I also read #80 and saw that the values differ between #80 and the main branch. Based on this, I did some research into how other projects calculate CPU usage:

image
Dashboard to test on your own
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.2.2"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "table",
      "name": "Table",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "filterable": true,
            "inspect": false
          },
          "decimals": 3,
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Value"
            },
            "properties": [
              {
                "id": "unit"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 23,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": false
        },
        "frameIndex": 0,
        "showHeader": true
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) by (instance) ",
          "format": "table",
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "main"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "PR"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(irate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node_exporter_full"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "Prometheus Alerts"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\",mode!=\"steal\"}[$__rate_interval]))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "kubernetes-mixin"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "          sum by (instance) (\n            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval])))\n          / ignoring(cpu) group_left\n            count without (cpu, mode) (node_cpu_seconds_total{mode=\"idle\"})\n          )",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin D"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "1 - avg by (instance) (\n sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval]))\n)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin R"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval]))))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin A"
        }
      ],
      "title": "Panel Title",
      "transformations": [
        {
          "id": "merge",
          "options": {}
        },
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true
            },
            "indexByName": {
              "Time": 0,
              "Value #PR": 2,
              "Value #Prometheus Alerts": 4,
              "Value #kubernetes-mixin": 9,
              "Value #main": 8,
              "Value #node-mixin A": 5,
              "Value #node-mixin D": 6,
              "Value #node-mixin R": 7,
              "Value #node_exporter_full": 3,
              "instance": 1
            },
            "renameByName": {}
          }
        }
      ],
      "type": "table"
    }
  ],
  "refresh": "",
  "schemaVersion": 38,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "CPU test",
  "uid": "f551a6d1-ff6e-45b1-a7a0-84cf70124b75",
  "version": 3,
  "weekStart": ""
}

main branch

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance) 

PR

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

node_exporter full dashboard is using

avg(irate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by(instance)

Awesome Prometheus alerts:

sum by (cluster, instance) (avg by (mode, cluster, instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

kubernetes-mixin

Note: 100% = 1 Core

sum by (cluster,instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval]))

node-mixin (official node_exporter)

recording rule

1 - avg without (cpu) (
 sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))
)

alerting rule

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[2m])))

dashboard

          sum by (instance) (
            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval])))
          / ignoring(cpu) group_left
            count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
          )

I could only run a test with a small subset of nodes:

image
% k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   251m         13%    3573Mi          28%       
aks-opsstack-21479518-vmss000001   600m         31%    7863Mi          63% 

But the values from #80 are different compared to kubectl top.


Thanks for the comprehensive write-up @jkroepke! Given that most dashboards use mode!="idle", it looks like this change is probably the right thing to do, then? It's also what is recommended by Tigera, as linked in the PR.

With respect to kubectl top, I think it's known that these values can differ. I believe it's just the different time intervals and the way the metrics are collected?

At least the queries from #81 give me incorrect results

PR

image

vs

# k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   232m         12%    3619Mi          29%       
aks-opsstack-21479518-vmss000001   613m         32%    7920Mi          64%   

In my earlier post, I did more of a brain dump of my findings. Maybe #81 is interpreted wrongly, and to solve the issue from @uhthomas I figured out some alternatives.

Since only he has the issue, he would have to test some queries.

I am not sure your evaluation is fair. The measurements are taken at different points in time, which does not mean the query in #81 is incorrect. If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I think the numbers look a bit weird because they are averaged, maybe not properly. The usage across cores varies quite widely:

image

Averaged by core:

image

Averaged by all:

image

The measurements are taken at different points in time

Compared to kubectl top, yes.

But the queries from the dashboard are based on the same data points. The dashboard on the main branch shows me 30%-35% CPU usage, which is far more than the 4.2% reported by #81. The CPU on the system has had a constant usage between 30%-35% for hours; 4% is not possible. This is why I consider the query incorrect.

If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I would say 5 of the 8 queries are roughly correct. All values are based on the exact same data points from the different instances.

@jkroepke I do see what you mean. If you read my previous comment, it may be best to calculate the average CPU usage by core (avg(sum by (cpu) (...))). I can make this change in the PR and it should be more accurate.

The original query vs the query avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))):

image

The other option is to change the graph to measure different CPU modes, or even cores? That wasn't really its intent though I guess.
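For reference, a per-mode or per-core breakdown could look roughly like this (just a sketch, not what the PR currently proposes):

sum by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))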

If you read my previous comment, it may be best to calculate the average CPU usage by core ...

Yeah, in the "Averaged by all" graph, you use avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

This means an average over all CPU modes: if you have user at 30%, system at 0% and iowait at 0%, you get 30/3 = 10% CPU usage.

The query was used on the node dashboard.
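To make the difference concrete: the first query below averages over every (cpu, mode) series, so the result is divided by the number of modes as well as the number of cores, while the second sums the modes per core first and only then averages across cores (a sketch based on the queries above):

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

avg by (instance) (sum by (instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))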

@jkroepke Agree. Would you be able to test the most recent changes in #81?

It looks much better now. 👍

FYI: While doing some research for the queries, I found prometheus/node_exporter#2194.

I created a separate issue for this: #86

This is an interesting discussion. I didn't have time to run any benchmarks yet, but it looks promising!

I tried the latest version and still see a big difference in the resulting values on my side (roughly 3x).
I will need to dig deeper to make sure we get it right (most probably in January).

@dotdc please ensure that you are using the latest version of #81 because some queries were adjusted.

@dotdc please ensure that you are using the latest version of #81 because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.
I'll need to check and compare to find which query is closest to reality.

Would you be able to attach a screenshot? The amended query should be more accurate.

image

As you can see, the values are quite different (at least on my side).
Comparing the results with trusted system tools or software can help, I think.
I'm pretty sure I did that a long time ago, and it looked good to me, but maybe it's wrong...

We should definitely take the time to get this right.

PS: I don't think I will have time to look further before January 🎄 🥳

Thanks for your help @dotdc, that is interesting. I would be eager to see the individual values for the different modes on your cores. I wonder if the system is spending time in the other idle-like modes, iowait and steal?

Something like this?

image

Yes, exactly, but maybe with distinct colours for the values?

This is the best I can do right now:

image

Again, appreciate your help. Enjoy your winter break @dotdc! 😄

Thanks, you too!

This is the best I can do right now:

image

Could you please redo this without excluding iowait and steal? Or better: the same graph, but with only those two modes. Are their values noticeable?

This query could possibly be helpful. If a lot of CPU time is spent on steal or iowait, then it would make sense that comparing against just idle would produce a wildly different graph.

avg by (mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[$__rate_interval]))

image

I have a feeling the new query is a more accurate representation of actual CPU usage - but the graphs shown here look very suspect.

The same graph on my cluster also shows discrepancies (as expected), but not to such huge degrees.

(new on bottom)

image

Please also see these dashboards side-by-side. The first one is the original, the second matches everything but "idle", and the third one is the current query, which matches everything but "idle", "iowait" and "steal". The final graph shows CPU usage across the whole cluster by namespace. There is a clear discrepancy, and the third graph seems the most accurate to me.

image

For context, there are 20 allocatable CPUs on the cluster. I do not see how 50% utilisation could ever make sense.

image

The following is the same as the original image, but with stacked CPU usage to demonstrate that a value of 50% is unrealistic.

image

This final image may also be helpful. It shows there was a spike in iowait, which is not currently accounted for.

image

Hi @uhthomas,

I ran a limited number of additional tests this morning.
Your query looks good on the nodes dashboard, but the differences from my previous screenshots remain on the global view.

I've managed to get closer to your values by dividing the result by the number of nodes.

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval]))) / count(count by (node) (kube_node_info{cluster="$cluster"}))

This should work for clusters that have homogeneous node flavors across node pools, but I have concerns about clusters with heterogeneous node pools/flavors.

Could you double-check this on your setup?
Also, do you have a cluster with different node flavors to see how this behaves?

Screenshot:
image

@dotdc I am currently running a single-node Kubernetes cluster, so I was not aware of this limitation. I imagine what's happening here is that it should be sum by (node, cpu). I can fix this when I get back later, which should resolve the issue you're seeing 😄

Would you be able to test it for me in the meantime?
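A rough sketch of what that grouping might look like (the actual PR may differ, and whether the per-node label is node or instance depends on the relabeling in place):

avg(sum by (node, cpu) (rate(node_cpu_seconds_total{mode!~"idle|iowait|steal", cluster="$cluster"}[$__rate_interval])))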

I've updated the PR @dotdc

This was great, thank you both @uhthomas & @jkroepke !

🎉 This issue has been resolved in version 1.1.0 🎉

The release is available on GitHub.

Your semantic-release bot 📦🚀