dotdc / grafana-dashboards-kubernetes

A set of modern Grafana dashboards for Kubernetes.

[bug] CPU dashboard can report negative values

uhthomas opened this issue · comments

Describe the bug

image

How to reproduce?

I don't know

Expected behavior

The dashboard should not produce negative CPU usage values.

Additional context

I adjusted some resource limits, which caused some pods to restart.

I don't really understand why.

image

Looks normal when switching the query to avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

image

versus

image
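For what it's worth, one possible cause worth ruling out: rate() extrapolation on the idle counter can briefly report slightly more than one idle second per second per core, which would make 1 - rate(...) dip below zero. A diagnostic query along these lines (just a sketch, assuming a standard node_exporter setup) should show whether that ever happens:

max by (instance, cpu) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) > 1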

Hi @uhthomas,
Thank you for opening this issue.
The values are quite different; did you compare them with other system tools or with the metrics from the metrics-server (kubectl top)?

I will need to run some tests before approving your PR.
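For a rough cross-check, something like the following can be compared side by side; the PromQL returns CPU cores in use per node, which is roughly what the CPU(cores) column of kubectl top reports (a sketch using a fixed 5m window, since $__rate_interval only resolves inside Grafana):

kubectl top node

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))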

Hi,

I also read #80 and saw that the values differ between #80 and the main branch. Based on this, I did some research into how other projects calculate CPU usage:

image
Dashboard to test on your own
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.2.2"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "table",
      "name": "Table",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "filterable": true,
            "inspect": false
          },
          "decimals": 3,
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Value"
            },
            "properties": [
              {
                "id": "unit"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 23,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": false
        },
        "frameIndex": 0,
        "showHeader": true
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) by (instance) ",
          "format": "table",
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "main"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "PR"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(irate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node_exporter_full"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "Prometheus Alerts"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\",mode!=\"steal\"}[$__rate_interval]))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "kubernetes-mixin"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "          sum by (instance) (\n            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval])))\n          / ignoring(cpu) group_left\n            count without (cpu, mode) (node_cpu_seconds_total{mode=\"idle\"})\n          )",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin D"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "1 - avg by (instance) (\n sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval]))\n)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin R"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval]))))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin A"
        }
      ],
      "title": "Panel Title",
      "transformations": [
        {
          "id": "merge",
          "options": {}
        },
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true
            },
            "indexByName": {
              "Time": 0,
              "Value #PR": 2,
              "Value #Prometheus Alerts": 4,
              "Value #kubernetes-mixin": 9,
              "Value #main": 8,
              "Value #node-mixin A": 5,
              "Value #node-mixin D": 6,
              "Value #node-mixin R": 7,
              "Value #node_exporter_full": 3,
              "instance": 1
            },
            "renameByName": {}
          }
        }
      ],
      "type": "table"
    }
  ],
  "refresh": "",
  "schemaVersion": 38,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "CPU test",
  "uid": "f551a6d1-ff6e-45b1-a7a0-84cf70124b75",
  "version": 3,
  "weekStart": ""
}

main branch

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance) 

PR

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

node_exporter full dashboard is using

avg(irate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by(instance)

Awesome Prometheus alerts:

sum by (cluster, instance) (avg by (mode, cluster, instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

kubernetes-mixin

Note: 100% = 1 Core

sum by (cluster,instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval]))

node-mixin (official node_exporter)

recording rule

1 - avg without (cpu) (
 sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))
)

alerting rule

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[2m])))

dashboard

          sum by (instance) (
            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval])))
          / ignoring(cpu) group_left
            count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
          )

I could only run a test with a small subset of nodes:

image
% k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   251m         13%    3573Mi          28%       
aks-opsstack-21479518-vmss000001   600m         31%    7863Mi          63% 

But the values from #80 are different compared to kubectl top.


Thanks for the comprehensive write-up @jkroepke! Given that most dashboards use mode!="idle", it looks like this change is probably the right thing to do, then? It's also what is recommended by Tigera, as linked in the PR.

With respect to kubectl top, I think it's known that these values can differ. I believe it's just the different time intervals and the way the metrics are collected?

At least the queries from #81 give me incorrect results

PR

image

vs

# k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   232m         12%    3619Mi          29%       
aks-opsstack-21479518-vmss000001   613m         32%    7920Mi          64%   

In my earlier post, I did more of a brain dump of my findings. Maybe #81 is interpreted wrongly, and to solve the issue from @uhthomas I figured out some alternatives.

Since only he has the issue, he would have to test some queries.

I am not sure your evaluation is fair. The measurements are taken at different points in time, which does not mean the query in #81 is incorrect. If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I think the numbers look a bit weird because they are averaged, maybe not properly. The usage across cores varies quite widely:

image

Averaged by core:

image

Averaged by all:

image

The measurements are taken at different points in time

Compared to kubectl top, yes.

But the queries from the dashboard are based on the same data points. The dashboard on the main branch shows me 30%-35% CPU usage, which is far more than the 4.2% reported by #81. The CPU on the system has had a constant usage between 30%-35% for hours; 4% is not possible. This is why I consider the query incorrect.

If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I would say 5 of the 8 queries are roughly correct. All values are based on the exact same data points from the different instances.

@jkroepke I do see what you mean. If you read my previous comment, it may be best to calculate the average CPU usage by core (avg(sum by (cpu) (...))). I can make this change in the PR and it should be more accurate.

The original query vs the query avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))):

image

The other option is to change the graph to measure different CPU modes, or even cores? That wasn't really its intent though I guess.
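For reference, a per-mode or per-core breakdown could look roughly like this (just a sketch, not what the PR currently proposes):

sum by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))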

If you read my previous comment, it may be best to calculate the average CPU usage by core ...

Yeah, in the "Averaged by all" graph, you use avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

This means an average over all CPU modes: if you have user at 30%, system at 0% and iowait at 0%, you get 30/3 = 10% CPU usage.

The query was used on the node dashboard.
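To make the difference concrete: the first query below averages over every (cpu, mode) series, so the result is divided by the number of modes as well as the number of cores, while the second sums the modes per core first and only then averages across cores (a sketch based on the queries above):

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

avg by (instance) (sum by (instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))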

@jkroepke Agree. Would you be able to test the most recent changes in #81?

It looks much better now. 👍

FYI: While doing some research for the queries, I found prometheus/node_exporter#2194.

I created a separate issue for this: #86

This is an interesting discussion. I didn't have time to run any benchmarks yet, but it looks promising!

I tried the latest version and still see a big difference in the resulting values on my side (roughly 3x).
I will need to dig deeper to make sure we get it right (most probably in January).

@dotdc please ensure that you are using the latest version of #81 because some queries were adjusted.

@dotdc please ensure that you are using the latest version of #81 because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.
I'll need to check and compare to find which query is closest to reality.

Would you be able to attach a screenshot? The amended query should be more accurate.

image

As you can see, the values are quite different (at least on my side).
Comparing the results with trusted system tools or software can help, I think.
I'm pretty sure I did that a long time ago, and it looked good to me, but maybe it's wrong...

We should definitely take the time to get this right.

PS: I don't think I will have time to look further before January 🎄 🥳

Thanks for your help @dotdc, that is interesting. I would be eager to see the individual values for the different modes on your cores. I wonder if the system is spending time in the other idle-like modes, iowait and steal?

Something like this?

image

Yes, exactly, but maybe with distinct colours for the values?

This is the best I can do right now:

image

Again, appreciate your help. Enjoy your winter break @dotdc! 😄

Thanks, you too!

This is the best I can do right now:

image

Could you please redo this without excluding iowait and steal? Or better: the same graph, but with only those two modes. Are their values noticeable?

This query could possibly be helpful. If a lot of CPU time is spent on steal or iowait, then it would make sense that comparing against just idle would produce a wildly different graph.

avg by (mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[$__rate_interval]))

image

I have a feeling the new query is a more accurate representation of actual CPU usage - but the graphs shown here look very suspect.

The same graph on my cluster also shows discrepancies (as expected), but not to such huge degrees.

(new on bottom)

image

Please also see these dashboards side-by-side. The first one is the original, the second matches everything but "idle", and the third one is the current query, which matches everything but "idle", "iowait" and "steal". The final graph shows CPU usage across the whole cluster by namespace. There is a clear discrepancy, and the third graph seems the most accurate to me.

image

For context, there are 20 allocatable CPUs on the cluster. I do not see how 50% utilisation could ever make sense.

image

The following is the same as the original image, but with stacked CPU usage to demonstrate that a value of 50% is unrealistic.

image

This final image may also be helpful. It shows there was a spike in iowait, which is not currently accounted for.

image

Hi @uhthomas,

I ran a limited number of additional tests this morning.
Your query looks good on the nodes dashboard, but the differences from my previous screenshots remain on the global view.

I've managed to get closer to your values by dividing the result by the number of nodes.

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval]))) / count(count by (node) (kube_node_info{cluster="$cluster"}))

This should work for clusters that have homogeneous node flavors across node pools, but I have concerns about clusters with heterogeneous node pools/flavors.

Could you double-check this on your setup?
Also, do you have a cluster with different node flavors to see how this behaves?

Screenshot:
image

@dotdc I am currently running a single-node Kubernetes cluster, so I was not aware of this limitation. I imagine what's happening here is that it should be sum by (node, cpu). I can fix this when I get back later, which should resolve the issue you're seeing 😄

Would you be able to test it for me in the meantime?
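A rough sketch of what that grouping might look like (the actual PR may differ, and whether the per-node label is node or instance depends on the relabeling in place):

avg(sum by (node, cpu) (rate(node_cpu_seconds_total{mode!~"idle|iowait|steal", cluster="$cluster"}[$__rate_interval])))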

I've updated the PR @dotdc

This was great, thank you both @uhthomas & @jkroepke !

🎉 This issue has been resolved in version 1.1.0 🎉

The release is available on GitHub.

Your semantic-release bot 📦🚀