kubernetes / kubernetes

What happened?

The CPU utilization metrics from Kubelet summary API CPUStats.UsageCoreNanoSeconds is not accurate on Windows. The value is only 1/10 of the real value which I have other VM level metrics as a proof.

summary API CPU utilization metric

VM CPU utilization metric

What did you expect to happen?

The CPUStats.UsageCoreNanoSeconds should be accurate on Windows too.

How can we reproduce it (as minimally and precisely as possible)?

$ kubectl get --raw "/api/v1/nodes/<NODE>/proxy/stats/summary"

T1

  "cpu": {
   "time": "2023-12-06T21:23:48Z",
   "usageNanoCores": 8000000,
   "usageCoreNanoSeconds": 20985800000000
  },

T2

  "cpu": {
   "time": "2023-12-06T21:24:38Z",
   "usageNanoCores": 8000000,
   "usageCoreNanoSeconds": 20986320000000
  },

(20986320000000 - 20985800000000)ns / 50s = 0.01 s/s
This matches the magnitude of the metric chart.

Anything else we need to know?

This problem doesn't exist on Linux.

Call stack

metric is set here by kubelet -

kubernetes/pkg/kubelet/winstats/winstats.go

Lines 118 to 129 in 7fe31be

    
           func (c *StatsClient) createRootContainerInfo() (*cadvisorapiv2.ContainerInfo, error) { 
        
           	nodeMetrics, err := c.client.getNodeMetrics() 
        
           	if err != nil { 
        
           		return nil, err 
        
           	} 
        
           	var stats []*cadvisorapiv2.ContainerStats 
        
           	stats = append(stats, &cadvisorapiv2.ContainerStats{ 
        
           		Timestamp: nodeMetrics.timeStamp, 
        
           		Cpu: &cadvisorapi.CpuStats{ 
        
           			Usage: cadvisorapi.CpuUsage{ 
        
           				Total: nodeMetrics.cpuUsageCoreNanoSeconds,

the data collected by using windows perf counters -

kubernetes/pkg/kubelet/winstats/perfcounter_nodestats.go

Lines 206 to 212 in 7fe31be

    
           cpuValue, err := cpuCounter.getData() 
        
           cpuCores := ProcessorCount() 
        
           if err != nil { 
        
           	klog.ErrorS(err, "Unable to get cpu perf counter data") 
        
           	return 
        
           }

query here -

kubernetes/pkg/kubelet/winstats/perfcounters.go

Line 32 in 7fe31be

cpuQuery = "\\Processor(_Total)\\% Processor Time"

Kubernetes version

$ kubectl version
1.27

Cloud provider

GKE

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Could usageNanoCores be wrong?

This code caught my attention:

kubernetes/pkg/kubelet/winstats/perfcounter_nodestats.go

Lines 252 to 254 in 85684ec

    
           cachePeriodSeconds := uint64(defaultCachePeriod / time.Second) 
        
           perfCounterUpdatePeriodSeconds := uint64(perfCounterUpdatePeriod / time.Second) 
        
           cpuUsageNanoCores := ((p.cpuUsageCoreNanoSecondsCache.latestValue - p.cpuUsageCoreNanoSecondsCache.previousValue) * perfCounterUpdatePeriodSeconds) / cachePeriodSeconds

usageNanoCores is computed in cri_stats_provider.go, like this:

kubernetes/pkg/kubelet/stats/cri_stats_provider.go

Lines 792 to 797 in 26923b9

    
           nanoSeconds := newStats.Timestamp - cachedStats.Timestamp 
        
           if nanoSeconds <= 0 { 
        
           	return nil, fmt.Errorf("zero or negative interval (%v - %v)", newStats.Timestamp, cachedStats.Timestamp) 
        
           } 
        
           usageNanoCores := uint64(float64(newStats.UsageCoreNanoSeconds.Value-cachedStats.UsageCoreNanoSeconds.Value) / 
        
           	float64(nanoSeconds) * float64(time.Second/time.Nanosecond))

So wouldn't something like this be equivalent (just going by looking at the code)?

cpuUsageNanoCores := (p.cpuUsageCoreNanoSecondsCache.latestValue - p.cpuUsageCoreNanoSecondsCache.previousValue) / float64(perfCounterUpdatePeriod) * float64(time.Second/time.Nanosecond)

/sig windows

you can open discusion with https://github.com/kubernetes/community/tree/master/sig-windows

	func (c StatsClient) createRootContainerInfo() (cadvisorapiv2.ContainerInfo, error) {
	nodeMetrics, err := c.client.getNodeMetrics()
	if err != nil {
	return nil, err
	}

	var stats []*cadvisorapiv2.ContainerStats
	stats = append(stats, &cadvisorapiv2.ContainerStats{
	Timestamp: nodeMetrics.timeStamp,
	Cpu: &cadvisorapi.CpuStats{
	Usage: cadvisorapi.CpuUsage{
	Total: nodeMetrics.cpuUsageCoreNanoSeconds,

	cpuValue, err := cpuCounter.getData()
	cpuCores := ProcessorCount()
	if err != nil {
	klog.ErrorS(err, "Unable to get cpu perf counter data")
	return
	}

	cachePeriodSeconds := uint64(defaultCachePeriod / time.Second)
	perfCounterUpdatePeriodSeconds := uint64(perfCounterUpdatePeriod / time.Second)
	cpuUsageNanoCores := ((p.cpuUsageCoreNanoSecondsCache.latestValue - p.cpuUsageCoreNanoSecondsCache.previousValue) * perfCounterUpdatePeriodSeconds) / cachePeriodSeconds

	nanoSeconds := newStats.Timestamp - cachedStats.Timestamp
	if nanoSeconds <= 0 {
	return nil, fmt.Errorf("zero or negative interval (%v - %v)", newStats.Timestamp, cachedStats.Timestamp)
	}
	usageNanoCores := uint64(float64(newStats.UsageCoreNanoSeconds.Value-cachedStats.UsageCoreNanoSeconds.Value) /
	float64(nanoSeconds) * float64(time.Second/time.Nanosecond))

CPUStats.UsageCoreNanoSeconds is inaccurate on windows - 1/10 of the real value.