kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Reduce the set of metrics exposed by the kubelet

dashpole opened this issue

Background

As of 1.12, the kubelet exposes a number of metrics endpoints served directly from cAdvisor. These include:

  • The cAdvisor Prometheus metrics endpoint (/metrics/cadvisor)
  • The "direct" cAdvisor JSON endpoints (/stats/, /stats/container, /stats/{podName}/{containerName})
  • The machine info endpoint (/spec/)

The kubelet also exposes the summary API, which is not exposed directly by cAdvisor, but queries cAdvisor as one of its sources for metrics.
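
For reference, the summary API can be queried through the API server's node proxy. Below is a minimal client-go sketch, assuming in-cluster credentials and a placeholder node name; kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary" shows the same data:

	package main

	import (
		"context"
		"fmt"

		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/rest"
	)

	func main() {
		// Assumes in-cluster credentials with permission to proxy to nodes;
		// "node-1" is a placeholder node name.
		cfg, err := rest.InClusterConfig()
		if err != nil {
			panic(err)
		}
		client := kubernetes.NewForConfigOrDie(cfg)

		// GET /api/v1/nodes/node-1/proxy/stats/summary
		raw, err := client.CoreV1().RESTClient().Get().
			Resource("nodes").
			Name("node-1").
			SubResource("proxy").
			Suffix("stats/summary").
			DoRaw(context.TODO())
		if err != nil {
			panic(err)
		}
		fmt.Println(string(raw))
	}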

The Monitoring Architecture documentation describes the path for "core" metrics, and for "monitoring" metrics. The Core Metrics proposal describes the set of metrics that we consider core, and their uses. The motivation for the split architecture is:

  • To minimize the performance impact of stats collection for core metrics, allowing these to be collected more frequently
  • To make the monitoring pipeline replaceable and extensible.

Current kubelet metrics that are not included in core metrics

  • Pod and Node-level Network Metrics
  • Persistent Volume Metrics
  • Container-level (Nvidia) GPU Metrics
  • Node-Level RLimit Metrics
  • Misc Memory Metrics (e.g. PageFaults)
  • Container, Pod, and Node-level Inode metrics (for ephemeral storage)
  • Container, Pod, and Node-level DiskIO metrics (from cAdvisor)

Deprecating and removing the Summary API will require out-of-tree sources for each of these metrics. "Direct" cAdvisor endpoints are not often used, and have even been broken for multiple releases (#62544) without anyone raising an issue.

Working Items

  • [1.13] Introduce Kubelet pod-resources grpc endpoint; KEP: kubernetes/community#2454
  • [1.14] Introduce Kubelet Resource Metrics API
  • [1.15] Deprecate the "direct" cAdvisor API endpoints by adding and deprecating a --enable-cadvisor-json-endpoints flag
  • [1.18] Default the --enable-cadvisor-json-endpoints flag to disabled
  • [1.21] Remove the --enable-cadvisor-json-endpoints flag
  • [1.21] Transition Monitoring Server to Kubelet Resource Metrics API (requires 3 versions skew)
  • [TBD] Propose out-of-tree replacements for kubelet monitoring endpoints
  • [TBD] Deprecate the Summary API and cAdvisor Prometheus endpoints by adding and deprecating a --enable-container-monitoring-endpoints flag
  • [TBD+2] Remove "direct" cAdvisor API endpoints
  • [TBD+2] Default the --enable-container-monitoring-endpoints flag to disabled
  • [TBD+4] Remove the Summary API and cAdvisor Prometheus metrics, and remove the --enable-container-monitoring-endpoints flag.

Open Questions

  • Should the kubelet be a source for any monitoring metrics?
    • For example, metrics about the kubelet itself, or DiskIO metrics for empty-dir volumes (which are "owned" by the kubelet).
  • What will provide the metrics listed above, now that the kubelet no longer does?
    • cAdvisor can provide Network, RLimit, Misc Memory metrics, Inode metrics, and DiskIO metrics.
      • cAdvisor only works for some runtimes, but is a drop-in replacement for "direct" cAdvisor API endpoints
    • Container Runtimes can be a source for container-level Memory, Inode, Network and DiskIO metrics.
    • NVIDIA GPU metrics can be provided by a daemonset published by NVIDIA.
    • No source for Persistent Volume metrics?

/sig node
/sig instrumentation
/kind feature
/priority important-longterm
cc @kubernetes/sig-node-proposals @kubernetes/sig-instrumentation-misc

Shouldn't container/pod metrics be gathered from CRI instead of from cAdvisor? The same goes for info about inode/byte usage of the image store and container root filesystems.

That's strongly runtime-dependent, and we should not expect that cAdvisor will support all types of runtimes supported by CRI (e.g. VM-based runtimes like https://github.com/Mirantis/virtlet/).

cc @brancz

Introduce Core Metrics Kubelet API

Is there a proposal out for this, besides the design/scratch doc I put out a while ago on using Prometheus metrics for this purpose?

I don’t think there is anything else.

Shouldn't container/pod metrics be gathered from CRI instead of from cAdvisor? The same goes for info about inode/byte usage of the image store and container root filesystems.

@jellonek you are correct. That is why I list cAdvisor as "one of its sources for metrics". The other notable source is metrics from the runtime through CRI.
I added the container runtime as a source of metrics that the kubelet won't provide in the future.

I don't think we should introduce the core metrics API as described in the proposal you linked above. It's basically just summary 2.0 in terms of format, which makes it highly kube-specific. I'd much rather just have a stable Prometheus endpoint, as discussed previously. That way, tools that collect metrics are much less likely to have to learn a new format.

EDIT: too much speed reading

I don't think we should introduce the core metrics API as described in the proposal you linked above. It's basically just summary 2.0 in terms of format, which makes it highly kube-specific.

The format of the core metrics API is one of the non-goals listed in the proposal linked above. It just lists the contents of the API. We can debate format during the proposal process.
I intend for this to be an umbrella issue. I am trying to document current state, rough timeline, and gaps I see in terms of which metrics don't have a transition path to the new architecture.

Ah, apologies -- too much speed reading plus trying to recall from the last time I looked at things :-)

hi~ I would like to know how the "direct" cAdvisor API endpoints will be deprecated.
Does that mean first deprecating the direct cAdvisor port in the kubelet, and then removing the vendored cAdvisor code from the kubelet afterwards?

Removing the vendored cAdvisor would require making sure we have out-of-tree sources for all the metrics. Deprecating the "direct" cAdvisor API endpoints is mostly just about deprecating that piece of the API, I believe. Removing the rest will come later.

Should the kubelet be a source for any monitoring metrics?

  • For example, metrics about the kubelet itself, or DiskIO metrics for empty-dir volumes (which are "owned" by the kubelet).

Given that I've been unable to find an accurate source of Prometheus-formatted metrics for empty-dirs, I am an enthusiastic +1 to including these. (Does anyone know if these are exported elsewhere? The exported cadvisor disk usage metrics in 1.8.7 appear to exclude any overlays and are useless for monitoring real container disk usage.)

Given that I've been unable to find an accurate source of Prometheus-formatted metrics for empty-dirs

The kubelet's /metrics endpoint has disk usage metrics for volumes. cAdvisor's disk usage metrics just measure the writable layer of the container.
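
A rough sketch of pulling those series through the API server's node proxy, assuming a client-go clientset (restClient would typically be clientset.CoreV1().RESTClient()) and a placeholder node name:

	package volstats

	import (
		"context"
		"fmt"
		"strings"

		"k8s.io/client-go/rest"
	)

	// PrintVolumeStats dumps the kubelet's volume-related Prometheus series for a
	// node, fetched via the API server's node proxy.
	func PrintVolumeStats(ctx context.Context, restClient rest.Interface, node string) error {
		raw, err := restClient.Get().
			Resource("nodes").
			Name(node).
			SubResource("proxy").
			Suffix("metrics").
			DoRaw(ctx)
		if err != nil {
			return err
		}
		for _, line := range strings.Split(string(raw), "\n") {
			// e.g. kubelet_volume_stats_used_bytes{namespace="default",persistentvolumeclaim="data"} 1.2e+07
			if strings.HasPrefix(line, "kubelet_volume_stats_") {
				fmt.Println(line)
			}
		}
		return nil
	}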

@dashpole the linked file appears to only collect PVC stats, not emptyDir/local storage metrics.

@dashpole I think there is a small typo in the first point of the description of this issue. Shouldn't that be /metrics/cadvisor? https://github.com/kubernetes/kubernetes/blob/v1.12.0/pkg/kubelet/server/server.go#L71

Edit: the issue description has been updated.

/cc

@dashpole Hello, as a newcomer here, I want to check whether my understanding of this issue is correct.

  1. Currently the summary API endpoint provides redundant metrics, and we should replace it with some new endpoint that only provides the metrics needed by components like metrics-server, the scheduler, the HPA, and so on.
  2. After removing the summary endpoint as well as the direct cAdvisor endpoints, the new endpoint is, by default, the only way the kubelet provides metrics information.
  3. If users want more metrics, they should run a DaemonSet as the source of those metrics.

If the understanding above is correct, I have a few questions.

  1. What kind of metrics would the new endpoint contain? It seems the description in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/core-metrics-pipeline.md#metric-requirements gives the answer:
     • Node-level usage metrics for Filesystems, CPU, and Memory
     • Pod-level usage metrics for Filesystems, CPU, and Memory
     • Container-level usage metrics for Filesystems, CPU, and Memory
  2. Should we remove cAdvisor entirely to reduce the load on the kubelet/node, or just disable the endpoints (/stats/summary, /stats/container, /stats/{podName}/{containerName}) and still use cAdvisor as the metrics source consumed by the new endpoint?
  3. If we should remove cAdvisor entirely, where could the new endpoint get metrics from? Should we implement some kind of "light-weight cAdvisor" that collects only the metrics consumed by the new endpoint (those in point 1)?

@WanLinghao those are good questions. I have some ideas, but nothing is set in stone yet.

Understanding

  1. Information in the summary API isn't redundant, but the kubelet should be scoped as narrowly as possible.

  2. The kubelet may or may not provide some supplemental metrics in Prometheus format, depending on which metrics require knowledge only the kubelet has ("ephemeral storage", for example, is defined by the kubelet). The short answer is that this is still up in the air, but the goal is for each metric in the summary API to still exist afterwards, though possibly not from the kubelet.

  3. That is one possible solution, but isn't determined yet

Questions

  1. I am working on a proposal for this right now. You are correct that it will contain the metrics described in the core metrics proposal.

  2. The eventual goal is to remove cAdvisor, but it is likely it will be around for some time.

  3. At a high level, cAdvisor is both a cgroup discovery tool, and a cgroup monitoring tool. It discovers cgroups by recursively watching all cgroup directories using inotify, and responding to inotify events by monitoring the discovered cgroup. Since the kubelet already manages pod, allocatable (/kubepods), QoS, kubelet, runtime and misc cgroups, it doesn't need any discovery for those. The kubelet also already gets container metrics from the CRI, so it doesn't need discovery or monitoring for those. The only thing left is adding minimal monitoring to the container manager in the kubelet, which should be ~100 lines of code, rather than an entire cAdvisor. As stated before, this is not set in stone, but as a worst-case design is a marked improvement over the current state.
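
To make the "minimal monitoring" idea concrete, here is a toy sketch (not the actual design) of sampling usage for cgroups the kubelet already manages. It assumes cgroup v1 mounted at /sys/fs/cgroup and the default /kubepods hierarchy; a cgroup v2 host or a different cgroup driver would use different paths:

	package main

	import (
		"fmt"
		"os"
		"path/filepath"
		"strconv"
		"strings"
	)

	// readUint reads a single integer counter from a cgroup file.
	func readUint(path string) (uint64, error) {
		b, err := os.ReadFile(path)
		if err != nil {
			return 0, err
		}
		return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
	}

	func main() {
		// The kubelet created these cgroups itself, so no inotify-based discovery is needed.
		cgroups := []string{"/kubepods", "/kubepods/besteffort", "/kubepods/burstable"}
		for _, cg := range cgroups {
			mem, err := readUint(filepath.Join("/sys/fs/cgroup/memory", cg, "memory.usage_in_bytes"))
			if err != nil {
				fmt.Printf("%s: %v\n", cg, err)
				continue
			}
			cpu, err := readUint(filepath.Join("/sys/fs/cgroup/cpuacct", cg, "cpuacct.usage"))
			if err != nil {
				fmt.Printf("%s: %v\n", cg, err)
				continue
			}
			fmt.Printf("%s: memory=%d bytes, cpu=%d ns\n", cg, mem, cpu)
		}
	}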

@dashpole, and if a person wants to use Prometheus for monitoring, then who will provide the Prometheus-formatted metrics? Currently these are exposed through the /metrics/cadvisor endpoint on the kubelet.

Would we have to run cAdvisor as a daemonset for that purpose?

@dashpole So why would we remove the summary API?

Information in the summary API isn't redundant, but the kubelet should be scoped as narrowly as possible.

It seems we will replace the Summary API with the Core Metrics Kubelet API. What's the difference between them?

[1.14] Introduce Core Metrics Kubelet API
[1.14] Deprecate the Summary API

@sachinmsft that is not yet entirely determined. If you want the exact metrics cAdvisor produces, you will likely have to run cAdvisor as a daemonset. However, my hope is that we can find a better way which will involve a combination of metrics from the kubelet and from other sources.

@WanLinghao The Core Metrics API has a more limited set of metrics, as required by the metrics server.

Hi dashpole
I am a beginner with using the Kubernetes API server for metrics collection. I have been trying to collect various metrics using the "k8s.io/metrics/" Golang package. So far I am able to get only a few metrics, like CPU, Memory, and Ephemeral Storage of the containers. I would like to know how I should go about collecting metrics about the persistent volume usage of a Pod (if I get it per container, that's fine, I can do the addition :)) as well as "latency" (the time taken to serve an API request) for a Pod (I would like to find either the average latency or the latency per API call). Can you point me towards the correct way to get those metrics? By the way, my metrics collector application is deployed as a Pod itself, which forwards the metrics data to a renderer application for creating visualizations. Thanks in advance.

@khsahaji please ping me on slack. That isn't relevant to this issue.

/milestone v1.14
We're entering burndown and this looks relevant to the 1.14 release. Please /milestone clear if I am incorrect

Hey!

Gentle reminder that code freeze is this Friday.

Since this is part of the 1.14 milestone, is it going to be resolved soon?

@ibrasho The only PR for 1.14 is #73946. Then this can be moved to 1.15

/milestone v1.15

If you want the exact metrics cAdvisor produces, you will likely have to run cAdvisor as a daemonset.

Running cAdvisor on its own may produce the same numbers, but it does not label them in the way the Kubelet does with pod, container, etc., so it is not a replacement.

Unless I'm missing something, we would need a new program which duplicates this part of Kubelet+cAdvisor.

I may be wrong, but last I checked, all Kubernetes does is rewrite the available well-known Docker labels to pod_name, pod, container_name, and container; that should be possible to replicate with metric relabelling in Prometheus. I agree that the exact same response as from cAdvisor directly is not currently possible, although it could be made possible by implementing label replacement directly in cAdvisor.

We tried it, and resorted to creating a new program. If you describe how to do it without, that would make a great blog post.

@bboreham I opened google/cadvisor#2227 to track that.

This looks like it should be a KEP and not a tracking issue.

I see:

[1.15] Deprecate the "direct" cAdvisor API endpoints by adding and deprecating a --enable-cadvisor-json-endpoints flag

listed as the work item for 1.15; are there any updates regarding this?

/milestone clear

@justaugustus

This is one of the working items under the metrics overhaul KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20181106-kubernetes-metrics-overhaul.md#export-less-metrics

The 1.15 item was completed in #78504, so this doesn't have anything left for 1.15.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Is it possible to get an update for 1.17?
Which kubelet endpoint should I use for collecting cAdvisor metrics with a Prometheus server?
Which other kubelet endpoints are supported for direct access to additional metrics?
@juliusv
@dashpole

I think I got it now.
This is what I have understood from #59827:
If one needs to collect all cAdvisor metrics with Prometheus, don't scrape the kubelet, but rather scrape a separate/standalone cAdvisor instance running as a daemonset.
I hope this is right :)

I discussed it with David and proposed that I will work on a KEP proposing replacements for the kubelet monitoring endpoints.

@dashpole If I want to enable the subcontainers option in a kubelet API request to collect more metrics about sidecar containers, does that conflict with this proposal?

ref:

	// Whether to also include information from subcontainers.
	// Default: false.
	// +optional
	Subcontainers bool `json:"subcontainers,omitempty"`

Since progress on this has slipped multiple releases, I removed references to specific versions in which some steps are happening. Once we have a plan to move forward, we can update those with the up-to-date plan.

Hey @dashpole, is there any update on this? From what I gather, it is not yet clear when the changes will happen; however, is there any resource (design doc, open discussion, etc.) where we can keep track of updates even if decisions are not final? (I'm mostly interested in the future of stats/summary, to be honest.)

@serathius I see that you mentioned working on proposing replacements for the kubelet's monitoring endpoints. If there are any resources to keep track of that too (even in a preliminary state), it would be really nice to know. Thanks!

/unassign

I am not able to drive this work right now. This is still the place to check for updates.

@ChrsMark My team and I have been following this as well, and working on replacement options for stats/summary while we wait for guidance from the Kubernetes SIGs. Happy to discuss what we've looked at and decided on if it would be helpful, and would be interested in hearing others' solutions as well. My impression is that no in-tree replacement for stats/summary is being planned by sig-node or sig-instrumentation.

@jpepin happy to join in such a meeting/design-proposal/discussion if there is any in the future!

@dashpole - Now that --enable-cadvisor-json-endpoints defaults to false with 1.18, what's the recommended equivalent for /spec/ to get machine info? It would be great if we could provide alternatives for each of the endpoints we are deprecating (and removing).

@vishiy You can run cAdvisor as a daemonset to get the exact same information. Or, I'd recommend checking out the Prometheus node exporter, which has most of the same information available in Prometheus format and which IMO is easier to integrate into popular monitoring pipelines.
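
For example, with a standalone cAdvisor daemonset exposing its default port, machine info is available from cAdvisor's own API. A rough sketch follows; the node address is a placeholder, and the exact API path should be checked against the cAdvisor version you deploy:

	package main

	import (
		"fmt"
		"io"
		"net/http"
	)

	func main() {
		// Hypothetical address of a cAdvisor daemonset pod on the node of
		// interest, listening on cAdvisor's default port 8080.
		resp, err := http.Get("http://10.0.0.12:8080/api/v2.0/machine")
		if err != nil {
			panic(err)
		}
		defer resp.Body.Close()

		body, err := io.ReadAll(resp.Body)
		if err != nil {
			panic(err)
		}
		// JSON describing cores, memory capacity, topology, and so on.
		fmt.Println(string(body))
	}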

@dashpole I understand that external replacements are the way to go. That works well for Linux, since a container can always be given a level of host access similar to the Kubelet's.

However, I believe that's not an option on Windows. AFAIK, anything running as a container (until privileged containers are supported on Windows) cannot provide the same metrics as the Kubelet (currently network and some disk metrics - excluding CPU and memory, as those would still be available from the Kubelet).

I'm trying to understand the changes: does k8s plan to remove cAdvisor metrics from the kubelet?
I can deploy cAdvisor myself as a DaemonSet, but would I get the disk or PVC metrics that I currently get from the kubelet?
The gap between what we have today and what we plan to achieve later is unclear to me. Could anyone help me understand?

@dashpole - Thanks. But running yet another pod/workload just for a few metrics, and running several of them to cover each of these metrics, is not scalable/sustainable in my opinion. So unless the metrics server has these metrics available as a single consolidator/aggregator, we should probably rethink deprecating the remaining endpoints. It just breaks tooling & solutions people have built, and these metrics are super useful. What do you think?

@dashpole what's the plan here? Is this issue moving forward? Do we need to re-discuss it based on recent comments?

I want to make sure this PR #95763 is not going to waste.

@SergeyKanzhelev Thanks for bumping this. Given the process we recently took for Accelerator metrics, it seems like deprecating a set of metrics should have its own KEP. This was originally part of the metrics overhaul, but was removed from that KEP recently since it hadn't been finished. Given the feedback above, it seems reasonable to have a KEP in which people can discuss pros/cons of moving forward with the deprecation, and let the sig leads make the call.

Once we have a KEP for removing the cadvisor json endpoints, we should probably close this issue. I think the question of whether to deprecate and replace endpoints is probably better answered by its own KEP.

Got it. I've set the PR on hold for now. Is this something that sig-instrumentation can discuss and push forward?

Sig-node has historically handled questions of which endpoints the kubelet exposes. Sig-Instrumentation should probably be an involved sig.

@dashpole What's the best way to get involved with this KEP discussing the deprecation?

Seems relevant to this discussion: apparently a set of metrics like machine_memory_bytes and machine_cpu_cores stopped being generated by kubelet in 1.19, due to some reorganisation inside cAdvisor #95204.

I opened kubernetes/enhancements#2130, which we can use as a place to discuss concerns around removing the cAdvisor json endpoints.

node-exporter maintainer here, just to clarify: the node-exporter doesn't provide any container-level metrics.

Follow-up from discussion at Sig-Node today:

Just to clarify, kubernetes/enhancements#2130 is trying to consolidate existing endpoints, and isn't intended to remove content. Potential migration paths are covered as part of the KEP, but to summarize, most metrics being removed should be available in the /metrics/cadvisor kubelet endpoint. If you rely on metrics that are not included in /metrics/cadvisor, please describe those on the KEP.

Since the introduction of the CRI, Sig-Node has been interested in removing cAdvisor from the kubelet, since it requires container runtimes to integrate with cAdvisor to get complete metrics. Since 2016, the desired monitoring architecture has been one in which monitoring metrics are collected out-of-tree. This issue has been tracking the steps we would need to take to reach those goals, but doesn't cover specifics (such as migration steps for users), or guarantee that we will ever actually reach the desired monitoring architecture. It may very well be the case (as @smarterclayton pointed out above) that the primary monitoring endpoints are too widely used to break at this point.

As a status update:
If kubernetes/enhancements#2130 is accepted, and the monitoring server is moved to /metrics/resource, we will be in the following state in 1.21:

  • The only remaining container/pod/node monitoring endpoints on the kubelet are /metrics/cadvisor and /stats/summary.
  • Third-party device metrics are no longer collected in-tree.
  • The "monitoring" endpoints listed above are used only for monitoring, and not for kubectl top or autoscaling.
  • cAdvisor is only used to provide data for monitoring endpoints, and not for other endpoints. This may change with the introduction of pod-level metrics.
  • cAdvisor is still used to provide metrics for eviction.

That means anyone who works on this in the future can consider cAdvisor + monitoring endpoints largely in isolation.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

/remove-lifecycle stale
/lifecycle frozen

/remove-kind design
/kind feature

kind/design will soon be removed from k/k in favor of kind/feature. Relevant discussion can be found here: kubernetes/community#5641

/track

/track