Improve metrics library to enforce best practices

Question

lambdanis opened this issue 5 months ago

Many changes were made recently to improve how Prometheus metrics are defined and managed in Tetragon:

clean up metrics for deleted pods to prevent cardinality growth: #1279
label configuration for high-cardinality metrics: #1444 and follow-up refactorings #1548, #2321 and #2373
expose metrics directly from BPF maps: #1510 (and a few PRs using helpers introduced there)
initialize metrics with labels for predictable resource usage and easier queries: #2162
autogenerated metrics docs and grouping of metrics by function: #2164
multiple fixes to individual metrics
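
Two of the practices listed above, initializing series with known label values and cleaning up series for deleted pods, can be illustrated with a minimal sketch. It uses only the Go standard library and is not Tetragon's pkg/metrics API or the Prometheus client. The type podCounter, its methods, and the label values "exec" and "exit" are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// podCounter is a hypothetical stand-in for a counter with two labels,
// pod and event type. It is NOT a Tetragon or client_golang type.
type podCounter struct {
	mu         sync.Mutex
	eventTypes []string                      // known values of the event type label
	series     map[string]map[string]float64 // pod -> event type -> value
}

func newPodCounter(eventTypes []string) *podCounter {
	return &podCounter{
		eventTypes: eventTypes,
		series:     make(map[string]map[string]float64),
	}
}

// initPodLocked creates a zero-valued series for every known event type
// the first time a pod is seen. The caller must hold c.mu.
func (c *podCounter) initPodLocked(pod string) map[string]float64 {
	perPod, ok := c.series[pod]
	if !ok {
		perPod = make(map[string]float64, len(c.eventTypes))
		for _, t := range c.eventTypes {
			perPod[t] = 0
		}
		c.series[pod] = perPod
	}
	return perPod
}

// AddPod initializes all series for a pod so that queries see a stable
// set of series before the first event arrives.
func (c *podCounter) AddPod(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.initPodLocked(pod)
}

// Inc increments one series, initializing the pod's series if needed.
func (c *podCounter) Inc(pod, eventType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.initPodLocked(pod)[eventType]++
}

// DeletePod drops every series that belongs to a deleted pod, which keeps
// the number of series from growing as pods come and go.
func (c *podCounter) DeletePod(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.series, pod)
}

// Value returns the current value of one series (0 if it does not exist).
func (c *podCounter) Value(pod, eventType string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.series[pod][eventType]
}

// SeriesCount returns the total number of live series.
func (c *podCounter) SeriesCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := 0
	for _, perPod := range c.series {
		n += len(perPod)
	}
	return n
}

func main() {
	c := newPodCounter([]string{"exec", "exit"})
	c.AddPod("pod-a")
	c.Inc("pod-a", "exec")
	fmt.Println("series while pod-a exists:", c.SeriesCount())
	c.DeletePod("pod-a")
	fmt.Println("series after pod-a is deleted:", c.SeriesCount())
}
```

With the real Prometheus Go client, roughly equivalent operations would be calling WithLabelValues on a CounterVec to create a series up front and DeletePartialMatch to remove all series for a pod. How Tetragon actually wires these together is described in the PRs linked above, not in this sketch.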

Metrics now seem to be in a decent place. However, defining them is not intuitive for developers. Things like label configuration, initialization and separate helpers for docs can be confusing.

The goal of this issue is to extend the pkg/metrics library to provide an intuitive interface for defining metrics following best practices. Ideally we should also write dev docs and add metrics linting to CI.
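
As an illustration of what metrics linting in CI could look like, here is a minimal sketch using only the Go standard library. The metricDef type, the lintMetric function and the three rules are assumptions made for this example. They follow general Prometheus naming guidance (lowercase snake_case names, a _total suffix on counters, non-empty help text) and are not the checks proposed for Tetragon.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metricDef is a hypothetical description of a metric, used only here.
type metricDef struct {
	Name string
	Help string
	Kind string // "counter" or "gauge"
}

// snakeCase matches lowercase names made of letters, digits and underscores.
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9_]*$`)

// lintMetric returns a list of problems found in one metric definition.
// An empty result means the definition passed all three example rules.
func lintMetric(m metricDef) []string {
	var problems []string
	if !snakeCase.MatchString(m.Name) {
		problems = append(problems, "name should be lowercase snake_case")
	}
	if m.Kind == "counter" && !strings.HasSuffix(m.Name, "_total") {
		problems = append(problems, "counter name should end with _total")
	}
	if strings.TrimSpace(m.Help) == "" {
		problems = append(problems, "help text should not be empty")
	}
	return problems
}

func main() {
	// Example definitions; the names are made up for this sketch.
	defs := []metricDef{
		{Name: "example_events_total", Help: "Number of events.", Kind: "counter"},
		{Name: "exampleEvents", Help: "", Kind: "counter"},
	}
	for _, d := range defs {
		for _, p := range lintMetric(d) {
			fmt.Printf("%s: %s\n", d.Name, p)
		}
	}
}
```

A CI job could run such a check over all registered metric definitions and fail the build when any problem is reported.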

Anna Kapuścińska · Answer 1 · Thu Aug 08 2024 07:26:58 GMT+0800 (China Standard Time)

Here's the proposal for the Tetragon metrics framework: https://docs.google.com/document/d/1oP0hZ_yKHqflhRJdpIzaoIUzniCrKQz6_L1sxw1JVYA/edit?usp=sharing

The library part is merged: #2606

The next step is to refactor existing metrics to fully use it. This is tracked in a dedicated project: https://github.com/orgs/cilium/projects/57