cilium / tetragon

eBPF-based Security Observability and Runtime Enforcement

Home Page: https://tetragon.io

Improve metrics library to enforce best practices

lambdanis opened this issue · comments

Many changes were made recently to improve how Prometheus metrics are defined and managed in Tetragon:

  • clean up metrics for deleted pods to prevent growing cardinality: #1279
  • label configuration for high-cardinality metrics: #1444 and follow-up refactorings #1548, #2321 and #2373
  • expose metrics directly from BPF maps: #1510 (and a few PRs using helpers introduced there)
  • initialize metrics with labels for predictable resource usage and easier queries: #2162
  • autogenerated metrics docs and grouping metrics by function: #2164
  • multiple fixes to individual metrics
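
To make two of the items above concrete (initializing series with known label values, and cleaning up series for deleted pods), here is a minimal sketch. It uses only the Go standard library; `counterVec`, `Init` and `DeleteLabel` are hypothetical names made up for illustration, not the Prometheus client API and not Tetragon's actual helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// counterVec is a toy stand-in for a labeled counter (hypothetical type).
type counterVec struct {
	labels []string           // label names, in order
	series map[string]float64 // key: label values joined by "\x00"
}

func newCounterVec(labels ...string) *counterVec {
	return &counterVec{labels: labels, series: map[string]float64{}}
}

func seriesKey(values []string) string { return strings.Join(values, "\x00") }

// Init creates a zero-valued series for every combination of the given
// label values, so the number of series is known from startup.
func (c *counterVec) Init(values ...[]string) {
	if len(values) != len(c.labels) {
		panic("Init: need one value list per label")
	}
	combos := [][]string{{}}
	for _, vs := range values {
		var next [][]string
		for _, combo := range combos {
			for _, v := range vs {
				extended := append(append([]string{}, combo...), v)
				next = append(next, extended)
			}
		}
		combos = next
	}
	for _, combo := range combos {
		k := seriesKey(combo)
		if _, ok := c.series[k]; !ok {
			c.series[k] = 0
		}
	}
}

// Inc increments the series identified by the given label values.
func (c *counterVec) Inc(values ...string) {
	if len(values) != len(c.labels) {
		panic("Inc: need one value per label")
	}
	c.series[seriesKey(values)]++
}

// DeleteLabel removes every series whose label `name` equals `value`
// and returns how many series were removed.
func (c *counterVec) DeleteLabel(name, value string) int {
	idx := -1
	for i, l := range c.labels {
		if l == name {
			idx = i
		}
	}
	if idx < 0 {
		return 0
	}
	removed := 0
	for k := range c.series {
		if strings.Split(k, "\x00")[idx] == value {
			delete(c.series, k)
			removed++
		}
	}
	return removed
}

func main() {
	// Known label values: all series exist (at zero) before any event.
	errs := newCounterVec("type")
	errs.Init([]string{"map_full", "parse_failed"})
	fmt.Println(len(errs.series)) // 2

	// Pod names are not known in advance: series appear on first use
	// and are dropped when the pod is deleted.
	events := newCounterVec("pod", "type")
	events.Inc("pod-a", "exec")
	events.Inc("pod-a", "exit")
	events.Inc("pod-b", "exec")
	removed := events.DeleteLabel("pod", "pod-a")
	fmt.Println(removed, len(events.series)) // 2 1
}
```

The split matters because only labels with a fixed set of values can be initialized up front; series for dynamic labels such as pod names have to be removed explicitly, otherwise cardinality keeps growing.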

Metrics now seem to be in a decent place. However, defining them is not intuitive for developers. Label configuration, initialization and the separate helpers for docs can all be confusing.

The goal of this issue is to extend the pkg/metrics library to provide an intuitive interface for defining metrics following best practices. Ideally we should also write dev docs and add metrics linting to CI.
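
As a rough illustration of what metrics linting could check, here is a small standard-library sketch. The `counterDef` and `label` types and the `lint` function are hypothetical and are not the pkg/metrics interface. The rules are assumptions chosen for the example: a snake_case name, the `_total` suffix conventionally used for Prometheus counters, non-empty help text, and a warning for labels without a fixed set of values.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// label describes one label of a metric (hypothetical type).
type label struct {
	Name   string
	Values []string // allowed values; empty means unconstrained
}

// counterDef is a declarative counter definition (hypothetical type).
type counterDef struct {
	Name   string
	Help   string
	Labels []label
}

// snakeCase is an assumed naming rule, stricter than what Prometheus allows.
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// lint returns a list of problems found in the definition, in a fixed order.
func lint(d counterDef) []string {
	var problems []string
	if !snakeCase.MatchString(d.Name) {
		problems = append(problems, "name is not snake_case")
	}
	if !strings.HasSuffix(d.Name, "_total") {
		problems = append(problems, "counter name should end in _total")
	}
	if strings.TrimSpace(d.Help) == "" {
		problems = append(problems, "help text is empty")
	}
	for _, l := range d.Labels {
		if len(l.Values) == 0 {
			problems = append(problems,
				fmt.Sprintf("label %q has no fixed set of values", l.Name))
		}
	}
	return problems
}

func main() {
	good := counterDef{
		Name:   "tetragon_example_events_total",
		Help:   "Number of example events.",
		Labels: []label{{Name: "type", Values: []string{"exec", "exit"}}},
	}
	bad := counterDef{
		Name:   "EventsCount",
		Labels: []label{{Name: "pod"}},
	}
	fmt.Println(len(lint(good))) // 0
	for _, p := range lint(bad) {
		fmt.Println(p)
	}
}
```

A check like this could run as a unit test over all registered definitions. Unconstrained labels are not always wrong (pod labels are a legitimate case), so a real linter would probably allow explicit opt-outs rather than rejecting them outright.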

Here's the proposal for the Tetragon metrics framework: https://docs.google.com/document/d/1oP0hZ_yKHqflhRJdpIzaoIUzniCrKQz6_L1sxw1JVYA/edit?usp=sharing

The library part is merged: #2606

The next step is to refactor existing metrics to fully use it. This is tracked in a dedicated project: https://github.com/orgs/cilium/projects/57