Offering Observability (and other) Templated Links/Queries by Use Case

vlerenc opened this issue 6 months ago · comments

What would you like to be added:
@ashwani2k proposed a way (and showed a text-based interactive prototype) to make it simpler to collect templated links/queries and share them with operators and end users alike. This is especially handy for Prometheus and Vali (=Loki). An MCM colleague contributed his personal link/query collection (task-oriented, organized by use case), and Ashwani put it into a machine-readable format (see https://github.tools.sap/kubernetes/ops-guide/pull/742, which is not accessible to everybody; a small excerpt follows), e.g.:

categories:
- title: Observability
  description: "Useful collection of links and queries (with placeholders) for shoot clusters, grouped by categories/use cases, to support Gardener operators in their tasks."
  categories:
  - title: Machines
    description: "Everything around machines, i.e. backing VMs as well as Kubernetes nodes."
    categories:
    - title: Scale Up
      description: "Identify whether scale up was triggered by CA or not."
      queries:
      - title: Check the number of nodes which were scaled up.
        query:
          type: prom
          expression: '"shoot:kube_node_info:count"'
      - title: Check if CA has triggered the scale up.
        query:
          type: vali
          expression: '{container_name="cluster-autoscaler"} |~ "Final scale-up" |~ "shoot--$projectName--$shootName-$worker-pool"'
    - title: Scale Down
      description: "Identify whether scale down was triggered by CA or not."
      queries:
        ...
    - title: Upgrade
      description: "Check whether the upgrade is stuck due to any error in MCM or due to PDB violation."
      queries:
      - title: Check for errors for any machine in a worker-pool for a given provider.
        query:
          type: vali
          expression: '{container_name="machine-controller-manager-provider-$provider"} |~ "shoot--$projectName--$shootName-$worker-pool" |~ "machine codes error"'
      - title: Check if drain is stuck due to a PDB violation.
        query:
          type: vali
          expression: '{container_name="machine-controller-manager-provider-$provider"} |~ "could not be evicted from node" |~ "occur due to PDB violation"'
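Filling the placeholders in such an expression could look roughly like the following sketch. It assumes that a placeholder is written as `$name` and that a name may contain single hyphens (so `$worker-pool` is read as one placeholder); the excerpt does not define this syntax, so that rule is an assumption. The function names `placeholders` and `fill` and the sample values are hypothetical.

```python
import re

# Assumed placeholder syntax: "$name", where a name may contain single hyphens
# between alphanumeric parts (so "$worker-pool" is treated as one placeholder).
PLACEHOLDER = re.compile(r"\$([A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*)")


def placeholders(expression: str) -> list[str]:
    """Return the distinct placeholder names in order of first appearance."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(expression):
        if name not in seen:
            seen.append(name)
    return seen


def fill(expression: str, values: dict[str, str]) -> str:
    """Substitute every placeholder; raise KeyError if a value is missing."""
    missing = [name for name in placeholders(expression) if name not in values]
    if missing:
        raise KeyError(f"missing values for placeholders: {missing}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], expression)


# Expression taken from the "Scale Up" category of the excerpt.
expression = (
    '{container_name="cluster-autoscaler"} |~ "Final scale-up" '
    '|~ "shoot--$projectName--$shootName-$worker-pool"'
)

# Hypothetical values, for illustration only.
values = {"projectName": "demo", "shootName": "test", "worker-pool": "pool-a"}

print(placeholders(expression))
print(fill(expression, values))
```

A client could use `placeholders` to prompt for the missing values before calling `fill`. Because the values end up inside a regular-expression match (`|~`), a real implementation would probably also need to escape them.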

It would be great to make those links/queries available in the Gardener Dashboard (and maybe also in https://github.com/gardener/gardenctl-v2) for the benefit of everybody. The Dashboard is a natural fit because it is a GUI, just like Plutono (=Grafana).

Why is this needed:
We do not share domain-specific knowledge well enough (within a team, with adopters/our community, with end users), and even where individuals have personal notes, those notes most often cover only one specific subject matter. Newcomers start off with nothing. End users also have nothing and are even further away from our observability stack. All of them would benefit from a curated list of templated links/queries to analyse their issues and understand what puzzles them.

Vedran Lerenc · Answer 1 · Wed Feb 07 2024 04:29:01 GMT+0800 (China Standard Time)

@petersutter commented out-of-band that the configuration could live "in some cluster / in some configmap". It could be maintained in GitHub and deployed automatically, without fear of breaking anything. Tools like the Dashboard (or gardenctl) could fetch it on-the-fly and show its content as needed. When used in the context of tickets, modern LLMs can help select the most appropriate links and fill in the placeholders. That is actually also possible for the Dashboard (or gardenctl) if "there is space" to ask a question, but that's a next step (if at all). For now, it would be great to:

- Lower the entry barrier (as compared to Plutono(=Grafana) dashboards in Gardener itself) and facilitate a low-risk way to collect/capture expert knowledge
- ...and make this information easily accessible in the clients Gardener offers, predominantly the Dashboard (but eventually also gardenctl with a focus on text-based access, e.g. log queries to be further processed with grep, sed, awk, etc.)
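For text-based access, a client would first have to flatten the nested `categories` tree. The sketch below shows one possible way to do that, using a hand-written, abridged stand-in for the parsed YAML excerpt (a real client would parse the fetched configuration instead). The function name `walk` and the tab-separated output format are assumptions, not part of any existing Gardener tool.

```python
from typing import Iterator

# Abridged stand-in for the parsed YAML excerpt from the issue description.
catalog = {
    "categories": [{
        "title": "Observability",
        "categories": [{
            "title": "Machines",
            "categories": [
                {
                    "title": "Scale Up",
                    "queries": [
                        {
                            "title": "Check the number of nodes which were scaled up.",
                            "query": {"type": "prom",
                                      "expression": '"shoot:kube_node_info:count"'},
                        },
                    ],
                },
                {
                    "title": "Upgrade",
                    "queries": [
                        {
                            "title": "Check if drain is stuck due to a PDB violation.",
                            "query": {"type": "vali",
                                      "expression": '{container_name="machine-controller-manager-provider-$provider"} |~ "could not be evicted from node" |~ "occur due to PDB violation"'},
                        },
                    ],
                },
            ],
        }],
    }],
}


def walk(node: dict, path: tuple[str, ...] = ()) -> Iterator[tuple[str, str, str, str]]:
    """Yield (category path, query type, title, expression) for every query."""
    for entry in node.get("queries") or []:
        query = entry["query"]
        yield "/".join(path), query["type"], entry["title"], query["expression"]
    for child in node.get("categories") or []:
        yield from walk(child, path + (child["title"],))


rows = list(walk(catalog))
for row in rows:
    # One tab-separated line per query, convenient for grep, sed, awk, etc.
    print("\t".join(row))
```

Emitting one line per query keeps the category path next to each expression, so a filter such as `grep vali` or `grep "Scale Up"` narrows the list without the client needing its own search feature.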

Vedran Lerenc · Answer 2 · Wed Feb 07 2024 04:35:48 GMT+0800 (China Standard Time)

@ScheererJ commented out-of-band that this idea can be expanded beyond links/queries, e.g. to also run pre-defined scripts (something like stored procedures for ops) in the browser, using technology similar to what we already use for web terminals.

Gardener Robot · Answer 3 · Sat Feb 17 2024 23:49:49 GMT+0800 (China Standard Time)

@vlerenc You have mentioned internal references in the public. Please check.