coder / observability

Epic: Bundled Observability

dannykopping opened this issue

This is the holding issue for the set of improvements we want to make based on RFC: Bundled Observability.

Goal: produce a separate Helm chart, installable with one script, to observe a Coder deployment. Each sub-chart should contain at least one dashboard, alert, and runbook covering its own functional requirements (e.g. the Grafana installation should have a dashboard, alert, and runbook so that operators can observe it). The Coder deployment itself should be covered by several dashboards, alerts, and runbooks, and we should collect all applicable telemetry signals (metrics, logs, traces, profiles).

Initial Requirements:

Eventual Requirements:

  • Add & configure profiling chart, add Grafana datasource
  • Add & configure tracing chart, add Grafana datasource

Adhoc task list:

General

  • handle unmanaged resources (i.e. resources created but subsequently not managed by Helm)
    • make monitoring.prod deletes the existing manifests before generating new ones. If a Helm chart value is changed to exclude a resource which was previously present, its manifest will be missing from the new output, and Flux will therefore remove the resource
  • find a way to include common patches & other configs in local/prod kustomizations
  • prevent secrets from charts being detected by {en,de}crypt_secrets.sh https://github.com/coder/dogfood/pull/56
  • add githook to detect manifest differences locally https://github.com/coder/dogfood/pull/56
  • add linter to detect manifest differences before PR merge is allowed https://github.com/coder/dogfood/pull/57
  • add differ to show the impact of the new manifests on the existing cluster
  • https://github.com/coder/dogfood/issues/53
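The "differ" item above could start from something like the following. This is only a minimal sketch, assuming manifests have already been parsed into dictionaries (for example by a YAML loader) and that a resource is identified by its kind, namespace, and name. The names ResourceKey, resource_keys, and diff_manifests are hypothetical and not part of the dogfood repository; the sample resources are made up for illustration.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class ResourceKey(NamedTuple):
    """Identity of a Kubernetes resource (assumed: kind + namespace + name)."""
    kind: str
    namespace: str
    name: str


def resource_keys(manifests: Iterable[dict]) -> set[ResourceKey]:
    """Collect the identity of every non-empty manifest document."""
    keys = set()
    for doc in manifests:
        if not doc:  # skip empty documents (e.g. a trailing '---')
            continue
        meta = doc.get("metadata", {})
        keys.add(ResourceKey(doc.get("kind", ""),
                             meta.get("namespace", ""),
                             meta.get("name", "")))
    return keys


def diff_manifests(old: Iterable[dict], new: Iterable[dict]) -> dict:
    """Report resources that would be removed or added by the new manifests."""
    old_keys, new_keys = resource_keys(old), resource_keys(new)
    return {
        "removed": sorted(old_keys - new_keys),  # Flux would prune these
        "added": sorted(new_keys - old_keys),
    }


# Hypothetical example: a chart value change drops the Loki ConfigMap.
old = [
    {"kind": "Deployment", "metadata": {"name": "grafana", "namespace": "monitoring"}},
    {"kind": "ConfigMap", "metadata": {"name": "loki-config", "namespace": "monitoring"}},
]
new = [
    {"kind": "Deployment", "metadata": {"name": "grafana", "namespace": "monitoring"}},
]
result = diff_manifests(old, new)
```

A githook or linter could fail when the "removed" list is non-empty and unexpected, which would surface the Flux pruning behaviour described under "handle unmanaged resources" before merge.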

Prometheus / Alertmanager

  • ignore container ports discovered with kubernetes_sd_config which do not expose metrics (e.g. Loki's gRPC ports); use extraScrapeConfigs for this
  • alert source addresses should use FQDN (prometheus-server.monitoring.svc.cluster.local)
  • include runbook link via label
  • test out slack receiver
  • optimise labels and unify with loki labels
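Two of the items above (FQDN alert source addresses and a runbook link carried as a label) can be illustrated as follows. This is a rough sketch only: the helper names service_fqdn and with_runbook, the runbook_url label name, the GrafanaDown alert, and the example.com runbook base URL are all assumptions for illustration, not what the chart actually uses. The rule dictionary follows the usual Prometheus alerting-rule fields (alert, expr, for, labels).

```python
def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster FQDN of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def with_runbook(rule: dict, runbook_base: str) -> dict:
    """Return a copy of an alert rule with a runbook link added as a label.

    The label name 'runbook_url' and the URL layout are assumptions.
    """
    labels = dict(rule.get("labels", {}))  # copy so the input is not mutated
    labels["runbook_url"] = f"{runbook_base.rstrip('/')}/{rule['alert']}"
    return {**rule, "labels": labels}


# Hypothetical alert rule and runbook location.
rule = {
    "alert": "GrafanaDown",
    "expr": "up == 0",
    "for": "5m",
    "labels": {"severity": "critical"},
}
annotated = with_runbook(rule, "https://example.com/runbooks/")
alert_source = service_fqdn("prometheus-server", "monitoring")
```

Carrying the link as a label (rather than only an annotation) would make it available for routing and grouping in Alertmanager, at the cost of it becoming part of the alert's identity.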

Grafana

Loki

  • optimise labels
  • scrape logs from all pods
  • add cluster label to loki metrics

Can we add this as a step in our Kubernetes install docs as Recommended, but optional? https://coder.com/docs/v2/latest/install/kubernetes

Absolutely, will add an item