severalnines / docker

ClusterControl docker image

k8s: Prometheus missing after restart of pod

0xErnie opened this issue

This is a follow-up to #32.

We are running ClusterControl on Kubernetes.
Our manifests can be found in this gist.
After the pod is deleted, Prometheus fails to start.

Steps to reproduce:

  1. Run ClusterControl on Kubernetes
  2. Enable agent-based monitoring on one cluster
  3. Wait until monitoring is ready
  4. Delete the pod
  5. Wait until a new pod is running
  6. Check the monitoring of that cluster
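Steps 5 and 6 can be partly scripted. The following is a minimal sketch, not part of ClusterControl, that polls Prometheus's standard `/-/ready` endpoint once the new pod is running. It assumes Prometheus is reachable from where the script runs (for example through `kubectl port-forward` to the ClusterControl pod) on Prometheus's default port 9090; the URL and the helper names `prometheus_ready` and `wait_for_prometheus` are hypothetical.

```python
import http.client
import time
import urllib.request

# Assumption: Prometheus is reachable here, e.g. via `kubectl port-forward`.
# 9090 is Prometheus's default port; adjust to match your deployment.
PROMETHEUS_READY_URL = "http://127.0.0.1:9090/-/ready"


def prometheus_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if Prometheus answers its readiness endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError (e.g. 503 while not ready), connection refused and
        # timeouts are all OSError subclasses; treat every failure as "not ready".
        return False


def wait_for_prometheus(url: str, attempts: int = 30, delay: float = 5.0) -> bool:
    """Poll the readiness endpoint until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if prometheus_ready(url):
            print(f"Prometheus ready after {attempt} attempt(s)")
            return True
        if attempt < attempts:
            time.sleep(delay)
    print("Prometheus did not become ready; monitoring is likely broken")
    return False
```

For example, calling `wait_for_prometheus(PROMETHEUS_READY_URL)` after step 5 returns `False` in the failing case described here. Note that this only checks that the Prometheus process itself is up; it does not confirm that ClusterControl's dashboards or scrape targets are working.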

What I expect to happen:
Monitoring should stay up and running.

What happens:
Monitoring is not working anymore.
[Screenshot, 2021-10-13 at 15:30:56]

Jobs named "recovering monitoring system" are flooding the job log.
cc-failed-recovery.txt

How to resolve the situation temporarily:

  1. Open the monitoring view of one cluster
  2. Trigger "Re-Enable Agent Based Monitoring"
  3. Wait until the job is finished
    cc-redeployment.txt

Fixing this requires Prometheus to be preloaded into the image; otherwise, monitoring would have to be re-enabled every time the pod is rescheduled. I will need to spend some time on testing, and I am working on it now.

Hi @0xErnie,

I have pushed commit c2e0e13, so the image now comes pre-built with Prometheus. It is available on Docker Hub under the following tags: latest, 1.9.0-3, and 1.9.0.

I am closing this issue now. Thank you.