severalnines / docker

ClusterControl docker image

k8s: Redeploying Prometheus failed after pod deletion

0xErnie opened this issue · comments

We are running ClusterControl on Kubernetes.

Monitoring seemed to work for some time, but now we regularly get the following errors:

[19:57:12]:clustercontrol:9090: Recovery failure: -------
[19:57:12]:clustercontrol:9090: Couldn't start Prometheus: Command exited with return code 127 on host clustercontrol.
Command: cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0
stdErr: nohup: failed to run command 'daemon': No such file or directory
.
[19:57:12]:clustercontrol:9090: starting script:
----8<-----8<-----8<-----
 cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0
----8<-----8<-----8<-----

[19:57:12]:clustercontrol: Error nohup: failed to run command 'daemon': No such file or directory
.
[19:57:12]:clustercontrol: Command was cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0.
[19:57:12]:clustercontrol: Execution failed with return code 127.
[19:57:11]:clustercontrol:9090: Recovering 'prometheus'.
Job spec:
Recovering monitoring system.
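
For anyone hitting the same log: return code 127 is the conventional "command not found" status, which matches the "failed to run command 'daemon'" message from nohup. A minimal sketch of how one might confirm this from a shell inside the container follows. The helper name check_command is hypothetical and is not part of ClusterControl.

```shell
#!/bin/sh
# Hypothetical helper (not part of ClusterControl): report whether a
# command can be resolved on the current PATH.
# Returns 0 if found, 127 (the usual "command not found" status) otherwise.
check_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
        return 0
    fi
    echo "$1: not found on PATH" >&2
    return 127
}

# Use the same PATH as the recovery script in the log above.
PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export PATH

# 'daemon' is the wrapper the recovery job fails to run,
# 'prometheus' is the binary it is supposed to start.
for cmd in daemon prometheus; do
    check_command "$cmd" || echo "missing: $cmd"
done
```

If daemon is reported as missing, the recovery job cannot start Prometheus regardless of the Prometheus configuration itself.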

It also seems that something is wrong with the local CAs:

# wget http://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
--2021-10-05 10:36:32--  http://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
Resolving libslack.org (libslack.org)... 139.99.156.21, 2402:1f00:8100:400::31
Connecting to libslack.org (libslack.org)|139.99.156.21|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm [following]
--2021-10-05 10:36:33--  https://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
Connecting to libslack.org (libslack.org)|139.99.156.21|:443... connected.
ERROR: cannot verify libslack.org's certificate, issued by '/C=US/O=Let\'s Encrypt/CN=R3':
  Issued certificate has expired.
To connect to libslack.org insecurely, use `--no-check-certificate'.

Hi @0xErnie

While I am trying to reproduce this problem, would it be possible for you to share the Kubernetes YAML file here? You may redact the sensitive info beforehand.

Sure, you can find it in this gist.

I managed to reproduce this issue, and a fix has been pushed. Please update to the latest 1.9.0 image from Docker Hub (tag: 1.9.0, 1.9.0-2, or latest).

@ashraf-s9s Thank you, this resolves the redeployment issue, but Prometheus is still missing.
I created a follow-up here: #33