k8s: Redeploying Prometheus failed after pod deletion

Question

0xErnie opened this issue 3 years ago

We are running ClusterControl on Kubernetes.

Monitoring worked for some time, but now we get the following errors on a regular basis:

[19:57:12]:clustercontrol:9090: Recovery failure: -------
[19:57:12]:clustercontrol:9090: Couldn't start Prometheus: Command exited with return code 127 on host clustercontrol.
Command: cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0
stdErr: nohup: failed to run command 'daemon': No such file or directory
.
[19:57:12]:clustercontrol:9090: starting script:
----8<-----8<-----8<-----
 cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0
----8<-----8<-----8<-----

[19:57:12]:clustercontrol:       Error nohup: failed to run command 'daemon': No such file or directory
.
[19:57:12]:clustercontrol: Command was cd /var/lib/prometheus || true
export PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export HOME=/var/lib/prometheus
nohup daemon --name=prometheus --inherit --output=/var/log/prometheus/prometheus.log --env="PATH=/usr/local/bin:/usr/bin:$PATH"  --chdir=/var/lib/prometheus  --pidfile=/var/run/prometheus/prometheus.pid --user=prometheus -- prometheus  --config.file=/etc/prometheus/prometheus.yml  --storage.tsdb.retention.time=15d --storage.tsdb.retention.size=0.
[19:57:12]:clustercontrol: Execution failed with return code 127.
[19:57:11]:clustercontrol:9090: Recovering 'prometheus'.
Job spec:
Recovering monitoring system.
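
Return code 127 is the shell's conventional status for "command not found", and the `nohup` message points at `daemon` rather than at `prometheus`. A minimal sketch for confirming that from inside the container, assuming a POSIX shell is available there (the helper name `check_cmd` is made up for illustration):

```shell
#!/bin/sh
# check_cmd is a hypothetical helper, not part of ClusterControl.
# It reports whether a command resolves on the current PATH and
# returns 127 (the shell's "command not found" status) when it does not.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 127
    fi
}

# Use the same PATH that the failing start script exports.
PATH=/usr/local/bin:/usr/bin:/bin:$PATH
export PATH

check_cmd daemon || echo "daemon is missing, so nohup cannot start it"
check_cmd prometheus || echo "prometheus is missing as well"
```

If `daemon` is reported as missing while `prometheus` is found, the start script itself is probably fine and only the `daemon` package needs to be present in the image.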

Alexander Kauerz · Answer 1 · Tue Oct 05 2021 18:39:50 GMT+0800 (China Standard Time)

It also seems that something is wrong with the local CAs:

# wget http://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
--2021-10-05 10:36:32--  http://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
Resolving libslack.org (libslack.org)... 139.99.156.21, 2402:1f00:8100:400::31
Connecting to libslack.org (libslack.org)|139.99.156.21|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm [following]
--2021-10-05 10:36:33--  https://libslack.org/daemon/download/daemon-0.6.4-1.x86_64.rpm
Connecting to libslack.org (libslack.org)|139.99.156.21|:443... connected.
ERROR: cannot verify libslack.org's certificate, issued by '/C=US/O=Let\'s Encrypt/CN=R3':
  Issued certificate has expired.
To connect to libslack.org insecurely, use `--no-check-certificate'.
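
To tell an expired server certificate apart from a problem in the certificate chain or the local CA bundle, it can help to look at what the server actually presents. A rough sketch, assuming the `openssl` command-line tool is installed and the host is reachable; the function names `show_cert_dates` and `cert_is_current` are invented for this example:

```shell
#!/bin/sh
# Hypothetical helpers for inspecting certificates with the openssl CLI.

# Print the issuer and validity window of the certificate a server presents.
# Needs network access to the host; the port defaults to 443.
show_cert_dates() {
    host=$1
    port=${2:-443}
    # Reading stdin from /dev/null makes s_client exit after the handshake.
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        openssl x509 -noout -issuer -dates
}

# Succeed if the PEM certificate in file $1 has not expired yet.
cert_is_current() {
    openssl x509 -in "$1" -noout -checkend 0 >/dev/null
}

# Example (not run automatically because it needs network access):
#   show_cert_dates libslack.org
```

If the `notAfter` date printed for libslack.org is still in the future, the error above more likely comes from the rest of the chain or from the trust store on the client than from the server certificate itself.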

Ashraf Sharif · Answer 2 · Wed Oct 13 2021 10:34:59 GMT+0800 (China Standard Time)

Hi @0xErnie

While I am trying to reproduce this problem, would it be possible for you to share the Kubernetes YAML file here? You may redact the sensitive info beforehand.

Alexander Kauerz · Answer 3 · Wed Oct 13 2021 19:14:24 GMT+0800 (China Standard Time)

Sure, you can find it in this gist.

Ashraf Sharif · Answer 4 · Wed Oct 13 2021 20:48:11 GMT+0800 (China Standard Time)

I managed to reproduce this issue and a fix has been pushed. Please update to the latest 1.9.0 image from the Docker hub (tag: 1.9.0, 1.9.0-2, or latest).

Alexander Kauerz · Answer 5 · Wed Oct 13 2021 21:59:06 GMT+0800 (China Standard Time)

@ashraf-s9s Thank you, this resolves the redeployment issue, but Prometheus is still missing.
I created a follow-up here: #33