stefanprodan / swarmprom

Docker Swarm instrumentation with Prometheus, Grafana, cAdvisor, Node Exporter and Alert Manager

Instance down

Dean-Christian-Armada opened this issue

Have you ever tried creating a rule that fires an alert when a node goes down?

Node Exporter and cAdvisor run on each Swarm node, so you can configure an alert on up{job="node-exporter"}.
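A minimal sketch of such a rule, written in the Prometheus 1.x rule syntax that appears later in this thread. The alert name, the 1m duration, and the severity label are illustrative assumptions, not taken from the swarmprom configuration:

```
# Assumed names and thresholds; adjust to your setup.
# up is 0 when the most recent scrape of a target failed.
ALERT node_exporter_down
  IF up{job="node-exporter"} == 0
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "node-exporter on {{ $labels.instance }} is unreachable",
    description = "Prometheus has failed to scrape {{ $labels.instance }} for more than 1 minute.",
  }
```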

I don't think that is effective enough: the 0 value for a given node-exporter does not stay present for long. It also shows only the instance IP and not the node_name. I tried grouping it with node_name, but then nothing shows up at all. Please see the screenshots below.

[Screenshot: up with a down node-exporter]

[Screenshot: up grouped with node_meta]

You can use IF absent(node_meta) FOR 5m
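Expanded into full rules, that suggestion might look like the sketch below (Prometheus 1.x syntax; the alert names, severity labels, and annotation text are assumptions). absent() returns a single series with value 1 when its argument matches nothing and returns no result otherwise, so absent(node_meta) without label matchers only produces a result when every node_meta series is gone. The second rule assumes node_meta carries a node_name label and uses swarm-node-2 purely as an example value, to target one specific node:

```
# Produces a result only when no node_meta series exists at all.
ALERT node_meta_absent
  IF absent(node_meta)
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "No node_meta series found for any Swarm node",
  }

# Produces a result when the series for one named node is missing.
# Assumes a node_name label on node_meta; swarm-node-2 is an example.
ALERT swarm_node_2_absent
  IF absent(node_meta{node_name="swarm-node-2"})
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "node_meta for {{ $labels.node_name }} is missing",
  }
```

Because the per-node selector uses an equality matcher, the series returned by absent() carries that node_name label, which is why the annotation can reference it.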

Hi @stefanprodan, what value should I expect from the absent(node_meta) query? My case is that the alert should fire even if just a single node goes down. Specifically, my "swarm-node-2" went down.

The screenshot below shows what was returned when I intentionally took down swarm-node-2.

[Screenshot: query result with swarm-node-2 down]

@Dean-Christian-Armada, I am facing the same problem. I want to create a rule that fires whenever a node is down.
Also, if a container is down, I should get an alert for that as well.

@abhisheks-cuelogic, by "container down", do you mean that if, say, a Python container goes down, it should trigger an alert? I don't think that is possible for containers. Prometheus needs node-exporter or another scrape target to collect metrics, unless there is an agent that can be installed inside the container to detect that it went down.

The container itself does not need to send the alert. Can we use something like:

ALERT piwik_nginx
  IF count(time() - container_last_seen{name=~"^piwik_nginx.*"} < 60)
  ANNOTATIONS {
    summary = "piwik_nginx container is down",
    description = "piwik_nginx is down for more than 1 minute",
  }

I tried this rule, but somehow the alert is always active, even when the container is up.
[Screenshot: Prometheus alert shown as active]
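One possible explanation, offered as a sketch rather than a confirmed diagnosis: Prometheus fires an alert for every series the IF expression returns, whatever its value. count(...) returns a series (with value 1 or more) as long as a matching container was seen in the last 60 seconds, and returns nothing when none match, which is the opposite of the intent. A variant that returns a result only when no matching container has been seen recently could wrap the same filter in absent(). The alert name and severity label below are assumptions:

```
# Produces a result only when no piwik_nginx container
# has a container_last_seen timestamp within the last 60 seconds.
ALERT piwik_nginx_down
  IF absent(time() - container_last_seen{name=~"^piwik_nginx.*"} < 60)
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "piwik_nginx container is down",
    description = "No piwik_nginx container has been reported by cAdvisor in the last 60 seconds.",
  }
```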

@stefanprodan, we need your advice.