Only see 2 nodes out of the 3 masters
lucj opened this issue
I have deployed swarmprom on a 3-node cluster on Docker for AWS. All nodes are masters and are running fine, but only 2 nodes are listed in Grafana; a couple of my app stacks are also missing.
All the swarmprom services seem to run fine though.
Any hints?
btw, thanks a lot, really great project! 👍
Hi @lucj, can you please check the Prometheus UI and see if that node is reachable?
In my current config, I have an nginx TLS termination in front of Caddy where only port 3000 is proxied. If I proxy the other ports (9090, 9093, 9094), I cannot get the whole stack to work correctly.
I don't know if this is related, but I see several errors like the following in Caddy:
"[ERROR 502 /] dial tcp: lookup grafana on 127.0.0.11:53: no such host"
OK, I activated SSL in Caddy instead of running another nginx in front.
The Prometheus console only shows 2 nodes, but Swarm has 3 nodes and each one is working fine:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ugk...909luj * xxx.compute.internal Ready Active Reachable 17.12.1-ce
ms7...yobt1n yyy.compute.internal Ready Active Leader 17.12.1-ce
mcd...lb6zpq zzz.compute.internal Ready Active Reachable 17.12.1-ce
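For reference, one way to cross-check which node-exporter instances Prometheus is actually scraping is its targets API. This is only a sketch: the published port (9090) and the job name node-exporter are assumptions, and swarmprom's Caddy usually protects that port with basic auth, so you may need curl -u user:password. It also requires curl and jq on the machine you run it from.
~ $ curl -s http://<manager-ip>:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(.labels.job == "node-exporter") | "\(.labels.instance) \(.health)"'
If only 2 instances show up as "up" here, the missing node is not being scraped at all rather than being dropped by Grafana.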
Definitely not linked to swarmprom.
The problem seems to be a connectivity issue in the mon_net network.
I guess it got into a bad state after too many starts and stops.
@stefanprodan just for the record: the reason why the 3rd node was not seen is that the mon_net network remained on this particular node after the stack was removed (this is a known bug, moby/moby#35204).
When the stack was started again, the mon_net network was recreated with the same subnet, which prevented the node-exporter from attaching to the network since the subnet was already in use:
c61dob322u72hskjyk5nn0dzg mon_node-exporter.ms71f27l8jwtqexhr53yobt1n stefanprodan/swarmprom-node-exporter:v0.15.2@sha256:0575845ee924fa91138804663a12207ed53a56542d257273ffb9b30e22b78cd1 ip-172-31-27-219.eu-west-1.compute.internal Ready Rejected 3 seconds ago "failed to allocate gateway (10.0.0.1): Address already in use"
zxv4pi94uii1gmfok9c9b28mp \_ mon_node-exporter.ms71f27l8jwtqexhr53yobt1n
...
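A quick way to confirm the leftover network and its subnet on the affected node is shown below. This is just a sketch; it assumes SSH access to that node and that the stale network kept the stack's default name mon_net:
~ $ docker network ls --filter name=mon_net
~ $ docker network inspect mon_net --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
If the subnet printed there is the same one the freshly created mon_net tries to allocate, you end up with exactly the "Address already in use" rejection above.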
The thing is, I cannot remove the network from this particular node: a container is still connected to it and, for an unknown reason, refuses to be removed as well (!).
~ $ docker network rm mon_net
Error response from daemon: network mon_net id 1d34qanxyqqxaytteudtd1pfj has active endpoints
~ $ docker network inspect mon_net
...
"Containers": {
"80c0fbb70da2342b82367897ddd1028593f70a87b2b5929804b4ee527c10b12f": {
"Name": "mon_caddy.1.vrm2namdg5vc2mk9fqcdbk1xs",
"EndpointID": "c3d1aeb8b3089a51e0ec3c05703b7c6d81e2a9f95238a6aa1291f38e1c84bdb3",
"MacAddress": "02:42:0a:00:00:16",
"IPv4Address": "10.0.0.22/24",
"IPv6Address": ""
}
},
...
~ $ docker container rm -f 80c0fbb70da2342b
=> hanging....
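A workaround that sometimes clears such a stale endpoint (just a sketch, not guaranteed to work on every engine version) is to force-disconnect the container from the network before removing it:
~ $ docker network disconnect --force mon_net mon_caddy.1.vrm2namdg5vc2mk9fqcdbk1xs
~ $ docker network rm mon_net
If even the forced disconnect hangs, restarting the Docker daemon on that node is usually the only remaining option.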
I then specified another network with its own subnet and it's working fine:
networks:
  netx:
    driver: overlay
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: 10.0.20.0/24
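For completeness, a minimal sketch of how the services then attach to that network in the stack file. Only the node-exporter is shown (its image tag is the one from the service output above), the other swarmprom services reference netx the same way, and the network definition simply repeats the block above so the snippet is a valid compose file on its own:
version: "3.3"
services:
  node-exporter:
    image: stefanprodan/swarmprom-node-exporter:v0.15.2
    networks:
      - netx
  # prometheus, grafana, alertmanager, caddy, ... attach to netx the same way

networks:
  netx:
    driver: overlay
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: 10.0.20.0/24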
I'm re-opening the issue as I see the same behavior in another environment.
This test env is running Docker for AWS (17.12.1-ce).
After several starts/stops of the swarmprom stack, the Caddy container is hanging and cannot be removed, which prevents the network from being removed as well.
Has anyone else noticed the same behavior?
I am facing the same issue as well.