Only see 2 nodes out of the 3 masters
lucj opened this issue
I have deployed swarmprom on a 3-node cluster on Docker for AWS. All nodes are masters and are running fine, but only 2 nodes are listed in Grafana; a couple of my app stacks are also missing.
All the swarmprom services seem to run fine though.
Any hints?
btw, thanks a lot, really great project! 👍
Hi @lucj, can you please check the Prometheus UI and see if that node is reachable?
In my current config, I have an nginx TLS termination in front of Caddy where only port 3000 is proxied. If I proxy the other ports (9090, 9093, 9094), I cannot get the whole stack to work correctly.
I don't know if this is related, but I see several errors like the following in Caddy:
"[ERROR 502 /] dial tcp: lookup grafana on 127.0.0.11:53: no such host"
OK, I activated SSL in Caddy instead of running another nginx in front.
The Prometheus console only shows 2 nodes, but Swarm has 3 nodes and each one is working fine:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ugk...909luj * xxx.compute.internal Ready Active Reachable 17.12.1-ce
ms7...yobt1n yyy.compute.internal Ready Active Leader 17.12.1-ce
mcd...lb6zpq zzz.compute.internal Ready Active Reachable 17.12.1-ce
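For reference, one way to cross-check which node-exporter instances Prometheus is actually scraping is its targets API. This is only a sketch: the published port (9090) and the job name node-exporter are assumptions, and swarmprom's Caddy usually protects that port with basic auth, so you may need curl -u user:password. It also requires curl and jq on the machine you run it from.
~ $ curl -s http://<manager-ip>:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(.labels.job == "node-exporter") | "\(.labels.instance) \(.health)"'
If only 2 instances show up as "up" here, the missing node is not being scraped at all rather than being dropped by Grafana.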
Definitely not linked to swarmprom.
The problem seems to be a connectivity issue in the mon_net network.
I guess it got into a bad state after too many starts and stops.
@stefanprodan just for the record: the reason why the 3rd node was not seen is that the mon_net network remained on this particular node after the stack was removed (this is a known bug, moby/moby#35204).
When the stack was started again, the mon_net network was recreated with the same subnet, which prevented the node-exporter from attaching to the network since the subnet was already in use:
c61dob322u72hskjyk5nn0dzg mon_node-exporter.ms71f27l8jwtqexhr53yobt1n stefanprodan/swarmprom-node-exporter:v0.15.2@sha256:0575845ee924fa91138804663a12207ed53a56542d257273ffb9b30e22b78cd1 ip-172-31-27-219.eu-west-1.compute.internal Ready Rejected 3 seconds ago "failed to allocate gateway (10.0.0.1): Address already in use"
zxv4pi94uii1gmfok9c9b28mp \_ mon_node-exporter.ms71f27l8jwtqexhr53yobt1n
...
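A quick way to confirm the leftover network and its subnet on the affected node is shown below. This is just a sketch; it assumes SSH access to that node and that the stale network kept the stack's default name mon_net:
~ $ docker network ls --filter name=mon_net
~ $ docker network inspect mon_net --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
If the subnet printed there is the same one the freshly created mon_net tries to allocate, you end up with exactly the "Address already in use" rejection above.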
The thing is, I cannot remove the network from this particular node: a container is still connected to it and, for an unknown reason, refuses to be removed as well (!).
~ $ docker network rm mon_net
Error response from daemon: network mon_net id 1d34qanxyqqxaytteudtd1pfj has active endpoints
~ $ docker network inspect mon_net
...
"Containers": {
"80c0fbb70da2342b82367897ddd1028593f70a87b2b5929804b4ee527c10b12f": {
"Name": "mon_caddy.1.vrm2namdg5vc2mk9fqcdbk1xs",
"EndpointID": "c3d1aeb8b3089a51e0ec3c05703b7c6d81e2a9f95238a6aa1291f38e1c84bdb3",
"MacAddress": "02:42:0a:00:00:16",
"IPv4Address": "10.0.0.22/24",
"IPv6Address": ""
}
},
...
~ $ docker container rm -f 80c0fbb70da2342b
=> hanging....
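A workaround that sometimes clears such a stale endpoint (just a sketch, not guaranteed to work on every engine version) is to force-disconnect the container from the network before removing it:
~ $ docker network disconnect --force mon_net mon_caddy.1.vrm2namdg5vc2mk9fqcdbk1xs
~ $ docker network rm mon_net
If even the forced disconnect hangs, restarting the Docker daemon on that node is usually the only remaining option.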
I then specified another network with its own subnet and it's working fine:
networks:
  netx:
    driver: overlay
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: 10.0.20.0/24
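For completeness, a minimal sketch of how the services then attach to that network in the stack file. Only the node-exporter is shown (its image tag is the one from the service output above), the other swarmprom services reference netx the same way, and the network definition simply repeats the block above so the snippet is a valid compose file on its own:
version: "3.3"
services:
  node-exporter:
    image: stefanprodan/swarmprom-node-exporter:v0.15.2
    networks:
      - netx
  # prometheus, grafana, alertmanager, caddy, ... attach to netx the same way

networks:
  netx:
    driver: overlay
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: 10.0.20.0/24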
I'm re-opening the issue as I see the same behavior in another environment.
This test env is running Docker for AWS (17.12.1-ce).
After several starts/stops of the swarmprom stack, the Caddy container is hanging and cannot be removed, which prevents the network from being removed as well.
Has anyone else noticed the same behavior?
I am facing the same issue as well.