Swarm overlay network does not route IP addresses (without userns-remap) after restarting nodes
umegaya opened this issue · comments
Description
I run a 3-node swarm with 3 managers on AWS; each node was created by docker-machine (ami-87b917e4).
After restarting the nodes, some containers cannot communicate with each other, either by IP address or by service name.
Steps to reproduce the issue:
- create network
docker network create --driver overlay --subnet 10.0.150.0/24 prod-nw
- create 4 backend services and 1 frontend service; the frontend runs in global mode. Note that each service has at least one publish setting (I omit the fluent logger settings for simplicity)
docker service create --name backend-1 --replicas 1 --with-registry-auth --network prod-nw --publish 8200:8082 $(backend-1-image)
docker service create --name backend-2 --replicas 1 --with-registry-auth --network prod-nw --publish 8201:8082 $(backend-2-image)
docker service create --name backend-3 --replicas 1 --with-registry-auth --network prod-nw --publish 8100:8082 $(backend-3-image)
docker service create --name backend-4 --replicas 1 --with-registry-auth --network prod-nw --publish 8101:8082 $(backend-4-image)
docker service create --name frontend --mode global --publish mode=host,published=50051,target=50051 --with-registry-auth --network prod-nw --publish mode=host,published=8082,target=8082 $(frontend-image)
- after restarting the nodes, try to connect to the other services via the DNS entry/VIP
Describe the results you received:
- each container had the following IP on prod-nw:
backend-1: 10.0.150.30
backend-2: 10.0.150.12
backend-3: 10.0.150.32
backend-4: 10.0.150.17
frontend-1: 10.0.150.15
frontend-2: 10.0.150.4
frontend-3: 10.0.150.9
- most connectivity works well, except:
frontend-1 <-> backend-4
frontend-2 <-> backend-2
backend-2 -> frontend-3 (weird, because connection from frontend-3 to backend-2 seems to be established)
- where connectivity is lost, connecting fails even by direct IP, with the following errors:
- No route to host at 10.0.150.12 (backend-2 -> frontend-3)
$ telnet 10.0.150.9 50051
Trying 10.0.150.9...
telnet: Unable to connect to remote host: No route to host
$ netstat -an | grep ESTABLISHED # report connection established
tcp 0 0 10.0.150.12:50051 10.0.150.9:53242 ESTABLISHED
tcp 0 0 10.0.150.12:50051 10.0.150.15:55472 ESTABLISHED
- Connection timed out at 10.0.150.17 (backend-4 -> frontend-1)
$ telnet 10.0.150.15 50051
Trying 10.0.150.15...
telnet: Unable to connect to remote host: Connection timed out
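For anyone trying to reproduce this, the per-pair checks above can be scripted instead of running telnet by hand. The following is only a rough sketch: the function names `probe` and `check_peers` are made up for illustration, and it assumes the container image ships bash and the coreutils `timeout` command. It reports whether a TCP connection can be opened; it does not distinguish "No route to host" from "Connection timed out".

```shell
# probe HOST PORT [SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
# Relies on bash's /dev/tcp pseudo-device; timeout(1) bounds the wait.
probe() {
  local host="$1"
  local port="$2"
  local secs="${3:-3}"
  timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# check_peers PORT HOST...
# Prints one "host:port ok" or "host:port FAIL" line per peer.
check_peers() {
  local port="$1"
  shift
  local host
  for host in "$@"; do
    if probe "$host" "$port"; then
      echo "$host:$port ok"
    else
      echo "$host:$port FAIL"
    fi
  done
}

# Example, run inside a container attached to prod-nw
# (IPs and port taken from the report above):
# check_peers 50051 10.0.150.15 10.0.150.4 10.0.150.9
```

Running it from every container gives a full matrix of which directions are broken, which makes asymmetric cases like backend-2 -> frontend-3 easier to spot.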
Describe the results you expected:
I expected to be able to connect to the service using the VIP created for the service and route accordingly.
Additional information you deem important (e.g. issue happens only occasionally):
It's similar to #26106, but there are a few differences, so it was suggested that I open a new issue:
- I am using AWS docker instances created by docker-machine (Ubuntu 16.04 LTS)
- I do not explicitly set userns-remap (I'm not sure whether it is set implicitly)
- connecting fails not only by container name but also by direct IP (No route to host)
Output of docker version:
Client:
Version: 17.03.1-ce
API version: 1.27
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 17:14:09 2017
OS/Arch: linux/amd64
Server:
Version: 17.05.0-ce
API version: 1.29 (minimum version 1.12)
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:10:54 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info:
Containers: 55
Running: 7
Paused: 0
Stopped: 48
Images: 74
Server Version: 17.05.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 281
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: kcpanuat85bztrvktep186fg8
Is Manager: true
ClusterID: qclswzn5foalbgmlkhh2e95i6
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 172.32.11.239
Manager Addresses:
172.32.11.239:2377
172.32.11.40:2377
172.32.2.28:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-79-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: swarm-master
ID: YAOV:4AKS:YOJL:GKDF:HHTV:XW24:ZMOI:M7HU:7T2Q:E5PZ:5KW4:45FI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
provider=amazonec2
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS, 3-node swarm, 3 managers, each node created by docker-machine (ami-87b917e4)
Update: added traceroute output for each connection error case.
no route to host case
root@93a8e0c8f6af:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
Connected to 10.0.150.32.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@93a8e0c8f6af:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 10.0.150.32 0.002ms 0.001ms 0.001ms
root@93a8e0c8f6af:/# telnet 10.0.150.17 50051
Trying 10.0.150.17...
telnet: Unable to connect to remote host: No route to host
root@93a8e0c8f6af:/# traceroute 10.0.150.17
traceroute to 10.0.150.17 (10.0.150.17), 64 hops max
1 10.0.150.15 2998.070ms !H * 0.805ms !H
root@93a8e0c8f6af:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
root@93a8e0c8f6af:/#
connection timeout case
root@b2c68c626491:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
telnet: Unable to connect to remote host: Connection timed out
root@b2c68c626491:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 * * *
2 * * *
3 * * *
4 * * *
5 * ^C
root@b2c68c626491:/# telnet 10.0.150.30 50051
Trying 10.0.150.30...
Connected to 10.0.150.30.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@b2c68c626491:/# traceroute 10.0.150.30
traceroute to 10.0.150.30 (10.0.150.30), 64 hops max
1 10.0.150.30 0.002ms 0.001ms 0.002ms
root@b2c68c626491:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
More update:
Restarting the nodes one by one seems to solve this problem (re-creating the services does not). The first time, I restarted the nodes concurrently, with something like docker-machine start/regenerate-certs node1 node2 node3..., so this issue may be related to swarm cluster initialization.
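A sequential restart along those lines could look roughly like this. It is only a sketch under some assumptions: `rolling_restart` and `wait_active` are hypothetical helper names, the node names are placeholders, `MACHINE_CMD` and `POLL_SECONDS` are made-up knobs so the loop can be dry-run, and it assumes the machine's SSH user is allowed to run docker on the node.

```shell
# Command used to reach the nodes; overridable for dry runs.
MACHINE_CMD=${MACHINE_CMD:-docker-machine}

# wait_active NODE [TRIES]
# Polls the node's engine until it reports swarm state "active".
# Returns 1 if the node has not become active after TRIES polls.
wait_active() {
  local node="$1"
  local tries="${2:-30}"
  local state
  while [ "$tries" -gt 0 ]; do
    state=$($MACHINE_CMD ssh "$node" docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)
    [ "$state" = "active" ] && return 0
    tries=$((tries - 1))
    sleep "${POLL_SECONDS:-5}"
  done
  return 1
}

# rolling_restart NODE...
# Restarts one node at a time and waits for it to rejoin before the next.
rolling_restart() {
  local node
  for node in "$@"; do
    echo "restarting $node"
    $MACHINE_CMD restart "$node" || return 1
    if ! wait_active "$node"; then
      echo "$node did not become active" >&2
      return 1
    fi
  done
}

# Example (placeholder node names):
# rolling_restart node1 node2 node3
```

Waiting for each node to report "active" only checks that it has rejoined the swarm; it says nothing about whether the overlay routes are healthy, so the connectivity checks still need to be repeated afterwards.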
More update 2:
Sorry, but after updating the services several times, it seems to happen again.
ping @sanimej
@umegaya Can you try the 17.06 CE version? Before looking into the details of the issue, let's confirm whether it's still seen with the latest release.
EDIT: OK, I understand: 17.05 still accepted the docker daemon command. I had to edit the start script, but the upgrade works now and the issue no longer occurs. Given the nature of this problem, though, it may come back after a service update; I will report here if it does.
@sanimej I tried to upgrade the existing machines with docker-machine upgrade node1 node2 node3
(because I had no problem on brand-new machines),
then got this error:
Waiting for SSH to be available...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Detecting the provisioner...
Waiting for SSH to be available...
Detecting the provisioner...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Installing Docker...
Installing Docker...
Installing Docker...
error installing docker:
error installing docker:
error installing docker:
The syslog on one of the nodes says:
Jul 20 23:51:42 mgo-prod-m systemd[1]: Starting Docker Application Container Engine...
Jul 20 23:51:42 mgo-prod-m docker[17401]: `docker daemon` is not supported on Linux. Please run `dockerd` directly
Jul 20 23:51:42 mgo-prod-m systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
The second line already appeared before the docker-machine upgrade, and dockerd seemed to run in that case.
Does this mean some of docker's persistent state is already broken, or is it a different problem?
`docker daemon` is not supported on Linux. Please run `dockerd` directly
That's caused by a combination of issues. The first is that the docker binary in 17.06 does not have the daemon subcommand (it's deprecated, but should still work in 17.06; this will be fixed in 17.06.1). The second is that docker-machine created a systemd override file with the wrong command for the version of docker that's used. Check this file on those machines: /etc/systemd/system/docker.service.d/10-machine.conf. Change docker daemon to dockerd, then run systemctl daemon-reload and systemctl restart docker.service.
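If there are several machines to fix, the edit can be scripted. A minimal sketch, assuming the override file contains an ExecStart line that invokes docker daemon (the exact contents written by docker-machine may differ, so inspect the file first); `fix_dropin` is a hypothetical helper name.

```shell
# fix_dropin FILE
# Rewrites "docker daemon" to "dockerd" on ExecStart lines only,
# keeping the original as FILE.bak.
fix_dropin() {
  local file="$1"
  sed -i.bak '/^ExecStart=/ s/docker daemon/dockerd/' "$file"
}

# Example, run as root on each affected node:
# fix_dropin /etc/systemd/system/docker.service.d/10-machine.conf
# systemctl daemon-reload
# systemctl restart docker.service
```

Since the substitution only replaces the two words, an absolute path such as /usr/bin/docker daemon becomes /usr/bin/dockerd, and the remaining flags on the line are left untouched.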