moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems

Home Page: https://mobyproject.org/


RHEL7/CentOS7 cannot reach another container published network service

jcberthon opened this issue · comments

Description

On CentOS 7 and RHEL 7 (and possibly any Linux OS using firewalld) we have the following problem: when at least 2 containers are running on the same host, one container cannot access the network services published by the other when using the "external" IP address or hostname. The error message returned is "host unreachable" or "no route to host".

Note that sometimes, even when setting a "hostname" for a Docker container (via the --hostname option) and being able to ping that hostname from another container (the hostname is then resolved to the Docker-internal IP), it might still not work, because some applications (e.g. gitlab-runner) resolve the given hostname using the external DNS resolver and not the one of the Docker network. Weird but true.

Someone already reported the problem (#24370) but did not provide enough information, and thus the issue was closed. I have all the necessary information, and I can provide more on demand.

Steps to reproduce the issue:

I have found a series of steps that anyone can easily follow to reproduce the problem. It assumes that in your home directory you have an html-pub folder containing a static index.html file (mkdir ~/html-pub, then download a simple static HTML file from the internet and put it in that folder). All commands are run on the host where Docker 17.03 is running.

It is also assumed that the IP address of the host is 192.168.1.2.

  1. docker run --name nginx --detach -p 192.168.1.2:80:80 -v ~/html-pub:/usr/share/nginx/html:ro nginx:stable-alpine
  2. docker run --rm -it alpine:3.5 wget http://192.168.1.2/
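
Optionally, before step 2, you can verify from the host itself that nginx answers on the published address (this is only a sanity check, assuming curl is available on the host):

curl -I http://192.168.1.2/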

Describe the results you received:

On CentOS 7 with firewalld installed I receive this:

Connecting to 192.168.1.2 (192.168.1.2:80)
wget: can't connect to remote host (192.168.1.2): Host is unreachable

Describe the results you expected:

On Ubuntu without firewalld (but still with a firewall), I get this:

Connecting to 192.168.1.2 (192.168.1.2:80)
index.html           100% |*******************************|  3700   0:00:00 ETA

Additional information you deem important (e.g. issue happens only occasionally):

On CentOS 7, doing the following solved the problem. But I would expect the docker run command to take care of those extra steps, as I used the -p flag.

sudo firewall-cmd --zone=trusted --add-interface=docker0
sudo firewall-cmd --zone=public --add-port=80/tcp

Note: The above commands are for testing. If one wants them to be permanent, one needs to add the --permanent flag to both commands and then execute sudo firewall-cmd --reload.

Update 20170330: actually only the second command is enough; adding docker0 to the trusted zone has no effect.
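
Putting the note and the update together, the permanent form of the working fix would be (a sketch for the nginx example on port 80):

sudo firewall-cmd --permanent --zone=public --add-port=80/tcp
sudo firewall-cmd --reload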

The above is a dummy example. A real-life test case where this fails us is running the GitLab and GitLab Runner containers on the same host. We had to use a different hostname for the docker run command than the real hostname users use to access our own internal GitLab instance, in order for the gitlab-runner to register successfully. But then, when trying to use that runner, it cannot clone the repository: GitLab provides the "external" FQDN for the repository the runner should clone, and the runner fails before even starting the job because the host is unreachable. The nginx example is therefore relevant and a much easier way of demonstrating the issue.

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:10:07 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   3a232c8
 Built:        Tue Feb 28 08:10:07 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 28
 Running: 10
 Paused: 0
 Stopped: 18
Images: 300
Server Version: 17.03.0-ce
Storage Driver: devicemapper
 Pool Name: vg_spc-thpl_docker
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 23.87 GB
 Data Space Total: 1.44 TB
 Data Space Available: 1.416 TB
 Metadata Space Used: 12.45 MB
 Metadata Space Total: 16.98 GB
 Metadata Space Available: 16.97 GB
 Thin Pool Minimum Free Space: 144 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 32 GiB
Name: *******************
ID: ****************************
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

The host runs CentOS 7.3 on bare-metal (so physical). But I have also reproduced it inside a VM (KVM) during the investigation.

I also tried the above (and got the expected result) on a Ubuntu 16.04 LTS running the 4.8 HWE kernel, this is also a bare-metal machine, x86_64 too but with only 2 CPUs and 8GiB RAM, the storage driver is btrfs.

Was firewalld restarted after the docker daemon / docker service was started? IIRC, there are issues with firewalld wiping Docker's iptables rules if it is (re)started after the docker service was started.

I forgot to give credit to the person who solved the problem: Nena on StackOverflow.

Hi @thaJeztah

Nope, firewalld was not restarted. Between step 1 and step 2, nothing is done on the machine. If needed, we can add a step 0: systemctl restart docker.

Oh, sorry for the confusion; I meant the docker daemon; i.e. if you systemctl restart docker to restart the docker service (note; this will stop all containers without a restart policy set), does it work?

The docker daemon creates certain iptables rules when the service is started; if firewalld is (re)started, it does not take those rules into account, and wipes them.
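
One way to check whether that has happened is to list Docker's own chains; if firewalld wiped them, the DOCKER chains will be missing or empty (restarting the docker service recreates them):

sudo iptables -S DOCKER          # filter rules for published ports
sudo iptables -t nat -S DOCKER   # DNAT rules created by -p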

No problem, thanks for helping :-)

I know I cannot restart Docker on the main server, because it is configured to restart all containers (we are running CentOS 7.3, so we have this bug where we need to set mounts as private in the systemd unit in order to avoid leaking LVM mounts, but this implies that we cannot use live restore).

However, I can do that on my KVM instance (virtual machine), on which nothing important is running. On this VM, I'm also running CentOS 7.3 with all updates applied (as of this weekend); the setup is similar to the other host where we run Docker 17.03.0-ce. The only difference is that the storage driver is overlay instead of devicemapper (LVM).

So when I do this after a clean reboot:

  1. systemctl restart docker
  2. docker run --name nginx --detach -p 192.168.1.2:80:80 -v ~/html-pub:/usr/share/nginx/html:ro nginx:stable-alpine
  3. docker run --rm -it alpine:3.5 wget http://192.168.1.2/

This fails with host unreachable as reported. And this is the same error I see in production on the other host.

Thanks for taking the time to try that. I just tried to reproduce on a fresh CentOS 7.3 droplet on DigitalOcean, but was not able to reproduce 😢.

To exclude possibilities;

  • Is the daemon started with custom options (particularly, e.g. --iptables=false, or a custom -b / --bridge)?
  • Do you see the same if the nginx container is started without a custom webroot (-v), and its ports published without specifying an IP address (simply -p 80:80)?
  • Is the NGINX container reachable externally (so not from inside a container, but from an external browser contacting the IP address)?

So I did the same as you: I created a CentOS 7 droplet, then did a distro-sync and made sure firewalld was enabled before rebooting:

yum distro-sync
systemctl enable firewalld
reboot

Then I installed Docker using these instructions: https://store.docker.com/editions/community/docker-ce-server-centos?tab=description

Then I created the following container:

docker run --name nginx --detach -p 80:80 nginx:stable-alpine

Using the public IP of my droplet, I was able to verify that I can access the HTTP page using my web browser (it displays the "Welcome to nginx" static page, so Docker configured the firewall correctly, cool!).

Now I create another container and pass it my droplet's public IP (referred to below as <droplet-public-IP>; it is the IPv4 address which DigitalOcean displays in their Droplets web UI):

docker run --rm -it alpine:3.5 wget http://<droplet-public-IP>/

And I get the host is unreachable message.

My guess why you could not reproduce it: CentOS 7 droplets have firewalld disabled by default; you need to activate it.
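
To check and activate it (the standard systemd commands; this matches what I did on my droplet above):

sudo systemctl status firewalld   # check whether it is running
sudo systemctl enable firewalld
sudo systemctl start firewalld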

Hi @thaJeztah

I realised that my previous post was not clear enough, so regarding your 3 points:

  • No, no custom options are used. On prod the daemon.json is empty, and on my VM I had overlay defined as the graph driver, but that's it.
  • Yes, I see the same. In my previous post, where I reproduced the problem on a droplet, I used your suggestion.
  • Yes, the nginx container (and on our prod the GitLab container) is accessible externally via a web browser. In my reproduction test on a droplet, I was also able to display the static HTML file from my local laptop.

I hope that given the instructions in my previous post, you can also reproduce it on your end.

Thanks for your support btw!

Hi @thaJeztah

Did you manage to reproduce it using my updated instructions?

Hello @thaJeztah

I did some further investigation.

I did a new test using an Ubuntu droplet. I set it up like I did for the CentOS one, but of course using apt instead of yum, etc. To make the test more relevant, I activated ufw before installing Docker:

# ufw allow 22
# ufw enable

Then I installed Docker and ran the same docker containers (as described for CentOS). But instead of getting host unreachable right away, after a long period (ca. a minute) I got a timeout. So here again the firewall is blocking inter-container communication when using a public IP.

To solve that, on CentOS one can use the command from my original issue post, or refer to https://serverfault.com/questions/684602/how-to-open-port-for-a-specific-ip-address-with-firewall-cmd-on-centos if one wants to restrict which source IPs can connect to the opened port (e.g. giving the Docker network IP range as source), e.g.:

# firewall-cmd --zone=public --add-rich-rule='
  rule family="ipv4"
  source address="172.18.0.1/24"
  port protocol="tcp" port="80" accept'

Or you can do the following (it is similar to the above):

# firewall-cmd --permanent --new-zone=special
# firewall-cmd --reload
# firewall-cmd --zone=special --add-source=172.18.0.1/24
# firewall-cmd --zone=special --add-port=80/tcp

On Ubuntu do:

# ufw allow proto tcp from 172.18.0.1/24 to any port 80

So it really depends on how the firewall is configured in the first place. Perhaps Docker could make sure that containers can connect to each other when using the external IP address, by creating these rules automatically.
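
Just to illustrate what such an automatic rule could look like in plain iptables terms (this is only a sketch of a possible implementation, not what Docker currently does; the subnet and port are the ones from the examples above):

# sketch: accept forwarded traffic from the Docker subnet to the published port
iptables -I FORWARD -s 172.18.0.1/24 -p tcp --dport 80 -j ACCEPT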

Hello @thaJeztah

Any news on this front? How can I help further on this topic?

Any news on this?

Hi @whgibbo

I haven't received any news, and I'm still using the proposed workaround.


Problem still exists for docker version 17.09.0-ce in CentOS 7.

The basic idea is: containers should be able to access published ports just as other hosts on the Internet do. It's unreasonable to block containers from accessing public ports.


@jcberthon I'm trying to propose a PR to resolve this issue, since I think adding the docker0 interface to the trusted zone of firewalld is not a good solution.

  1. Could you provide the output of iptables-save from a host with the docker0 interface added to the trusted zone of firewalld?

Also,

Update 20170330: actually only the second command is enough, adding docker0 to the trusted zone has no effect.

  2. Are you sure about that? Could you check whether you already have docker0 added to the trusted zone with firewall-cmd --get-zone-of-interface=docker0?

Hi @vizv

Sorry for the long delay. I haven't had a lot of free time these last 2 weeks and spent more time with my family.
It's possible that next week I'll have some time to answer you.

I will create a droplet again with the configuration as I described it and send you the information.

I have the same problem; I tried multiple workarounds, and nothing has worked yet.


@vedmant Have you tried applying this patch moby/libnetwork#1963 and recompiling Docker? Or you can manually fix the iptables rules.

@vizv I wanted to stick to the official build to be able to update it easily and regularly. All I need is to close all public ports on the machine except a few like ssh, http, and https, but keep Docker containers able to connect to a database that runs on the host machine. Is there some other possible way, short of compiling Docker manually?
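
For reference, a firewalld-only sketch of what I'm after, along the lines of the rich-rule workaround earlier in this thread (the database port 3306 and the Docker subnet 172.17.0.0/16 are assumptions; adjust them to your host):

# expose only ssh/http/https publicly
sudo firewall-cmd --permanent --zone=public --add-service=ssh
sudo firewall-cmd --permanent --zone=public --add-service=http
sudo firewall-cmd --permanent --zone=public --add-service=https
# let containers reach the database on the host (port and subnet are assumptions)
sudo firewall-cmd --permanent --zone=public --add-rich-rule='rule family="ipv4" source address="172.17.0.0/16" port protocol="tcp" port="3306" accept'
sudo firewall-cmd --reload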

commented

@vedmant as I mentioned, you can delete the related entries in iptables.

Hey guys!
I faced the same problem using SIP; to solve it I used the following commands:

sudo firewall-cmd --add-service=sip --permanent
sudo firewall-cmd --reload

In your case @jcberthon, try this:

sudo firewall-cmd --add-service=http --permanent
sudo firewall-cmd --reload

@eltonplima you saved my day, I spent 6h trying to fix it! THANKS!

But in my case I needed to use:
sudo firewall-cmd --add-service=https --permanent
sudo firewall-cmd --reload

@eltonplima thank you for the hint. It is similar to my own workaround, see #32138 (comment); the example I gave there was to limit access to the service, but of course you can open it fully.

Hi there!

I had the same issue, but I resolved it by adding this:

firewall-cmd --zone=public --add-masquerade --permanent
firewall-cmd --reload
systemctl restart docker

If that does not work, try this:


firewall-cmd --permanent --zone=trusted --add-interface=docker0
firewall-cmd --permanent --zone=trusted --add-port=4243/tcp

firewall-cmd --reload
systemctl restart docker

Hope this helps with your issue.

Any updates on this? I am experiencing this problem on a system that doesn't use firewalld but only plain iptables. This really is a blocker.

I have a mail server and several other services running on a host. The mail server runs in a different docker bridge network than the other services, but publicly exposes its ports. The other services should be able to access the mail server using the publicly exposed ports, but no connection can be made.
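
In case it helps on a plain-iptables host: since Docker 17.06 there is a DOCKER-USER chain meant for user-managed rules that survive daemon restarts. A minimal sketch, assuming the other services live on a 172.18.0.0/16 bridge subnet (the subnet is an assumption, and whether this alone suffices depends on how your iptables policy blocks the traffic):

# allow forwarded traffic from the other services' bridge subnet (subnet is an assumption)
sudo iptables -I DOCKER-USER -s 172.18.0.0/16 -j ACCEPT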