subspacecommunity / subspace

A community-maintained fork of the simple WireGuard VPN server GUI

DNS Issue, requires container restart every hour

BrokenRunner opened this issue · comments

Host: Ubuntu 20.04
Reverse SSL Proxy: Apache

Describe the bug:
At random intervals, DNS stops resolving for clients over the WireGuard tunnel. Non-domain-joined Windows clients also appear not to get the search domain from the host; on Unix-based clients, both the search domain and DNS work fine. All Windows clients are using the WireGuard application 0.1.1 (latest at the time of writing).
I'm using a local nameserver in my docker-compose.

Docker Compose:

version: "3.3"
services:
  subspace:
   image: subspacecommunity/subspace:latest
   container_name: subspace
   volumes:
    - /usr/bin/wg:/usr/bin/wg
    - /opt/subspace/data:/data
    - /lib/x86_64-linux-gnu/libc.so.6:/lib/x86_64-linux-gnu/libc.so.6:ro
    - /lib64/ld-linux-x86-64.so.2:/lib64/ld-linux-x86-64.so.2:ro
   restart: always
   environment:
    - SUBSPACE_HTTP_HOST=xxxxx.xxxxxxx.com
    - SUBSPACE_LETSENCRYPT=false
    - SUBSPACE_HTTP_INSECURE=true
    - SUBSPACE_HTTP_ADDR=":8333"
    - SUBSPACE_NAMESERVER=172.16.10.11
    - SUBSPACE_LISTENPORT=51820
    - SUBSPACE_IPV4_POOL=10.99.97.0/24
    - SUBSPACE_IPV6_POOL=fd00::10:97:0/64
    - SUBSPACE_IPV4_GW=10.99.97.1
    - SUBSPACE_IPV6_GW=fd00::10:97:1
    - SUBSPACE_IPV6_NAT_ENABLED=0
   cap_add:
    - NET_ADMIN
   network_mode: "host"

To Reproduce:
Steps to reproduce the behavior:
1. Connect a client, kill switch disabled, otherwise all standard settings.
2. The client will connect and work for a period of time without issue.
3. Non-domain-joined Windows clients are only able to resolve hosts by FQDN; the search domain doesn't work.
4. After a somewhat random period of time, DNS stops resolving and clients lose access to internal resources. Pinging an internal IP continues to work, indicating the tunnel hasn't been dropped.
5. Some have complained of losing all local internet access as well, despite the kill switch being unticked in the client GUI. I haven't been able to confirm this.

To work around this issue, I have set up a cron script that runs hourly to restart the subspace container on the host. This has so far resolved the issue, but it is not desirable in the long term (or really even in the short term).
It also makes me think the issue is NOT related to the client GUI, as I initially suspected.
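
For reference, the hourly crontab entry looks something like this (the script path here is an assumption; the actual restart script it calls is shown further down the thread):

0 * * * * /opt/subspace-restart.sh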

Has anyone else experienced anything like this before?
All instructions have been followed as per the install document: systemd-resolved has been removed, and resolv.conf has been hardcoded with local information because Ubuntu kept emptying it.
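
For reference, this is roughly what the hardcoded file looks like; the nameserver is taken from the compose file above, and the search domain is a placeholder:

# /etc/resolv.conf
nameserver 172.16.10.11
search example.internal

# one common way to stop anything from rewriting it:
sudo chattr +i /etc/resolv.conf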

I should add that I can't see anything in the system logs or Docker logs on the host.

Seeing the same issue. Are you using the existing Dockerfile, or did you update to the latest Alpine?

Any ideas appreciated as restarting is far from ideal.

I haven't changed anything; whatever subspacecommunity/subspace:latest pulls is what I'm using.

What I can say is that further testing shows macOS also has the DNS search domain issue (it appears the search domain works ONLY for Linux-based peers).
Enabling the "Exclude Private IPs" tickbox in the macOS WireGuard client (v0.0.20191105) kills everything: no local DNS resolution, no access to remote resources, nothing. This is likely not Subspace but the macOS client.

How can I test the Alpine version? Sorry, noob question.

Try killing systemd-resolved on the host; you may have multiple DNS services listening on the same interface:
sudo netstat -tanpu | grep LISTEN | grep 53 will show you the listening services.

Any service other than the dnsmasq from the Docker instance should be killed (disabled) with sudo systemctl stop <service> and then sudo systemctl disable <service>.
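
For systemd-resolved, for example, that would be:

sudo systemctl stop systemd-resolved
sudo systemctl disable systemd-resolved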

Hope it helps

systemd-resolved was disabled as per the setup instructions. DNS actually works for a while and then stops. dnsmasq is still running, but no responses come through.

Yes, I can confirm the same: systemd-resolved is disabled and stopped.

This morning I started systemd-resolved and rebooted the server,
then stopped systemd-resolved and rebooted the server again.
I disabled my cron job that restarts the subspace Docker container every hour.

I asked several staff to test for me (~10am). The first came back to me at 12:45pm, stating they had lost internet access (DNS stopped working, is what they mean to say).

I ran the restart script the cron job calls:

#!/bin/bash
/usr/bin/docker restart subspace | /opt/timestamp.sh >> /opt/subspace/data/logs/cron.log

I checked with the staff member: they were working again instantly.
It's not the client, it's the container/image. Or should I say, it appears to be dnsmasq INSIDE the container. It has been this way since day one.

I think having the ability to remove dnsmasq and utilise something external to the container would be the best option; I'm not sure how easy that is.

For completeness:

sudo netstat -tanpu | grep LISTEN | grep 53
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      505201/dnsmasq
tcp6       0      0 :::53                   :::*                    LISTEN      505201/dnsmasq

Just to be sure: is your Ubuntu running on a bare-metal install, or inside a VM?

I had issues with network settings behind a NAT that was correctly configured in the VirtualBox hypervisor.
This led to exactly the same symptoms: DNS stopped working after a while, for no obvious reason.

Switching to bare metal solved multiple issues (I know it may seem obvious, but I had to give it a try :) ) and my setup is now working like a charm, modulo some rare DNS issues caused by unattended-upgrades waking up some unwanted services. My bad, though; nothing related to subspace or WireGuard.

BTW, it could be worth trying to disable (temporarily) unattended-upgrades on your host.
This is quite Debian/Ubuntu specific, and could be the cause of some... weirdnesses.
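
For example (stock systemd/apt commands, nothing subspace-specific; pick one approach):

sudo systemctl stop unattended-upgrades
sudo systemctl disable unattended-upgrades
# or switch it off via apt's periodic config:
echo 'APT::Periodic::Unattended-Upgrade "0";' | sudo tee /etc/apt/apt.conf.d/99-disable-unattended-upgrades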

Apologies for the late reply,

Ubuntu is inside VMware ESXi 6.7, behind a Fortigate firewall. I don't really have the desire to move it to bare metal, nor do I think it should be necessary. I think having the ability to utilise DNS outside the container would be the better option.

I have disabled unattended-upgrades.
Interestingly, after enabling and disabling systemd-resolved as above, I'm now able to utilise the search domain across clients.

Please see #144; this appears to be the solution after preliminary testing.

Spoke too soon; the issue remains.

Hi all, I've fixed this issue here:
#144 (comment)

I recently set up a subspace server for a smaller group of people (~12) and hit this same issue. I can share what I found and fixed to resolve it for myself, in the hope that it may help someone else. I set this up on a very small VPS since it uses practically no resources, and, depending on our usage/traffic on the VPN, DNS resolution would fail every couple of hours. I set up node_exporter and found the following:

[graph: node_exporter UDP metrics showing rmem (receive buffer) errors climbing]

As the metrics show, I was out of memory for UDP buffers and all UDP traffic was being dropped. Restarting the container clears this up, but only temporarily. The default rmem is measly, so I upped it to 25 MB and haven't had the issue return since.
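
If you don't run node_exporter, the same counter can be checked directly on the host; "receive buffer errors" climbing between two readings is the tell (netstat -su is from net-tools, and the raw RcvbufErrors column lives in /proc/net/snmp):

netstat -su | grep -i "receive buffer errors"
cat /proc/net/snmp | grep Udp: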

net.core.rmem_max=26214400
net.core.rmem_default=26214400
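
To apply these without a reboot and keep them across reboots (standard sysctl usage, nothing subspace-specific):

sudo sysctl -w net.core.rmem_max=26214400
sudo sysctl -w net.core.rmem_default=26214400
# persist by dropping the two lines above into /etc/sysctl.d/99-udp-buffers.conf, then:
sudo sysctl --system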

Hopefully this helps someone