Azure / iotedge

The IoT Edge OSS project

Intermittent DNS issue in Edge modules

bhjertaas opened this issue · comments

Expected Behavior

Edge modules (the Docker containers) should be able to make requests to online resources, such as the IoT Hub, without having to use IP addresses. They are expected to resolve domain names to IP addresses using the configured Cloudflare and Google DNS servers.

Current Behavior

For unknown reasons, DNS resolution stops working from time to time. Rebooting the device clears the problem.
While it is happening, communication with IoT Hub fails. DNS still works on the Windows host and in the Linux VM, but not in any of the modules.

Steps to Reproduce

  1. Ping from the host Windows OS (for example, from a PowerShell terminal): ping example.com
    This works fine
  2. Ping from the EFLOW Linux VM after running Connect-EflowVm: ping example.com. We also ran systemd-resolve example.com
    This works fine
  3. Ping from an Edge module. One way to do this is the following command: sudo docker exec edgeAgent ping example.com
    This fails
    Pinging an IP address works fine
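
The steps above separate name resolution from raw connectivity: a hostname fails while a literal IP address still works. A minimal Python sketch of the same check follows, assuming Python is available wherever it is run (the module images may not ship it). The function names, the probe address 1.1.1.1 and port 53 are illustrative choices, not something taken from the issue.

```python
import socket


def can_resolve(hostname):
    """Return True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def can_connect(ip, port=53, timeout=3.0):
    """Return True if a TCP connection to a literal IP succeeds.

    Using a literal IP keeps DNS out of the picture.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(resolves, connects):
    """Classify the outcome of the two probes above."""
    if resolves and connects:
        return "ok"
    if connects:
        # Matches the symptom in this issue: IPs reachable, names not.
        return "dns-failure"
    if resolves:
        return "probe-ip-unreachable"
    return "no-connectivity"
```

Running `diagnose(can_resolve("example.com"), can_connect("1.1.1.1"))` on the host, in the VM and inside a module would give one comparable result per layer.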

Context (Environment)

Output of iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
‼ aziot-identity-service package is up-to-date - Warning
    Installed aziot-identity-service package has version 1.4.6 but 1.4.7 is the latest stable version available.
    Please see https://aka.ms/aziot-update-runtime for update instructions.
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
‼ host can connect to and perform TLS handshake with iothub AMQP port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
‼ host can connect to and perform TLS handshake with iothub MQTT port - Warning
    Could not retrieve iothub_hostname from provisioning file.
    Please specify the backing IoT Hub name using --iothub-hostname switch if you have that information.
    Since no hostname is provided, all hub connectivity tests will be skipped.
√ host can connect to and perform TLS handshake with DPS endpoint - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
‼ aziot-edge package is up-to-date - Warning
    Installed IoT Edge daemon has version 1.4.20 but 1.4.27 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
25 check(s) succeeded.
6 check(s) raised warnings. Re-run with --verbose for more details.
7 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]: EFLOW 1.4.10.25103 on Windows 10 and 11
  • Architecture [e.g. amd64, arm32, arm64]: amd64
  • Container OS [e.g. Linux containers, Windows containers]: EFLOW (Mariner Linux)

Runtime Versions

  • aziot-edged [run iotedge version]: 1.4.20
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.4.25.82955152
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.4
  • Docker/Moby [run docker version]: 20.10.25

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Logs

aziot-edged logs

<Paste here between the triple backticks>

edge-agent logs

<Paste here between the triple backticks>

edge-hub logs

Error creating cloud connection for client E96B7EE5-4697-4F1D-894D-33C76360E3DC/moduleClientBridge
--> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry.

Additional Information

We do see the DNS warning in the iotedge check output, but DNS works on the device most of the time, including within modules. We tried adding the DNS setting to the daemon.json file as described in 'Solution to common issues', but it does not seem to help: we have still seen intermittent DNS failures.

While the issue was ongoing we ran the following commands, in case they are of interest.
From within a module
cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0

From Linux Mariner
cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad

In /etc/systemd/resolved.conf we see everything commented out, except LLMNR=false

resolvectl status lists the fallback DNS servers as "1.1.1.1#cloudflare-dns.com 8.8.8.8#dns.google 1.0.0.1#cloudflare-dns.com...."
but the same output lists only 172.19.224.1 under DNS Servers
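
Both resolv.conf files above point at loopback stubs: 127.0.0.11 is Docker's embedded DNS server and 127.0.0.53 is the systemd-resolved stub listener, so neither file shows which upstream server is actually queried. A small sketch for spotting this while collecting diagnostics, using only the Python standard library (the function names are illustrative):

```python
import ipaddress


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        parts = stripped.split()
        if parts[0] == "nameserver" and len(parts) >= 2:
            servers.append(parts[1])
    return servers


def is_loopback_stub(address):
    """Return True if the address is a loopback address such as 127.0.0.11."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        # Not a parseable IP address.
        return False
```

When every entry is a loopback stub, the real upstream has to be read elsewhere, for example from resolvectl status in the VM.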

We have now added
"dns": ["1.1.1.1", "8.8.8.8"]
to daemon.json files on all devices. Hopefully that will help, but as mentioned we have seen the reported issue on a device that had this configured already.
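
A hedged sketch of applying that change across devices without overwriting other settings already present in daemon.json. It works on the file content as text; the file location is not stated in this thread (Docker's default on Linux is /etc/docker/daemon.json, which is an assumption for EFLOW), and the function name is illustrative.

```python
import json


def with_dns_servers(daemon_json_text, servers):
    """Return daemon.json content with the "dns" key set, keeping other keys."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["dns"] = list(servers)
    return json.dumps(config, indent=4) + "\n"
```

The container engine typically needs a restart before the new setting applies, and only newly created containers pick it up.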
One more thing: it would make sense to mention the DNS config on the pre-production checklist page, not only in the troubleshooting guide.

@Azure/iotedge-eflow and @josephknierman can you please help take a look and/or transfer?

@bhjertaas, can you confirm whether this is limited to IoT Edge containers, or is it observed on other Docker-based containers as well?

Do you mean that it is limited to edgeAgent and edgeHub, not our other custom Edge modules?
If yes, then we saw it happening on all containers. I've simply used edgeAgent here as an example because that comes with ping already built in.
However, it may be that this issue was resolved by adding DNS servers to the daemon.json file (as explained in the comment above), because we have not experienced it since that was done.
If so, this should be added to your documentation, since it would show that adding a dns entry to daemon.json (or to each module's config in deployment.template.json) is required for a stable production scenario.
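
For the per-module alternative mentioned above, a minimal sketch of building a module's createOptions, assuming they follow the Docker container-create shape, where DNS servers are listed under HostConfig.Dns. The function names are illustrative, and whether the options go into the manifest as an object or as a JSON string depends on the tooling used.

```python
import json


def module_create_options(dns_servers, existing=None):
    """Return createOptions with HostConfig.Dns set, keeping other options."""
    options = dict(existing or {})
    # Copy HostConfig so the caller's dictionary is not modified.
    host_config = dict(options.get("HostConfig", {}))
    host_config["Dns"] = list(dns_servers)
    options["HostConfig"] = host_config
    return options


def as_manifest_string(options):
    """Serialize createOptions compactly, for manifests that expect a string."""
    return json.dumps(options, separators=(",", ":"))
```

For example, `as_manifest_string(module_create_options(["1.1.1.1", "8.8.8.8"]))` yields `{"HostConfig":{"Dns":["1.1.1.1","8.8.8.8"]}}`.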

@bhjertaas just to be sure: you added 1.1.1.1 to daemon.json as per the troubleshooting guide, but it didn't help. Once you also added 8.8.8.8, the issue was resolved? Did I get that right?

Not quite. We had no DNS config in daemon.json to start with, and adding "dns": ["1.1.1.1", "8.8.8.8"] seemed to help. We have not tried "dns": ["1.1.1.1"] on its own.

Thanks.

@PatAltimore would you mind helping improve our docs so that the pre-production checklist page (probably the networking section) also contains the information from the troubleshooting guide, specifically the "Set DNS server in container engine settings" part?

@jlian, since DNS was not configured in daemon.json, it should have reproduced 100% of the time, right? Can it be intermittent?

With no DNS config in daemon.json, the failure was not intermittent in the sense that request 1 failed and request 2 succeeded. It was intermittent over longer periods of time: once the problem occurred, every request failed until the PC was rebooted.
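
The distinction drawn here, sustained outages rather than per-request flakiness, can be made visible from periodic probe results. A minimal sketch, assuming some external probe records (timestamp, ok) samples in time order; the function name and sample format are hypothetical.

```python
def failure_periods(samples):
    """Group consecutive failed samples into (start, end, count) periods.

    samples: iterable of (timestamp, ok) pairs in time order.
    """
    periods = []
    start = end = None
    count = 0
    for timestamp, ok in samples:
        if ok:
            # A success closes any failure run in progress.
            if count:
                periods.append((start, end, count))
                count = 0
        else:
            if count == 0:
                start = timestamp
            end = timestamp
            count += 1
    # Record a run that is still open at the end of the data.
    if count:
        periods.append((start, end, count))
    return periods
```

A few long periods would match the behaviour described above, while many single-sample periods would point to per-request flakiness instead.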