Azure / iotedge

The IoT Edge OSS project

EdgeHub is unable to reauthenticate connected clients

MattCosturos opened this issue

Expected Behavior

Modules should remain running and connected to the IoT Hub.

Current Behavior

During the periodic task to reauthenticate connected clients, the connection fails

Steps to Reproduce

Unable to reproduce on demand. Failures are seemingly random: sometimes it fails on the first renewal (after 1 hour), sometimes on the second renewal (after 2 hours), and so on.

Context (Environment)

Output of iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
‼ host time is close to reference time - Warning
    Could not query NTP server
        caused by: Could not query NTP server
        caused by: could not receive NTP server response: Resource temporarily unavailable (os error 11)
        caused by: Resource temporarily unavailable (os error 11)
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
√ aziot-edge package is up-to-date - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
        caused by: Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
                   Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
                   You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
        caused by: Container engine is not configured to rotate module logs which may cause it run out of disk space.
                   Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
                   You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
√ Agent image is valid and can be pulled from upstream - OK
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
√ container on the default network can connect to upstream AMQP port - OK
√ container on the default network can connect to upstream HTTPS / WebSockets port - OK
√ container on the default network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
√ container on the IoT Edge module network can connect to upstream AMQP port - OK
√ container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to upstream MQTT port - OK
    skipping because of not required in this configuration
32 check(s) succeeded.
3 check(s) raised warnings.
2 check(s) were skipped due to errors from other checks.

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]:
  • Distributor ID: Ubuntu
  • Description: Ubuntu 20.04.6 LTS
  • Release: 20.04
  • Codename: focal
  • Architecture [e.g. amd64, arm32, arm64]: amd64
  • Container OS [e.g. Linux containers, Windows containers]: linux containers

Runtime Versions

  • aziot-edged [run iotedge version]: iotedge 1.4.20
  • Edge Agent [image tag (e.g. 1.0.0)]: Tag used is 1.4, latest image is being used
  • Edge Hub [image tag (e.g. 1.0.0)]: Tag used is 1.4, latest image is being used
  • Docker/Moby [run docker version]:
docker version

Client:
 Version:           20.10.18+azure-1
 API version:       1.41
 Go version:        go1.18.6
 Git commit:        b40c2f6b5deeb11ac6c485c940865ee40664f0f0
 Built:             Thu Sep  8 08:19:02 UTC 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          24.0.7-1
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff0aa93aa55880e1e5f8871c4fb69583426
  Built:            Thu Oct 26 07:51:05 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.13+azure-1
  GitCommit:        a17ec496a95e55601607ca50828147e8ccaeebf1
 runc:
  Version:          1.1.4
  GitCommit:        5fd4c4d144137e991c4acebb2146ab1483a97925
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Please see edgeHub.txt for the edge hub logs showing all the connection failures.

Recovery happens either when I restart the edgeHub module or seemingly at random, possibly when I run iotedge check. (Perhaps the IoT Hub connection check restores the network connection for edgeHub?)

edgeAgent.txt
azedged.txt
azidentity.txt
edgeHub.txt

@MattCosturos from the logs, it looks like there are connection issues, and because of that reauth cannot happen. Just to confirm: are you saying that it never recovers unless you restart the edgeHub module or run iotedge check, or that it does eventually recover on its own?

In fact, if it never recovers on its own, would you mind getting debug level logs if possible?

The iotedge check restoring connectivity is super weird and not expected... I would also be curious whether any network connectivity works on the device when you see this problem. Since you say you can restart EH, I assume you can get to the device. So basically, if you see it happening, don't run iotedge check; instead, verify with curl or something similar that the network is still okay, and then see if EH recovers on its own.
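
A minimal sketch of the kind of check suggested above, for anyone who wants to test connectivity from the host without running iotedge check. It attempts a plain TCP connection (no TLS handshake) to the standard IoT Hub ports for the protocols named in the check output (AMQP 5671, HTTPS / WebSockets 443, MQTT 8883) and reports success or the error for each. The function names (probe, probe_all), the script name, and the 5-second timeout are illustrative assumptions, not part of IoT Edge.

```python
import socket
import sys
import time

# Standard TLS ports IoT Hub uses for each protocol.
IOT_HUB_PORTS = {"amqp": 5671, "https": 443, "mqtt": 8883}


def probe(host, port, timeout=5.0):
    """Try one TCP connection and return (ok, detail)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
            return True, f"connected in {elapsed:.2f}s"
    except OSError as exc:
        # OSError covers timeouts, DNS failures, and refused connections.
        return False, f"{type(exc).__name__}: {exc}"


def probe_all(host, ports=IOT_HUB_PORTS, timeout=5.0):
    """Probe every port in `ports` and return {name: (ok, detail)}."""
    return {name: probe(host, port, timeout) for name, port in ports.items()}


# Only runs when a hostname is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for name, (ok, detail) in probe_all(sys.argv[1]).items():
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} {name:5s} {'OK' if ok else 'FAIL'} {detail}")
```

Saved under a hypothetical name such as probe.py, it could be run as `python3 probe.py <your-hub-hostname>` while edgeHub is in the failed state. A timeout or DNS error here would suggest the problem sits in the host network rather than in the edge runtime.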

Hello @nyanzebra

Now this is getting weird. The device was in a disconnected state (it had been disconnected for ~14 hours).
I attempted to run:
curl https://packages.microsoft.com/ubuntu/23.10/prod/pool/main/a/aspnetcore-runtime-6.0/aspnetcore-runtime-6.0_6.0.25-1_amd64.deb --output asp-runtime

It just hung (0 progress on the download). I tried several times.

Then I started logging packets with tcpdump and attempted the curl again. It suddenly worked, and edgeHub recovered on its own. I did nothing to restart the iotedge daemons or the system modules.

Everything I am debugging points to my network or network configuration being the cause of the issue.
I am going to close this issue. I just wanted someone to look at the logs and make sure there wasn't some edge runtime issue, but I am pretty sure every error in the logs stems from the initial connection issue.

@MattCosturos sounds good, feel free to reopen anytime if you have further questions or issues :)