Azure / iotedge

The IoT Edge OSS project

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

EdgeAgent does not restart crashed module if busy downloading new manifest 1.4.10

mathiash98 opened this issue · comments

Expected Behavior

EdgeAgent should ALWAYS keep all modules running regardless of edgeAgent operation.

Current Behavior

If EdgeAgent is busy downloading new version of modules it will not ensure existing modules are running.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Slow internet setup
  2. Deploy new version deployment manifest
  3. Ensure EdgeAgent struggles downloading new versions (100 second timeout) -> repeats for a long time
  4. Crash one of the existing modules with fatal exception or similar
  5. Observe that EdgeAgent does not restart the crashing module

Context (Environment)

Deployments operating on slow satellite internet connection with high ping.

Deployment manifest has:

  • "restartPolicy": "always",
  • "status": "running",
  • ModuleUpdateMode: WaitForAllPulls

Output of iotedge check

Click here
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
× config.yaml has correct hostname - Error
    config.yaml has hostname PC2FD48A but device reports hostname pc2fd48a.
    Hostname in config.yaml must either be identical to the device hostname or be a fully-qualified domain name that has the device hostname as the first component.
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
    The Edge device is using self-signed automatically-generated development certificates.
    They will expire in 204 days (at 2024-10-26 11:29:55 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
    After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
    Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: logs policy - OK
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

17 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
1 check(s) raised errors. Re-run with --verbose for more details.

Device Information

  • Host OS [e.g. Ubuntu 22.04, Windows Server IoT 2019]:
  • Architecture [e.g. amd64, arm32, arm64]:
  • Container OS [e.g. Linux containers, Windows containers]:

Runtime Versions

  • aziot-edged [run iotedge version]: 1.1.15
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.4.10
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.4.10
  • Docker/Moby [run docker version]: 20.10.20+azure-1

Logs

edge-agent logs

<6> 2024-04-08 07:39:46.633 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321.
<6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of stores
<6> 2024-04-08 07:42:30.361 +00:00 [INF] - Starting compaction of store Metrics
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store deploymentConfig
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store default
<6> 2024-04-08 07:42:30.362 +00:00 [INF] - Starting compaction of store moduleState
<6> 2024-04-08 07:43:13.374 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping Metrics
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2024-04-08 07:43:13.375 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2024-04-08 07:43:13.385 +00:00 [INF] - Storing Metrics
<6> 2024-04-08 07:43:13.389 +00:00 [INF] - Scraped and Stored Metrics
<6> 2024-04-08 07:43:13.389 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<4> 2024-04-08 07:43:48.781 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 07:53:53.787 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:03:58.789 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:14:03.794 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:24:08.798 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<6> 2024-04-08 08:28:48.987 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2024-04-08 08:28:51.738 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 36 and reported properties version 321.
<6> 2024-04-08 08:28:51.740 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<4> 2024-04-08 08:34:13.801 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping Metrics
<6> 2024-04-08 08:43:13.393 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2024-04-08 08:43:13.394 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2024-04-08 08:43:13.408 +00:00 [INF] - Storing Metrics
<6> 2024-04-08 08:43:13.414 +00:00 [INF] - Scraped and Stored Metrics
<6> 2024-04-08 08:43:13.414 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<4> 2024-04-08 08:44:18.807 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 08:54:23.813 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 09:04:28.820 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271
<4> 2024-04-08 09:14:33.823 +00:00 [WRN] - Agent reconcile concluded with errors.
System.TimeoutException: Operation timed out
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.<>c.<.cctor>b__27_0() in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 114
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter(Task task, TimeSpan timeout, Action action) in /mnt/vss/_work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 160
   at Microsoft.Azure.Devices.Edge.Agent.Service.Program.MainAsync(IConfiguration configuration) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Service/Program.cs:line 271

@jollyaakash can you help take a look at this?

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

Hey @mathiash98 thanks for reporting this. It looks like an existing problem where the pull operation happens synchronously and prevents EdgeAgent reacting to other module lifecycle events. I'll bring it up with the team to see how we can fix that. Meanwhile I can suggest a workaround (if applicable/possible) is to ssh to the device and manually pre-pull (docker pull) the images before applying the deployment. Also you can set imagePullPolicy: Never for that.

Just a side note about ModuleUpdateMode: WaitForAllPulls. It means that all images will be downloaded first, and then containers (re)started, but EA still makes pull synchronously.

Thanks for the information, we will try to keep an eye out for the installation with poor internet connection. Manually polling on 100 devices is too cumbersome, but we will create some dashboard that will show the devices that IotEdge is reporting as failing to mitigate this

@vadim-kovalyov our dashboard is helping us identifying the devices that need help. But it is an ongoing issue everytime we release new software for a week or two as these devices can have extremely slow internet for 1-2 weeks before the vessel is back in port.

Are there any progress on this issue?

@vadim-kovalyov any update on this issue?

Hi @mathiash98, glad to hear that you have the dashboard that's working as a potential workaround to identify the devices that need help with. @vadim-kovalyov and I discussed this issue briefly. Unfortunately, this is a very rare case and fixing it will require a lot of effort including rework on the edgeAgent logic. We don't have any plans to fix this at the moment and will put a work item in our backlog to revisit this in the future.

Closing this issue as discussed above.