zigpy / zigpy-xbee

A library which communicates with XBee radios for zigpy

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Watchdog failures

puddly opened this issue · comments

#170 added a watchdog command (VR) that is sent every 30s to ensure the radio is still alive. It seems that the XBee serial protocol can't handle this and the command times out, causing a restart. If I add an asyncio.Lock around _at_partial it doesn't seem to help either.

@Shulyaka Are you familiar with the serial protocol? Do you happen to know why this would be the case?

Below is my patchset to enable a send lock:

diff --git a/zigpy_xbee/api.py b/zigpy_xbee/api.py
index b4a73da..51ba61f 100644
--- a/zigpy_xbee/api.py
+++ b/zigpy_xbee/api.py
@@ -289,6 +289,7 @@ class XBee:
         self._cmd_mode_future: Optional[asyncio.Future] = None
         self._reset: asyncio.Event = asyncio.Event()
         self._running: asyncio.Event = asyncio.Event()
+        self._send_lock = asyncio.Lock()
 
     @property
     def reset_event(self):
@@ -353,12 +354,13 @@ class XBee:
         LOGGER.debug("Remote AT command: %s %s", name, args)
         data = t.serialize(args, (AT_COMMANDS[name],))
         try:
-            return await asyncio.wait_for(
-                self._command(
-                    "remote_at", ieee, nwk, options, name.encode("ascii"), data
-                ),
-                timeout=REMOTE_AT_COMMAND_TIMEOUT,
-            )
+            async with self._send_lock:
+                return await asyncio.wait_for(
+                    self._command(
+                        "remote_at", ieee, nwk, options, name.encode("ascii"), data
+                    ),
+                    timeout=REMOTE_AT_COMMAND_TIMEOUT,
+                )
         except asyncio.TimeoutError:
             LOGGER.warning("No response to %s command", name)
             raise
@@ -367,10 +369,11 @@ class XBee:
         LOGGER.debug("%s command: %s %s", cmd_type, name, args)
         data = t.serialize(args, (AT_COMMANDS[name],))
         try:
-            return await asyncio.wait_for(
-                self._command(cmd_type, name.encode("ascii"), data),
-                timeout=AT_COMMAND_TIMEOUT,
-            )
+            async with self._send_lock:
+                return await asyncio.wait_for(
+                    self._command(cmd_type, name.encode("ascii"), data),
+                    timeout=AT_COMMAND_TIMEOUT,
+                )
         except asyncio.TimeoutError:
             LOGGER.warning("%s: No response to %s command", cmd_type, name)
             raise
@@ -597,9 +600,3 @@ class XBee:
                 raise APIException("Failed to configure XBee for API mode")
         finally:
             self.close()
-
-    def __getattr__(self, item):
-        """Handle supported command requests."""
-        if item in COMMAND_REQUESTS:
-            return functools.partial(self._command, item)
-        raise AttributeError(f"Unknown command {item}")
diff --git a/zigpy_xbee/zigbee/application.py b/zigpy_xbee/zigbee/application.py
index 2158b95..4ad8e62 100644
--- a/zigpy_xbee/zigbee/application.py
+++ b/zigpy_xbee/zigbee/application.py
@@ -302,7 +302,8 @@ class ControllerApplication(zigpy.application.ControllerApplication):
                 "Cannot send a packet to a device without a known IEEE address"
             )
 
-        send_req = self._api.tx_explicit(
+        send_req = self._api._command(
+            "tx_explicit",
             long_addr,
             short_addr,
             packet.src_ep or 0,
@@ -356,7 +357,7 @@ class ControllerApplication(zigpy.application.ControllerApplication):
         # Key type:
         # 0 = Pre-configured Link Key (KY command of the joining device)
         # 1 = Install Code With CRC (I? command of the joining device)
-        await self._api.register_joining_device(node, reserved, key_type, link_key)
+        await self._api._command("register_joining_device", node, reserved, key_type, link_key)
 
     def handle_modem_status(self, status):
         """Handle changed Modem Status of the device."""

I am already looking into this issue, don't know the root cause yet.

From my logs I can see that the protocol is correct, but the response is significantly delayed:

2023-12-09 14:38:49.048 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:38:49.050 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:38:49.051 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x08\x06VR'
2023-12-09 14:38:49.063 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x88\x06VR\x00\x10\x12'
2023-12-09 14:38:49.063 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
...
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:38:54.103 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x08*VR'
...
2023-12-09 14:39:04.105 WARNING (MainThread) [zigpy_xbee.api] at: No response to VR command
2023-12-09 14:39:04.105 WARNING (MainThread) [zigpy.backups] Failed to create a network backup
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/backups.py", line 220, in _backup_loop
    await self.create_backup()
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/backups.py", line 143, in create_backup
    await self.app.load_network_info(load_devices=load_devices)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/zigbee/application.py", line 144, in load_network_info
    version = await self._api._at_command("VR")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 370, in _at_partial
    return await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError
...
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.api] at command: VR ()
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.api] Command at (b'VR', b'')
2023-12-09 14:39:23.820 DEBUG (MainThread) [zigpy_xbee.uart] Sending: b'\x088VR'
...
2023-12-09 14:39:33.825 WARNING (MainThread) [zigpy_xbee.api] at: No response to VR command
2023-12-09 14:39:33.825 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy/application.py", line 656, in _watchdog_loop
    await self._watchdog_feed()
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/zigbee/application.py", line 55, in _watchdog_feed
    await self._api._at_command("VR")
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 370, in _at_partial
    return await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError
2023-12-09 14:39:33.840 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-09 14:39:33.841 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-09 14:39:33.842 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
...
2023-12-09 14:39:45.829 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x88*VR\x00\x10\x12'
2023-12-09 14:39:45.829 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
2023-12-09 14:39:45.829 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 137, in _read_ready
    self._protocol.data_received(data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 95, in data_received
    self.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 103, in frame_received
    self._api.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 392, in frame_received
    getattr(self, f"_handle_{command}")(*data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 414, in _handle_at_response
    fut.set_result(response)
asyncio.exceptions.InvalidStateError: invalid state
...
2023-12-09 14:40:00.257 DEBUG (MainThread) [zigpy_xbee.uart] Frame received: b'\x888VR\x00\x10\x12'
2023-12-09 14:40:00.258 DEBUG (MainThread) [zigpy_xbee.api] Frame received: at_response
2023-12-09 14:40:00.258 ERROR (MainThread) [homeassistant] Error doing job: Exception in callback SerialTransport._read_ready()
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/srv/homeassistant/lib/python3.11/site-packages/serial_asyncio_fast/__init__.py", line 137, in _read_ready
    self._protocol.data_received(data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 95, in data_received
    self.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/uart.py", line 103, in frame_received
    self._api.frame_received(frame)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 392, in frame_received
    getattr(self, f"_handle_{command}")(*data)
  File "/srv/homeassistant/lib/python3.11/site-packages/zigpy_xbee/api.py", line 414, in _handle_at_response
    fut.set_result(response)
asyncio.exceptions.InvalidStateError: invalid state

14:38:49.051 - Sending VR on init
14:38:49.063 - Response to VR (12 ms)
14:38:54.103 - Sending VR on watchdog
14:39:23.820 - Sending VR again
14:39:45.829 - Response to the second VR (51.726 s, after timeout)
14:40:00.257 - Response to the third VR (36.437 s, after timeout)

It is interesting that I don't have this issue on my testing environment, only on production.

I have no knowledge of the module internals, but my best theory is that it puts all requests in a queue. On startup, we are doing a lot of attribute reads, which takes time (especially if a remote device is off or has poor connection quality), so the response even for a local command gets delayed. The lock didn't help because it didn't cover all possible frames, only commands. I think that the module is capable of processing several requests in parallel, but when their number gets high enough we have a queue saturation condition.

So I would suggest two things to address it:

  1. Increase AT_COMMAND_TIMEOUT and REMOTE_AT_COMMAND_TIMEOUT to 60 or even more (it would depend on the number of devices in the network for which we perform attribute read on startup - usually the mains powered ones; 60 works in my case). May be even dynamically calculate the timeout (e.g. 5 * (1 + len(self._awaiting))).
  2. Don't do the watchdog before zha fully initializes (not sure if we can know the zha component state inside zigpy)

@Shulyaka If you have time to test, see if zigpy==0.60.1 (https://github.com/zigpy/zigpy/releases/tag/0.60.1) fixes things for you. I think pausing the watchdog during high-load periods will be enough of a stopgap to fix the current issue.

Will test today.

Nope, it does not solve the issue for XBee.
It's not a concurrency problem, the device receives the request. but it takes it too much time to respond.
Can we try a different strategy for watchdog, like only send probes if nothing was received over serial port for a certain period? I.e. if any traffic goes through, consider the connection as active. If nothing was received for more than a minute, then send the VR command.

This seems like a bug with the radio library. Perhaps there should be a global concurrency limit for requests? It should be possible to multiplex Zigbee requests along with normal radio traffic without completely locking up the XBee like this for 60+ seconds.

The specs don't mention such limit, at least I could not find it.

I've merged the PR to disable the watchdog for now as we can always re-enable it later.

I think #173 may be the proper fix for this problem but I don't have a real XBee network to test it with. If you have time to try it out some time in the future, let me know how it goes (and what concurrency limit allows you to control a lot of devices at once without any retrying from ZHA).