elad-bar / ha-edgeos

Integration with EdgeOS (Ubiquiti)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

daily disconnects

MagieMalone opened this issue · comments

I get daily disconnects , reload integration fixes the issue.

running on v2.0.12

This error originated from a custom integration.

Logger: custom_components.edgeos.component.api.websocket
Source: custom_components/edgeos/component/api/websocket.py:133
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: October 18, 2022 at 19:21:02 (1 occurrences)
Last logged: October 18, 2022 at 19:21:02

Failed to connect WS, Error: 503, message='Invalid response status', url=URL('wss://192.168.3.1/ws/stats'), Line: 109
October 19, 2022
ROUTER Interface ETH0 Status was connected
08:46:50 - 14 minutes ago
ROUTER Interface ETH0 Status was connected
08:34:09 - 27 minutes ago
ROUTER Interface ETH0 Status became unavailable
08:33:59 - 27 minutes ago
October 18, 2022
ROUTER Interface ETH0 Status was connected
07:21:45 - Yesterday
ROUTER Interface ETH0 Status became unavailable
07:21:35 - Yesterday
October 17, 2022
ROUTER Interface ETH0 Status was connected
07:55:17 - 2 days ago

at the same time, ping integration to google dns keeps running during the events.

I'll add debug as well to see what causes this.

this happens due to service disconnections by the router, the component connectivity recovers from it after 30 seconds,
in old versions of component, that warning message was not written unless it failed 3 times in row, now it logs every time it happens.

it also happened for me twice over the last week, again, this is not related to the component (can happen also with the UI of EdgeOS, but again, it recovers from it as well)

btw, you don't need to reload as this is part of the recovery mechanism

everythings stays disconnected, until I reload. for instance eth0 graph stays at 0 for hours.
so this is not the only sy,mptom of the issue I face.

please provide debug log level for that, I will need to understand why it's not recovering (without additional warnings)

I will, once it happens again, so I expect tomorrow.

thanks

@poonab mentioned there is a log message of Update in progress, will skip the request which can cause it, please let me know if it happens to you as well, it can be a great lead

@poonab, are there any log messages with the following pattern before / after it happens (under the assumption log level is set to INFO / DEBUG):
WS Status changed to ***, API Status: ***?

released new version with improved connectivity, set the message to e displayed only when the connection was not created, in case of disconnection, based on the error,
in addition, fixed the ping message which could of cause it since it was not working.

image

data is missing again from the graph but in the log I see it is still being collected. Btw, the state in Developer tools is also not changing since from around 0:00, however not much around that time that would indicate an issue.

some logs:
https://pastebin.com/1AAKNK1Y

struggling to paste bigger log, as debug creates a TON :-)

the log you have sent is from 07:43:02, i cannot understand anything with log of one second :)
please upload the entire log begins before it stopped working

in addition, I added 3 new sensors of messages: received, ignored and error,
can you please check the received messages sensor's graph?

thanks

I don't have those new sensors yet, as I still ran on 0.12 when the issue occured. I will update to it now.

I will try getting to it by evening, with the latest version (v2.0.13) - enough to set the logs for info and above to understnad the issue

Happened again last night at around 0:00, currently on 2.0.14
home-assistant.log.gz

Is it happening at the same time every night?

the last 2 nights, yes, at midnight, and the night before at 19:20

Mine also failed again @ 21:54:40, not happening at the same time. Errors on HA are

This error originated from a custom integration.

Logger: homeassistant
Source: custom_components/edgeos/component/api/websocket.py:158
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 22 October 2022 at 21:54:40 (1101 occurrences)
Last logged: 13:11:36

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/config/custom_components/edgeos/component/managers/home_assistant.py", line 83, in async_send_heartbeat
    await self.ws.async_send_heartbeat()
  File "/config/custom_components/edgeos/component/api/websocket.py", line 158, in async_send_heartbeat
    await self._ws.send_str(data)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_ws.py", line 151, in send_str
    await self._writer.send(data, binary=False, compress=compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 690, in send
    await self._send_frame(message, WSMsgType.TEXT, compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

This error originated from a custom integration.

Logger: custom_components.edgeos.core.managers.home_assistant
Source: custom_components/edgeos/core/managers/home_assistant.py:244
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 20 October 2022 at 04:12:41 (55 occurrences)
Last logged: 01:17:16

Update in progress, will skip the request

Skipping through the logs I cant see anything weird included but at 21:54 all the data that continues to be logged going forward is no longer updating the values

That is not the same error is reported in this issue, another issue, saw it as well and this one already has a fix

please check v2.0.15, should improve dramtically the stability, for me 3 days without a single warning / error

thanks

I upgraded to v2.0.16 yesterday, happened again last night. INFO level log doesn't show much, just that after 1s. past midnight no values change anymore.
192.168.3.9-20221029-095831.log

According to the log you have sent, there are updates up until the last log line:
2022-10-29 00:00:03.101 INFO (SyncWorker_30) [custom_components.edgeos.core.managers.entity_manager] ROUTER Interface ETH0 Received Traffic (sensor) updated, state: 527423484.475 | updated

in addition, i don't see a single disconnection within that log, please use debug log level to get more details for the same period, the fact it happens on dialy basis around the same time is weird, happens to me in the past because lots of load on the DNS server I ran (pi-hole), could it be that you are using HASS.IO / AdGuard / PiHole?

thanks

I enabled debug yesterday, included the log from 23:58-00:02
192.168.3.9-20221030-102935.zip

I run pi-hole but not on HA, it's on a seperate VM.

Will review the log file later today, thanks.

Additional direction, you might have the same issue I had around a year ago when the pihole was overloaded at specific time which caused many unexplained issues related to networking,
Please check how to increase the number of requests in parallel to the pihole.

I added some monitoring to esxi yesterday, it seems it's not busy on the cpu's at the time:
image
also not on HA:
image
though it is currious to see that it drop cpu cycles right at that time...

is the second screen shot is from the component of EdgeOS CPU?

thanks

no, overal HA processor usage. I'm putting an snmp probe in for the edgeos as we speak. and I'm automating the reloading for now at 00:30.

Can you share the screen shot of the EdgeOS CPU / Memory sensors?

Thanks

image

the busy cpu's on esxi don't happen until 00:45 and apparently my automation is not working (yet) :-)

I do get some errors in the log now:

Logger: homeassistant
Source: custom_components/edgeos/component/api/websocket.py:160
Integration: Ubiquiti EdgeOS Routers ([documentation](https://github.com/elad-bar/ha-edgeos), [issues](https://github.com/elad-bar/ha-edgeos/issues))
First occurred: 00:00:48 (491 occurrences)
Last logged: 06:49:09

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/config/custom_components/edgeos/component/managers/home_assistant.py", line 88, in async_send_heartbeat
    await self.ws.async_send_heartbeat()
  File "/config/custom_components/edgeos/component/api/websocket.py", line 160, in async_send_heartbeat
    await self._ws.send_str(data)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_ws.py", line 151, in send_str
    await self._writer.send(data, binary=False, compress=compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 690, in send
    await self._send_frame(message, WSMsgType.TEXT, compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

If the issue resulted by the error above that's good, add validation for that error and reconnect at that point,
it's available in v2.0.20, please let me know if it solved the issue.

thanks

it seems the issue did not occur last night while I upgraded to v2.0.20, saw this in the log:

2022-11-02 00:00:00.667 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Handle interfaces data
2022-11-02 00:00:01.207 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Message queue is empty, will try to resample in a second
2022-11-02 00:00:01.408 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Keep alive message sent
2022-11-02 00:00:01.408 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Keep alive data to be sent: 69{"CLIENT_PING": "", "SESSION_ID": "1c7a652b34a44b1fbc622f296284dc88"}
2022-11-02 00:00:01.408 WARNING (MainThread) [custom_components.edgeos.component.api.websocket] Connection reset error raise by the client, will try to reconnect soon, Error: Cannot write to closing transport
2022-11-02 00:00:01.411 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Failed, API Status: Connected
2022-11-02 00:00:01.561 DEBUG (SyncWorker_22) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:02.211 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Stop listening
2022-11-02 00:00:02.211 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to NotConnected, API Status: Connected
2022-11-02 00:00:02.562 DEBUG (SyncWorker_18) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:03.562 DEBUG (SyncWorker_19) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:04.563 DEBUG (SyncWorker_46) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:05.564 DEBUG (SyncWorker_28) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:06.566 DEBUG (SyncWorker_9) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:07.567 DEBUG (SyncWorker_26) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:08.568 DEBUG (SyncWorker_4) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:09.568 DEBUG (SyncWorker_6) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:10.569 DEBUG (SyncWorker_45) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:11.569 DEBUG (SyncWorker_0) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:12.570 DEBUG (SyncWorker_7) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:13.439 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading sys_info data
2022-11-02 00:00:13.498 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading dhcp_stats data
2022-11-02 00:00:13.572 DEBUG (SyncWorker_30) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:13.707 DEBUG (MainThread) [custom_components.edgeos.component.api.api] Loading dhcp-leases data
2022-11-02 00:00:13.945 DEBUG (MainThread) [custom_components.edgeos.component.managers.home_assistant] Extracting API Data
2022-11-02 00:00:14.572 DEBUG (SyncWorker_21) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:15.573 DEBUG (SyncWorker_24) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:16.432 WARNING (MainThread) [homeassistant.components.sensor] Updating snmp sensor took longer than the scheduled update interval 0:00:10
2022-11-02 00:00:16.433 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_1 is taking over 10 seconds
2022-11-02 00:00:16.434 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_2 is taking over 10 seconds
2022-11-02 00:00:16.434 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_3 is taking over 10 seconds
2022-11-02 00:00:16.435 WARNING (MainThread) [homeassistant.helpers.entity] Update of sensor.edgeos_cpu_4 is taking over 10 seconds
2022-11-02 00:00:16.574 DEBUG (SyncWorker_28) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:17.574 DEBUG (SyncWorker_31) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:18.575 DEBUG (SyncWorker_18) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:19.575 DEBUG (SyncWorker_19) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:20.576 DEBUG (SyncWorker_30) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:21.577 DEBUG (SyncWorker_21) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:22.578 DEBUG (SyncWorker_46) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:23.579 DEBUG (SyncWorker_5) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:24.580 DEBUG (SyncWorker_47) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:25.580 DEBUG (SyncWorker_3) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:26.433 WARNING (MainThread) [homeassistant.components.sensor] Updating snmp sensor took longer than the scheduled update interval 0:00:10
2022-11-02 00:00:26.580 DEBUG (SyncWorker_37) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:27.581 DEBUG (SyncWorker_45) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:28.582 DEBUG (SyncWorker_2) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:29.583 DEBUG (SyncWorker_14) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:30.583 DEBUG (SyncWorker_12) [custom_components.edgeos.component.managers.home_assistant] Loading entities
2022-11-02 00:00:31.412 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Reinitializing WebSocket connection
2022-11-02 00:00:31.412 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Connecting, API Status: Connected
2022-11-02 00:00:31.526 INFO (MainThread) [custom_components.edgeos.component.managers.home_assistant] WS Status changed to Connected, API Status: Connected
2022-11-02 00:00:31.526 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Starting to listen connected
2022-11-02 00:00:31.526 DEBUG (MainThread) [custom_components.edgeos.component.api.websocket] Subscription data to be sent: 164{"SUBSCRIBE":[{"name":"export"},{"name":"interfaces"},{"name":"system-stats"},{"name":"discover"}],"UNSUBSCRIBE":[],"SESSION_ID":"1c7a652b34a44b1fbc622f296284dc88"}
2022-11-02 00:00:31.527 INFO (MainThread) [custom_components.edgeos.component.api.websocket] Subscribed to WS payloads
2022-11-02 00:00:31.584 DEBUG (SyncWorker_8) [custom_components.edgeos.component.managers.home_assistant] Loading entities

so fingers crossed :-)

really glad to see that it might got solved with latest version,
in general I see within the last log you have sent that around the time the WS connection got disconnected from EdgeOS, the SNMP sensors you have didn't manage to get response from the EdgeOS as well,
meaning, as I suspected before, you have an overload on the network around that time, which can be resulted by a load on the PiHole.

PiHole load is not bound to CPU, disk or memory, it's a configuration that limits the number of requests to the DNS server, usually 1k requests per minute,
Usually in rich IoT environment around 12:00 AM there are lots of requests running which can lead to that kind of problem, anyway, the solution I have added is for every ping being sent (every 55 seconds) there is a validation for connectivity, if got disconnected, the integration will try to reconnect.

thanks for the details and please update how it works in the upcoming few days before closing this issue.

well, the snmp got disconnected because I forgot to save the config, and I had an unrelated crash of the edgeos router :-) I will let you know how the comming days go.

Too bad, I had a failure again last night again.
so, my ping to google keeps running:
image

however, the connection to the router fails to connect for some reason, even though the traffic to google is using exactly the same route:

This error originated from a custom integration.

Logger: custom_components.edgeos.component.api.api
Source: custom_components/edgeos/component/api/api.py:319
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 00:00:03 (1 occurrences)
Last logged: 00:00:03

Request failed, URL: https://192.168.3.1/api/edge/get.json, Error: Cannot connect to host 192.168.3.1:443 ssl:False [Connect call failed ('192.168.3.1', 443)], Line: 194, Retry attempt #3

as you can see from the error, I don't use DNS for the connection to the router, so I don't think it's pihole.

cpu's of both router and esxi are not high, and at least the automated reload is working now.
image

This error occurred because the router web API (not WebSockets persistent connection) was inaccessible, nothing the integration can do with it,
As you can see, it happened once for 3 attempts and continue regularly in the next attempt.

This issue is not related to the integration but to the router it self, CPU, RAM and network traffic doesn't matter for that case as the device HTTP server is the one that not responding (Overload or restart of the service) at the same time on daily basis.

BTW if you are using the IP instead of Hostname it rules out the DNS server.

This night there was no disconnect.

no disconnects so far, seems this one is solved.