elad-bar / ha-edgeos

Integration with EdgeOS (Ubiquiti)

ha-edgeos loses ssh connection to edgerouter when it reboots, stops updating statistics

Dude4Linux opened this issue · comments

I've determined that the EdgeOS integration loses the ssh ssl connection to the edgerouter and stops updating status and statistics until the HA server is rebooted. It would be helpful to detect the loss of connection and automatically restart it without needing human intervention.
The relevant logs:

2023-04-29 21:33:45.945 WARNING (MainThread) [NetCat Monitor] Internet connection lost
2023-04-29 21:34:49.641 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, Endpoint: set, Error: set, Line: 280
2023-04-29 21:34:58.972 ERROR (MainThread) [custom_components.edgeos.component.api.api] Failed to set state of interface eth0 to False
2023-04-29 21:35:09.045 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, Endpoint: delete, Error: delete, Line: 280
2023-04-29 21:35:09.051 ERROR (MainThread) [custom_components.edgeos.component.api.api] Failed to set state of interface eth0 to True
2023-04-29 21:40:30.941 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-29 21:40:37.087 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-29 21:40:46.301 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-29 21:41:13.947 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-29 21:41:23.165 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-29 21:41:31.929 WARNING (MainThread) [custom_components.edgeos.component.api.api] Request failed, URL: https://10.10.0.1/api/edge/get.json, Error: Cannot connect to host 10.10.0.1:443 ssl:False [Connect call failed ('10.10.0.1', 443)], Line: 226, Retry attempt #3
2023-04-30 03:34:11.108 WARNING (MainThread) [NetCat Monitor] Internet connection lost

[NetCat Monitor] is a script I wrote using netcat to monitor the Internet connection. At 2023-04-29 21:33:45.945, the script has detected that the Internet connection has been lost for 10 minutes and that it has started rebooting the cable modem and router, a process that can take up to ten minutes. After about one minute, ha-edgeos begins reporting errors on the state of interface eth0 (the Internet interface). After six attempts to connect to the router ('10.10.0.1', 443), the edgeos.component apparently gives up and no longer attempts to update status or statistics. The next day at 2023-04-30 03:34:11.108, the Internet connection was lost again. No further activity by ha-edgeos was logged until I rebooted the HA server about a week later.

FWIW, I was looking to use the ha-edgeos integration to reboot the router instead of having to rely on a smart plug to cycle power.

The component is not using SSH. I need to check the code, but it does have recovery: as you can see in the logs, "Retry attempt #3" means there were 2 additional attempts to reconnect, but the router was inaccessible.

To better understand your issue, I will need diagnostic details captured while you are facing it. In addition, I will need debug-level logs covering the roughly 10 minutes from the disconnection onward.
What I'm missing here is a disconnection of the WS connection, which most probably happened before the Internet disconnection. If that's the situation, the process should have recovered; if not, I don't understand how it worked (WS kept the connection).

@elad-bar Sorry, I meant ssl. In the attached log (gzipped file) I have debug level turned on. At 10:06 I turned off the power to the edgerouter and waited for ten minutes before turning it back on. After a while it gives up and stops updating because the session has been lost. AFAIK, HA has to be restarted to establish a new working session.

edgeos_2023-05-09T15-42-36.032Z.log.gz

Will review the log later this week. Is HA still functioning while the Internet is down and after the router restarts, or must you restart HA in general?

Yes, HA continues to run throughout the outage. I'm running on a Raspberry Pi4. I did note in the test that since it is connected to one of the ports of the EdgeRouter, it does lose connection to the rest of my network except for IoT which uses the WiFi vlan.

FWIW, a bit of background on my situation. I'm currently fighting an intermittent connection problem with my ISP. It has reached the finger-pointing stage. ISP says their cable modem is fine, must be my EdgeRouter-X. I can find nothing wrong with my router and think the problem must be the ISP. The solution has always been to power cycle the cable modem first, then the EdgeRouter. Both must be cycled in the right sequence and timing which can result in the EdgeRouter being down for six minutes. I wrote a script to monitor the Internet connection and an HA automation to do the power cycling. I installed the EdgeOS integration hoping to be able to reboot the EdgeRouter without needing to use a smart-plug for the power-cycle. I still haven't figured out how to reboot or shut down the router via the EdgeOS integration.

Hi, sorry for the long wait. The integration doesn't currently support reboot, and that functionality would also be kind of complicated to test. I would suggest turning the port off and then on.
ISPs tend to blame private equipment instead of faults in their own systems. Rebooting the router that frequently might kill the flash memory.

I found that it was easier to implement the firewall reboot feature using an HA shell command. First you have to add the 'homeassist' user to the edgerouter and enable password-less login using an ssh key pair generated by ssh-keygen.

Add to /homeassistant/configuration.yaml

# shell command to reboot EdgeRouter
shell_command:
  reboot_edgerouter: "ssh -p 2222 -i /config/.ssh/id_ed25519 -o UserKnownHostsFile=/config/.ssh/known_hosts -qt homeassist@10.10.0.1 'sudo reboot'"

input_button:
  reboot_firewall:
    name: Reboot Firewall
    icon: mdi:wall-fire

reboot_edgerouter can now be called from an automation or by clicking on a 'Reboot Firewall' button added to a customized card.
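
One way to wire the button to the shell command is an automation along the following lines. This is only a sketch and is not taken from the thread: the alias is invented for illustration, and it assumes automations are defined under an `automation:` key in configuration.yaml. If that key already exists, or automations live in automations.yaml, only the list item should be merged in.

```yaml
# Hypothetical automation: pressing the 'Reboot Firewall' button runs the shell command.
# The alias is an illustrative name; the entity and service names come from the
# input_button and shell_command entries defined above.
automation:
  - alias: "Reboot EdgeRouter on button press"
    trigger:
      - platform: state
        entity_id: input_button.reboot_firewall
    action:
      - service: shell_command.reboot_edgerouter
```

An input_button's state changes to the time of the press, so a plain state trigger with no `to:` value fires on every press.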

@Dude4Linux do you have your internet monitoring script anywhere?

You can use speedtest or ping integration to do that kind of work for you

@Dude4Linux do you have your internet monitoring script anywhere?

I put the following bash script in /homeassistant/bash_scripts/check_connect where it will be safe from upgrades, backup/restore, etc.

#!/bin/bash
# Script to test connectivity against multiple hosts using NetCat
# Run tests in parallel using bash background and wait

#Connection timeout
timeout=10

# examples of hosts_port addresses
hosts_port=("8.8.8.8:53" "1.1.1.1:53" "8.8.4.4:53" "1.0.0.1:53")
#hosts_port=("8.8.8.8 53" "1.1.1.1 53" "8.8.4.4 53" "1.0.0.1 53")
#hosts_port=("google.com:80" "cisco.com:80")

# all hosts fail test
#hosts_port=("8.8.8.8:54" "1.1.1.1:54" "8.8.4.4:54" "1.0.0.1:54")

# connect timeout host port
connect() {
  /usr/bin/nc -z -w $1 $2 $3
}

size=${#hosts_port[*]}
# Default status is failed
status="off"

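# Launch connection attempts in background, remembering each PID
pids=()
for (( i=0; i<size; i++ )); do
  current=${hosts_port[i]/:/ }     # Replace the colon with a space: "host port"
  connect $timeout $current &      # Unquoted on purpose so host and port become two arguments
  pids+=($!)
done

# If any connection succeeds, report "on". Every job is checked,
# so one early failure cannot hide a later success.
for pid in "${pids[@]}"; do
  if wait "$pid"; then
    status="on"
  fi
done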

echo $status

The script uses NetCat (nc) to open a TCP connection instead of ping (ICMP) because TCP is less likely than ICMP to be blocked or dropped along the way. I chose to test against four different, well-known, highly reliable hosts, i.e. the DNS servers from Google and Cloudflare. The tests are launched against all four hosts at once and if any one of them responds, the connection is considered to be "up". This protects against temporary outages at the remote networks.

Add the following to /homeassistant/configuration.yaml to launch the script once per minute.

# is the Internet up?    
command_line:
  - binary_sensor:
      command: '/homeassistant/bash_scripts/check_connect'
      command_timeout: 59
      name: "NetCat Monitor"
      device_class: connectivity
      payload_on: "on"
      payload_off: "off"
      scan_interval: 60

This has been working well for me for over a year and has survived multiple upgrades and migration to a new server.
You can check history of "NetCat Monitor" to see the times and durations of any outages.
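
The outage handling described earlier in the thread (act once the connection has been down for ten minutes) could be sketched as an automation like the one below. It rests on assumptions: `binary_sensor.netcat_monitor` is the entity ID HA would normally derive from the name "NetCat Monitor", `shell_command.reboot_edgerouter` is the shell command posted earlier in this thread, and the alias is invented. It only reboots the router; power cycling the cable modem and sequencing the two devices are left out.

```yaml
# Hypothetical automation: reboot the router after a sustained outage.
# Check the entity ID under Developer Tools > States before relying on it.
automation:
  - alias: "Reboot EdgeRouter after sustained outage"
    trigger:
      - platform: state
        entity_id: binary_sensor.netcat_monitor
        to: "off"
        for: "00:10:00"   # sensor must stay "off" for ten minutes
    action:
      - service: shell_command.reboot_edgerouter
```

The `for:` option makes the trigger wait until the sensor has stayed "off" for the whole period, so a single failed check does not cause a reboot.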

I'm running Uptime Kuma, which allows monitoring internal and external connections. In addition, I'm using a Speedtest exporter to monitor connection speed and ping.
Once a metric reaches a specific threshold, I get a notification. The last piece of monitoring is UptimeRobot, which monitors connectivity from the outside world into HA.
As you can see, everything is monitored and audited, with no special scripts and no dependency on specific hardware.