eifinger / hass-foldingathomecontrol

Homeassistant integration for FoldingAtHomeControl

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Connection after restart

TargetCrafter opened this issue · comments

Version of the custom_component

v0.0.1

I had the integration working yesterday, but i shutdown the pc overnight, and now it's showing a status from 15 hours ago. Even after restarting HA, still no result.

Thank you for opening an issue.
I will look into it.

I just changed the configuration to only use 8 cores on my cpu, and the sensor in HA changed accordingly. So it seems like there is still a connection between the PI and my pc, it's just not updating the status.

After some more digging, it looks like the addon is adding a new sensor every time you change a setting like amount of cores used. I first change the setting to 4 cores, after wich i changed it to 8 cores, and in HA i have 3 CPU sensors now: cpu:15, cpu:4 and cpu:8.

Not that i'm changing the setting every day, but i'd be nice if this would be cleaned up to prevent having loads of useless sensors.

Thanks for the feedback. I created a new Issue to solve that.
Cant promise when its done.

As for your initial problem. Are you sure that the PC was working on a workunit and thus the state should have changed?

Yes, the pc was running a calculation at that time, so it should've changed from paused to running.

It's also just showing a status from 14 hours ago, eventough i just ran a task this morning:
image

I could reproduce your error and have a fix.
I still struggle to produce a unit test^^

Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.

Almost same here. After PC reboot (or just the F@H server), HA does not reconnect and shows old values. But in my case, rebooting HA fix the problem.

Yes, rebooting HA fixes the problem, but i don't want to restart HA every time i restart my computer..

I could reproduce your error and have a fix.
I still struggle to produce a unit test^^

If you need help testing, i'm willing to help.

I Just published v0.1.0.
This should fix the reconnection issue.

Seems to be working now, will monitor over the next couple of days.

Thanks @eifinger

Even here seems working well, thank you. I would suggest, in case of lost connection, to reset counters (ETA, next attempt, etc.) and change current state to something else, meybe OFFLINE, instead staying stock to last reading (usually STOPPING)

After a while, connection is stuck but I don't know why. I tried to reboot the server and seemed working well, but now is stuck...

Again, rebooted the PC the connection didn't be enstablish again

Do you see anything in the logs?

No, nothing in the log. You can reproduce the issue detaching the cable from the PC.

Just had the same thing, after a pc restart both sensors are stuck in the PAUSED state.
Just like @JanosAudron nothing in the logs.

Guess I'll be pulling some Ethernet cables tonight^^

Thank you for providing feedback!

Found the bug. Fix is on its way.

v0.1.1 is out and should finally fix this!

Hate to be that guy, but...
Just restarted my pc, and it is stuck on paused again..
After restarting HA it picks up the correct status again.

@TargetCrafter do you use a password for the FoH Client on the PC that you restarted?

I did, but now that you talked about it, i tried to see if the issue still persists without a password, which is the case unfortunately..

It seems there is a race condition somewhere. I will keep investigating!

While I am trying to debug this, could you please include the following log configurations:

Either as data to the service call logger.set_level:

custom_components.foldingathomecontrol: debug
FoldingAtHomeControl: debug

Or in configuration.yaml:

logger:
  default: warning
  logs:
    custom_components.foldingathomecontrol: debug
    FoldingAtHomeControl: debug

Just released v0.1.2.
I tested it with two of my pc, pulling cables, turning off WIFI and quitting the FoH client.

I so hope this works for everyone now!

I installed the update as soon as possible, but just now received a task. Unfortunately this update didn't show up until a restart of HA.

I added the provided settings to my configuration.yaml file, where should i see the logs for this?
i checked the logbook and dev tools logs, but saw nothing.

I'm starting to think i'm doing something wrong here..

You should see them in the dev logs. IF there are errors.

For me it's working without errors with the latest version.

One of my PC has been connected with a different IP (changed from wired to wireless connection) for 5-6 hours. Once reconnected with old IP, Home Assistant should recconnect itself but didn't.
From log, I have this:

2020-04-12 18:33:35 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330
2020-04-12 18:33:48 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330
2020-04-12 18:34:01 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330
2020-04-12 18:34:14 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Could not connect to 192.168.0.128:36330
2020-04-12 18:34:29 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.128:36330

Attached at the present, foll log. More or less I reconnected the PC at "2020-04-12 18:34:29", since then no more data in the log, now is 20:07
FAH Log.txt

@JanosAudron Looks more like a problem on your PC than a problem with this addon. Maybe you didn't add the firewall rule to the profile you're using for you wifi connection?

I have 4 sensors, 1 CPU and 1 GPU on two PCs. I don't think is firewall fault, for two reason. First is, after HA reboot sensors started to work again (see the pictures attached). Second is because, after PC reconnection, we do not see any more logs into the file, like the addon got stock for any reason.

home-assistant.log
After reboot CPU
After reboot GPU
Before reboot CPU
Before reboot GPU

No more logs means it connected successfully again. What version of the integration are you on?

No more logs means it connected successfully again.

I do not agree because at every reconnection you can find a welcome message in the log, but was not the case.
DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.

I'm using version v.0.1.4

@JanosAudron I suspect a race condition is the root cause of your error.

I set the is_connected flag of the serialconnection after the socket was opened but before the welcome message was received.

If the timeout for the whole connection is reached just before the welcome message was fully received, the operation is cancelled but the flag is_connected is still true leading to a wrong state.

I fixed this behavior. It will be included in the next release.

However I cannot be sure that this is really what happened here.

If I reboot HASS, I get a disconnect, and need to reinitiate the custom component still :(

@baronKanon could explain a little bit further?

When you restart Homeassistant, the sensor shows unavailable ?
What do your logs tell you when being set as explained here?

@eifinger The sensor is unavailable after Home assistant restart, and the debug shows timeout error for connection to the client (on my PC). I usually have some timeouts for connections to external services during startup of HASS, as my RPi 3b+ has some stuff to set up. 20 minutes after initialization of the component, still no updating from the computer available.

I would like to add that I rename the entity ID to something easy to read, but no new entity is created. I guess this initial setup of the socket is also the culprit for not updating after my computer has been turned off for the night.

Are there any heartbeat to the client setup (if no updates for x amount of time - retry socket or something)? How many retries and with what frequency are they run during startup (so the system has the possibility to get up and running before the connection can be established)?

@baronKanon that is expected behavior.
If during startup the connection attempt times out it is assumed that the endpoint is not reachable and is disabled.

However I do see your point in changing this.

Could you please open another issue for that?

I paused client more or less at 10.11am. After 1 hour and half, at 11.46 (read on FoldingAtHome log) I unpaused it, but at that time addon reported a connection timeout. Actually, for 10-20 seconds even FAH client lost the connection as was reporting "updating"
Now, calcutation is running but addon is reporting unavailable.

Log from addon:

2020-04-18 17:06:05 WARNING (MainThread) [homeassistant.loader] You are using a custom integration for foldingathomecontrol which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant.
2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback
2020-04-18 17:06:16 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback
2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the FAHClient command server.
2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback
2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Registered callback
2020-04-18 17:06:18 DEBUG (MainThread) [FoldingAtHomeControl.serialconnection] Received welcome message: Welcome to the Folding@home Client command server.

2020-04-19 11:46:15 ERROR (MainThread) [FoldingAtHomeControl.serialconnection] Timeout while trying to read from 192.168.0.1:36330

2020-04-19 11:46:17 ERROR (MainThread) [custom_components.foldingathomecontrol] Disconnected from 192.168.0.1:36330. Trying to reconnect.
2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330
2020-04-19 11:46:24 ERROR (MainThread) [FoldingAtHomeControl.foldingathomecontrol] Timeout while trying to connect to 192.168.0.1:36330

Log from FoldingAtHome (please consider 2h time shift)

07:29:21:WU00:FS01:0x22:Completed 1760000 out of 2000000 steps (88%)
07:33:39:WU00:FS01:0x22:Completed 1780000 out of 2000000 steps (89%)
07:37:56:WU00:FS01:0x22:Completed 1800000 out of 2000000 steps (90%)
07:37:58:WU02:FS01:Connecting to 65.254.110.245:80
07:37:59:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:37:59:WU02:FS01:Connecting to 18.218.241.186:80
07:37:59:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:37:59:WU02:FS01:Connecting to 65.254.110.245:80
07:38:00:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:38:00:WU02:FS01:Connecting to 18.218.241.186:80
07:38:00:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:38:00:ERROR:WU02:FS01:Exception: Could not get an assignment
07:38:01:WU02:FS01:Connecting to 65.254.110.245:80
07:38:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:38:01:WU02:FS01:Connecting to 18.218.241.186:80
07:38:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:38:02:WU02:FS01:Connecting to 65.254.110.245:80
07:38:02:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:38:02:WU02:FS01:Connecting to 18.218.241.186:80
07:38:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:38:03:ERROR:WU02:FS01:Exception: Could not get an assignment
07:39:01:WU02:FS01:Connecting to 65.254.110.245:80
07:39:01:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:39:02:WU02:FS01:Connecting to 18.218.241.186:80
07:39:02:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:39:02:WU02:FS01:Connecting to 65.254.110.245:80
07:39:03:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:80': No WUs available for this configuration
07:39:03:WU02:FS01:Connecting to 18.218.241.186:80
07:39:03:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
07:39:03:ERROR:WU02:FS01:Exception: Could not get an assignment
07:40:38:WU02:FS01:Connecting to 65.254.110.245:80
07:40:39:WU02:FS01:Assigned to work server 13.82.98.119
07:40:39:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP106 [GeForce GTX 1060 3GB] 3935 from 13.82.98.119
07:40:39:WU02:FS01:Connecting to 13.82.98.119:8080
******************************* Date: 2020-04-19 *******************************
07:41:25:WU02:FS01:Downloading 22.03MiB
07:41:31:WU02:FS01:Download 10.50%
07:41:37:WU02:FS01:Download 22.98%
07:41:43:WU02:FS01:Download 35.75%
07:41:49:WU02:FS01:Download 48.24%
07:41:55:WU02:FS01:Download 60.72%
07:42:01:WU02:FS01:Download 73.49%
07:42:07:WU02:FS01:Download 86.26%
07:42:13:WU02:FS01:Download 98.75%
07:42:13:WU02:FS01:Download complete
07:42:13:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:14545 run:0 clone:653 gen:12 core:0x22 unit:0x000000170d5262775e80e224425749f3
07:42:19:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%)
07:46:36:WU00:FS01:0x22:Completed 1840000 out of 2000000 steps (92%)
07:50:53:WU00:FS01:0x22:Completed 1860000 out of 2000000 steps (93%)
07:55:13:WU00:FS01:0x22:Completed 1880000 out of 2000000 steps (94%)
07:59:31:WU00:FS01:0x22:Completed 1900000 out of 2000000 steps (95%)
08:03:50:WU00:FS01:0x22:Completed 1920000 out of 2000000 steps (96%)
08:08:18:WU00:FS01:0x22:Completed 1940000 out of 2000000 steps (97%)

08:11:27:FS01:Paused

08:11:27:FS01:Shutting core down
08:11:28:WU00:FS01:0x22:WARNING:Console control signal 1 on PID 18536
08:11:28:WU00:FS01:0x22:Lost lifeline PID 18800, exiting
08:11:28:WU00:FS01:0x22:Exiting, please wait. . .
08:11:28:WU00:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
08:11:28:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)

09:46:04:FS01:Unpaused

09:46:04:WU00:FS01:Starting
09:46:04:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" E:\FoldingHome\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 6468 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
09:46:04:WU00:FS01:Started FahCore on PID 18152
09:46:08:WU00:FS01:Core PID:21408
09:46:08:WU00:FS01:FahCore 0x22 started
09:46:10:WU00:FS01:0x22:*********************** Log Started 2020-04-19T09:46:10Z ***********************
09:46:10:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
09:46:10:WU00:FS01:0x22: Type: 0x22
09:46:10:WU00:FS01:0x22: Core: Core22
09:46:10:WU00:FS01:0x22: Website: https://foldingathome.org/
09:46:10:WU00:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
09:46:10:WU00:FS01:0x22: Author: John Chodera john.chodera@choderalab.org and Rafal Wiewiora
09:46:10:WU00:FS01:0x22: rafal.wiewiora@choderalab.org
09:46:10:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 18152 -checkpoint 15
09:46:10:WU00:FS01:0x22: -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
09:46:10:WU00:FS01:0x22: 0 -gpu 0
09:46:10:WU00:FS01:0x22: Config:
09:46:10:WU00:FS01:0x22:************************************ Build *************************************
09:46:10:WU00:FS01:0x22: Version: 0.0.2
09:46:10:WU00:FS01:0x22: Date: Dec 6 2019
09:46:10:WU00:FS01:0x22: Time: 21:30:31
09:46:10:WU00:FS01:0x22: Repository: Git
09:46:10:WU00:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
09:46:10:WU00:FS01:0x22: Branch: HEAD
09:46:10:WU00:FS01:0x22: Compiler: Visual C++ 2008
09:46:10:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:46:10:WU00:FS01:0x22: Platform: win32 10
09:46:10:WU00:FS01:0x22: Bits: 64
09:46:10:WU00:FS01:0x22: Mode: Release
09:46:10:WU00:FS01:0x22:************************************ System ************************************
09:46:10:WU00:FS01:0x22: CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
09:46:10:WU00:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
09:46:10:WU00:FS01:0x22: CPUs: 4
09:46:10:WU00:FS01:0x22: Memory: 7.94GiB
09:46:10:WU00:FS01:0x22:Free Memory: 3.08GiB
09:46:10:WU00:FS01:0x22: Threads: WINDOWS_THREADS
09:46:10:WU00:FS01:0x22: OS Version: 6.2
09:46:10:WU00:FS01:0x22:Has Battery: true
09:46:10:WU00:FS01:0x22: On Battery: false
09:46:10:WU00:FS01:0x22: UTC Offset: 2
09:46:10:WU00:FS01:0x22: PID: 21408
09:46:10:WU00:FS01:0x22: CWD: E:\FoldingHome\work
09:46:10:WU00:FS01:0x22: OS: Windows 10 Pro
09:46:10:WU00:FS01:0x22: OS Arch: AMD64
09:46:10:WU00:FS01:0x22:********************************************************************************
09:46:10:WU00:FS01:0x22:Project: 14432 (Run 0, Clone 1394, Gen 11)
09:46:10:WU00:FS01:0x22:Unit: 0x0000000e0d5262775e8b4d5706ef330d
09:46:10:WU00:FS01:0x22:Digital signatures verified
09:46:10:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
09:46:10:WU00:FS01:0x22:Version 0.0.2
09:46:10:WU00:FS01:0x22: Found a checkpoint file
09:47:23:WARNING:WU00:FS01:Detected clock skew (1 mins 13 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:47:31:WU00:FS01:0x22:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
09:51:56:WU00:FS01:0x22:Completed 1820000 out of 2000000 steps (91%)

Thank you for further logs.
I am actually making progress with the next release but stumbled into some async issues.

I currently have a lot to do at work and need to give my mind something else to do than programming.

I plan to continue work on this next weekend.

After the latest update, the issue still exists..

I just released v1.0.0.

Feel free to report back if this fixes this issue finally =)

BREAKING CHANGE
Please remove all entries for this integration, update and then readd them.
BREAKING CHANGE

It seems like the problem has been solved. After restarting my computer, the status changed automatically. I'll try again in a couple of days.

Good job!

@TargetCrafter thank you for testing. I will close the issue. If the problem appear again feel free to reopen it.

Sensors are stuck again, is there a time limit for it to reconnect? Right now i've not been folding for 1 hours, but when i tried it yesterday by quickly restarting my pc it did work.

  1. On which value are your sensors stuck? The latest or unknown
  2. Which pc did you restart? The folding pc or homeassistant pc?`
  3. You have been folding for 1 hour now, but the sensors do not show current data?
  4. Do you have any log messages?
  5. If you changed the options, what is your update rate and read timeout?
  1. The sensors are stuck on pause, this was their last known state.
  2. i restarted the FaH pc.
  3. been folding for 3 hours now, still stuck at their last known state from 16 hours ago.
  4. don't know where.
  5. didn't change them, 5 and 15 respectively.

You can go directly to the url <your-ha-url>/developer-tools/logs or Go via the UI on the bottom left the Hammer Icon "Developer Tools" and then the tab "Logs"

There are no errors there regarding this addon.

Unless you turned off all error logging there has to be an error from when you turned off your FaH Pc that the connection got lost.

I could implement a "reset_connection" service however.

Going over the code I got an idea, could you change the update rate and see if that resets the connection?

Changing the update rate doesn't fix it, but a restart does.
I guess the FaH client doesn't send out a ping when it changes state, and you just have to read the data. After a long time of no change you'd likely stop asking for new data to reduce the amount of requests.

if so, i'd suggest to slow down update rate to once every 15 minutes instead of stopping requests all together.

When a new connection to the FaH client is made I tell it to send me updates every <update_rate> seconds. When I don't receive anything for <read_timeout> seconds I kill and reestablish the connection.
Changing the update rate will send information to the FaH client, if it can't reach it it should error out and reconnect. I cannot wrap my head around what might went wrong over there.

And i'm guessing all the requests go via IP, not DNS? I have adguard running on my network, so that might be a problem..

Jep, connects to the IP/hostname you provided when setting up the integration.

There hasn't been any updates on this issue. If your issue still persists feel free to reopen. Thank you!

Skimming through this and I have the same issue;

When the VM that is running Fold@home is restarted , my home assistant fold@home values are frozen and do not update. When the VM running Fold@home comes back online, home assistant fold@home values stay the same and are not updated

#I could implement a "reset_connection" service however.

This would be great, I could have an automation in Home Assistant that just runs every hour, as long as the service doesn't hurt the connection if it is established