gdgib / span

SPAN Integration for HomeAssistant/HACS

Integration throws an exception and shuts down if panel returns an http error

mbbush opened this issue · comments

Currently, every time the span panel is restarted, there is a period of time during startup when it responds to local API requests with non-successful HTTP status codes. I think it's 502 or 503 or something. When this happens, the integration logs an exception and then basically shuts down, doing nothing at all until Home Assistant is restarted.

I think this is a problem with the way the async code is handled: an exception is being thrown where it isn't appropriate to do so.

2023-11-27 01:36:30.958 DEBUG (MainThread) [custom_components.span_panel.span_panel_api] HTTP GET Attempt #2: http://192.168.42.124/api/v1/status
2023-11-27 01:36:34.030 DEBUG (MainThread) [custom_components.span_panel.span_panel_api] HTTP GET Attempt #3: http://192.168.42.124/api/v1/status
2023-11-27 01:36:35.081 DEBUG (MainThread) [custom_components.span_panel] Finished fetching span panel SN-TODO data in 22.555 seconds (success: False)
2023-11-27 01:36:35.081 DEBUG (MainThread) [custom_components.span_panel.span_panel_api] HTTP GET Attempt #1: http://192.168.42.124/api/v1/status
2023-11-27 01:36:35.095 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 754, in _async_init_reauth
    await hass.config_entries.flow.async_init(
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 880, in async_init
    flow, result = await task
                   ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 908, in _async_init
    result = await self._async_handle_step(flow, flow.init_step, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/data_entry_flow.py", line 389, in _async_handle_step
    result: FlowResult = await getattr(flow, method)(user_input)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/config_flow.py", line 145, in async_step_reauth
    await self.setup_flow(TriggerFlowType.UPDATE_ENTRY, entry_data[CONF_HOST])
  File "/config/custom_components/span_panel/config_flow.py", line 75, in setup_flow
    panel_status = await span_api.get_status_data()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/span_panel_api.py", line 58, in get_status_data
    response = await self.get_data(URL_STATUS)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/span_panel_api.py", line 107, in get_data
    response = await self._async_fetch_with_retry(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/span_panel_api.py", line 132, in _async_fetch_with_retry
    resp.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Server error '502 Bad Gateway' for url 'http://192.168.42.124/api/v1/status'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502

The simplest fix is to just rip out the retry logic from the API calls and let Home Assistant try again on the next polling interval.
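
For reference, here is a minimal sketch of what "let Home Assistant try again on the next polling interval" could look like at the coordinator level. The class name, constructor arguments, and method names are illustrative, not the integration's actual code:

import logging

import httpx
from homeassistant.helpers.update_coordinator import DataUpdateCoordinator, UpdateFailed

_LOGGER = logging.getLogger(__name__)


class SpanPanelCoordinator(DataUpdateCoordinator):
    """Poll the panel and report failures to HA instead of crashing the task."""

    def __init__(self, hass, span_api, poll_interval):
        # poll_interval is a datetime.timedelta
        super().__init__(hass, _LOGGER, name="span_panel", update_interval=poll_interval)
        self._span_api = span_api

    async def _async_update_data(self):
        try:
            return await self._span_api.get_status_data()
        except (httpx.TransportError, httpx.HTTPStatusError) as err:
            # UpdateFailed marks this refresh as unsuccessful; the coordinator
            # simply polls again on the next interval instead of dying.
            raise UpdateFailed(f"Error communicating with panel: {err}") from err

With something like that in place, the entities just show as unavailable for the duration of the outage and recover on their own once the panel answers again.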

Are you using the door-bypass method, or do you have an authentication token? If you're using the door bypass, it might be worth trying the integration with a token. If you would like to get one, do the door thing and then run this command:

curl -d '{"name": "<UNIQUE IDENTIFIER OF YOUR CHOICE>", "description": "<WHATEVER YOU LIKE>"}' -X POST http://<YOUR SPAN PANEL IP ADDRESS HERE>/api/v1/auth/register

It should give you something like (obfuscated):

{"accessToken":"<loooooong string of characters>","tokenType":"bearer","iatMs":122343853725}%

Even if this fixes your problem it doesn't close the issue, but hopefully it helps those who are suffering from it until a proper fix lands.

Will look into this more over the next 1-3 weeks for the 0.0.9 release.

I am seeing this issue frequently as well

2024-01-17 21:10:06.265 ERROR (MainThread) [custom_components.span_panel] Authentication failed while fetching span panel SN-TODO data: Server error '502 Bad Gateway' for url 'http://10.47.3.24/api/v1/status'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502

This makes the data pretty unreliable. Happened twice within the last 12 hours

I changed L135 of span_panel_api.py to except (httpx.TransportError, httpx.HTTPStatusError): and it seems to have helped with the issue. Not sure if it's the correct solution, but it made it stable for now.
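
For anyone else trying this, the change amounts to broadening the exception handling inside the retry helper so a 502 gets retried instead of escaping to the caller. A rough sketch of the idea (the real span_panel_api.py differs in its details; RETRY_COUNT and the backoff are placeholders):

import asyncio

import httpx

RETRY_COUNT = 3  # placeholder; the integration has its own retry setting


async def _async_fetch_with_retry(client: httpx.AsyncClient, url: str) -> httpx.Response:
    """Fetch a URL, retrying on transport errors and on HTTP error statuses."""
    last_exc: Exception | None = None
    for attempt in range(1, RETRY_COUNT + 1):
        try:
            resp = await client.get(url)
            resp.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
            return resp
        except (httpx.TransportError, httpx.HTTPStatusError) as exc:
            # With only httpx.TransportError caught here, a 502 escaped straight
            # to the caller; adding httpx.HTTPStatusError means it is retried too.
            last_exc = exc
            await asyncio.sleep(attempt)  # crude linear backoff between attempts
    raise last_exc

Even with the retry broadened, the final failure still needs to be handled by the caller (e.g. the coordinator sketch above) so the integration doesn't end up wedged.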

@mbbush original comment seems to be about startup issues.

@billyburly are you seeing 502s only at startup? The way you worded it, it sounds like you see 502s during normal operation. If that's true, then don't get me wrong I'll work on making the code more robust of course, but it worries me a little that something might be going on with your panel or network, too. Any chance you have an HTTP proxy in there somewhere? Or that your panel is having a problem of some kind?

@gdgib I'm seeing it during normal operation. It appears to happen randomly. Sometimes it will run for a few days, others for only a couple hours. Nothing weird in my network between the panel and HA. Only way to recover is reload the integration or restart HA.

That's definitely not something I would expect to be common, nor necessarily reproducible. Like I said, I'll 100% make the integration more robust against this kind of thing, but you might want to see if you can test/debug/monitor your panel. 502s are typically a sign of a server-side software issue (or a proxy, which is why I asked). In other words, while I can improve the integration to handle these errors, you're seeing errors from the panel itself, and that worries me.

I don't have enough info to state that for sure. Just a worry on my part that I wanted to pass on.

Anyway, as you can see this is on the 0.0.9 roadmap. But I'm pretty sure all I can do is improve the retry logic, which won't necessarily prevent the 502s, just deal with them more elegantly.

I've experienced a similar issue but admittedly have not dug deep enough to know the source of the errors causing it. Sometimes once a day, sometimes six or seven times a day, sometimes not for several days, all of my endpoints will come back unavailable and my integration will break until I restart the integration or Home Assistant.

I have bandaged this by setting up an automation: every time the entities go unavailable it sends me an email and restarts the integration, so while I don't have easy access to the logs to see exactly what is causing it, I do know how often it happens. What I'll do is tweak that automation to no longer restart things automatically (to preserve the logs) and instead just fire off a notification, so that I can go into the logs and pull out those errors to see if they can help us.

My take on this is we probably don't need to figure out how to prevent those errors… but making it so that the integration doesn't go into a bad state requiring intervention, and instead recovers gracefully and continues polling after the wave of whatever this is passes, would be a nice quality-of-life boost.

So here's my log files from when it happened once each on the last two days:

Logger: custom_components.span_panel
Source: helpers/update_coordinator.py:353

Authentication failed while fetching span panel SN-TODO data: Server error '502 Bad Gateway' for url 'http://192.168.2.5/api/v1/status' For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/502

It left the integration in a non-working state until I restarted the integration (or HA). In the integration page the Span integration was moved to the top of the screen in red in the "pending configuration" state as well.

Seems to me like some kind of graceful wait and try again later when this happens might save the headaches on this?

Agreed, of course. That's pretty much the original suggestion here, and it's a good idea which I'll be implementing. I do worry about why you're seeing it and I am not, though. Seems like we'd lose some live power data during that wait, which isn't ideal. I'm trying to think of ways to figure out why the 502 is showing up at all so we can fix that, in addition to adding the retry.

Yeah, I agree... this is a weird one, and the inability to get down inside the system and learn more about why the Span is doing this in the first place doesn't help us any.

Back when I thought this was a problem unique to me, I just had an automation that restarted things the moment one of the sensors came back Unknown or Unavailable. Overall I had a pretty reliable continuation of info: the historical data is cumulative so it went on fine, and it's only the live data where I lose a blip of a few seconds of real-time logging. As you said, though, that's not okay long term, though I know I am an anecdotal user... for me personally I'm not really relying on that data much, if at all, so it hasn't pained me yet, fortunately.

If I say things like "wireshark" and "mitm ssl proxy" is that something you have skill with? I'd be curious to get a long-term network capture that includes a 502, but I'm on the fence whether it's easier to ask you to dig into heavy duty network debugging tools (option A) or just add stuff to the logs in 0.0.9 (option B).

Option A reduces the number of releases, and guarantees the most info, though not success, but it could be a LOT of work on your part. Option B is way less effort. I'm inclined to start with Option B, which is no effort from you, and only do A if really necessary, since it will be a PITA. But if using wireshark with an SSL/TLS proxy is super easy for you, maybe Option A is good.

I am totally down for that.

HOWEVER... I am currently at my apt in LA for a week and accessing my home LAN as needed via VPN so it adds some layers of complexity for me. We could do option B maybe in a branch and I can pull it down to my secondary HA install I have specifically for testing and debugging, and load that as a custom integration on that one?

(Interestingly, about 50% of the time when one breaks, both break... the other half of the time only one breaks, which makes me think the polling times on them overlap SOME of the time, but not every time, when whatever this is happens transiently.)

I will say I have used wireshark in the past, only briefly, but am very familiar with what it is and does. I'm not averse to crash-coursing myself in a refresher on setting this up, worst case.

Any chance you're running two instances of HA BOTH connecting to the panel at the same time? That might explain the issue, and that would suggest a possible fix.

I thought about that but from what I recall, these issues began months before I spun up the second instance for testing purposes. Additionally I've intentionally configured their polling intervals to be very awkwardly odd number offsets. I think my primary one polls every 15 seconds and the testing one is something like every 36?

I'm fairly certain I set all this up after the fact in a way that should be non-interfering, but for the sake of clarity I'm just gonna shut down my secondary testing instance's Span integration for a few days and see if this changes anything, simply so there is no ambiguity.

Thank you for trying that, please let me know if it helps. In parallel I'll check a few things in the code to see if I can fix both the retry, and make sure the integration supports two instances if possible.

FWIW each instance of Home Assistant is using its own API key for the integration, and neither HA system is aware of the existence of the other install (and each one is running on a VLAN isolated from the other), so from the Span panel's point of view it should just be two completely different and unrelated things talking to its API. (If that context helps any.)

It does. I would have probably had to ask about that.

FWIW, I'm planning to dig into this next week, and get 0.0.9 out with a few other minor things.

I spent a few years managing QA and first party certification teams back in the day at a few game studios, including EA. I take my QA efforts seriously. ;) lol

I knew I liked you for a good reason. I like people who actually have to face the users of software to some degree or another. They know the difference between what the software should do and what the customer wants it to do, and how to tell the two apart.

At my last job, before our layoffs in the winter, I was the VP of ecosystem… You'd better believe the product team did NOT like me and my team's constant, incessant needling about user experience… lol

I was a real pain in the ass for a reason! 🤣

I'm a career DevRel, BizDev, QA, and DevOps person in tech. I am an engineering team's worst nightmare.

Okay, I can confirm this happens when I only have one instance running on the network. I shut down my testing environment entirely last night, but at 4:36 AM this morning I had another hiccup.

I'm putting together the 0.0.9 dev plan now. I'm going to try and both log a little more information in places, and also implement a better retry for the connections. I wish I could say my dev pace was going to be super fast, but I've had to work on issues in a few other integrations lately too.

The issue is due to erroneous code in init that raises an authentication exception when in fact a 502 occurred. A pull request has been submitted.
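
In other words, the bug was in how the failure was classified rather than in the failure itself; the shape of the fix is roughly this (a sketch, not the actual merged patch; names mirror the earlier coordinator sketch):

import httpx
from homeassistant.exceptions import ConfigEntryAuthFailed
from homeassistant.helpers.update_coordinator import UpdateFailed


async def _async_update_data(self):  # method on the coordinator
    """Map HTTP failures onto the right Home Assistant exception."""
    try:
        return await self._span_api.get_status_data()
    except httpx.HTTPStatusError as err:
        if err.response.status_code in (401, 403):
            # A genuine credential problem: let HA start the reauth flow.
            raise ConfigEntryAuthFailed from err
        # Anything else (e.g. a 502 while the panel restarts) is transient:
        # fail this poll and try again on the next interval.
        raise UpdateFailed(f"Panel returned HTTP {err.response.status_code}") from err
    except httpx.TransportError as err:
        raise UpdateFailed(f"Error communicating with panel: {err}") from err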

That explains SO much, and also why I couldn't ever track down any correlating leads... because I was looking in all the wrong places and never thought to consider that possibility!

@gdgib please create a new release with the merged fix, as some users are reporting the underlying issue and then need to install a custom repository. Alternatively we can change the default HACS repository. Thanks!