[BUG]

Question

wolfspirituk opened this issue 2 months ago · comments

Bug report

The clock crashes or shuts down irregularly every few hours. Seems that heavy MQTT messaging has an effect.

Describe the bug

For some reason it used to crash regularly at almost exactly 4 hours (I don't think my code was causing that, but it might have been).
It can also run for much longer.

The crash is "black screen" (as best I can tell). Pressing left and right causes it to restart - usually - but not always.

The reason the uptime trace is a sawtooth from 30-31 May is that I had it on auto restart (via a hardware switch). I turned that off on 1 June in case it was a causal factor (it doesn't seem to be), which is why there are gaps now.

Uptime and free RAM:

RAM isn't an issue - never goes below 105k (40k seems to be reported elsewhere as a problem)
I think I am only using MQTTs to operate the "clock". (I don't know if HASS does stuff in the b/g)

I can reasonably reliably crash the clock if I hit it with lots of MQTT messages for new apps over a few seconds: maybe 30-40 requests, each with a individual app delete payload ("") and then the new data. However, it isn't guaranteed.

Additional information

Devices involved:
- Model: Ulanzi Awtrix Smart Pixel Clock 2882 (TC001)
- awtrix3 version: [ eg. v0.45 ]

To Reproduce

Turn it on and hit it with lots of MQTT messages; that might cause it.
Sorry, I realise how hard it is to make use of that!

Expected behavior

The clock should run continuously.

Logs

The HA logs show the /api/loop call succeeding every minute, until this appears:

2024-06-01 02:25:11.306 ERROR (MainThread) [homeassistant.components.rest.data] Error fetching data: http://awtrix_0b34f0.local/api/loop failed with All connection attempts failed
2024-06-01 02:25:11.306 DEBUG (MainThread) [homeassistant.components.rest] Finished fetching rest data data in 3.062 seconds (success: True)

That error then fills the log every minute until I reboot the clock.

Additional context

It can die with only a couple of custom apps, or with 15. It is the number of messages that seems to matter, not the number of current apps.

Stephan · Answer 1 · Sun Jun 02 2024 03:03:53 GMT+0800 (China Standard Time)

I successfully handled 6,000 app updates per minute across 20 custom applications without any issues. As you've already identified, the previous crashes and reboots were solely due to a memory leak, which does not appear to be the problem in your current situation.

To troubleshoot, I recommend disabling each automation and then re-enabling them one by one. This approach will help you pinpoint which specific automation or request is causing the reboots.

"I think I am only using MQTTs" is not true, because your HA log shows you're also using the REST API for /api/loop.

Stephan · Answer 2 · Sun Jun 02 2024 03:09:47 GMT+0800 (China Standard Time)

Another question.
You wrote:
"with a individual app delete payload ("") and then the new data"
Why delete the app if you're going to resend it anyway? You can overwrite it directly without deleting it first, which saves one call per update.
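To illustrate the saving, here is a minimal Python sketch that only builds the list of messages and publishes nothing. It assumes custom apps are addressed on a `<prefix>/custom/<appname>` topic and that an empty payload removes an app, as described above. The prefix `awtrix_0b34f0`, the app names, and the helper `plan_messages` are hypothetical placeholders.

```python
import json


def plan_messages(prefix, apps, delete_first=False):
    """Return the (topic, payload) pairs needed to refresh the given apps.

    `apps` maps a custom app name to its payload dict. The topic layout
    `<prefix>/custom/<name>` is an assumption for this sketch.
    """
    messages = []
    for name, data in apps.items():
        topic = f"{prefix}/custom/{name}"
        if delete_first:
            # An empty payload is the "delete" message mentioned in the report.
            messages.append((topic, ""))
        messages.append((topic, json.dumps(data)))
    return messages


# Hypothetical apps and prefix, for illustration only.
apps = {"temperature": {"text": "21.5 C"}, "weekday": {"text": "Mon"}}
overwrite = plan_messages("awtrix_0b34f0", apps)
delete_then_send = plan_messages("awtrix_0b34f0", apps, delete_first=True)

print(len(overwrite), "messages when overwriting")
print(len(delete_then_send), "messages when deleting first")
```

With 15 to 20 apps refreshed in one burst, overwriting halves the number of messages the clock has to process.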

wolfspirituk · Answer 3 · Mon Jun 03 2024 01:08:10 GMT+0800 (China Standard Time)

I only delete the apps when I want to toggle them off or on in the display. Normally they just get refreshed when the data changes.
I have a dashboard that lets me show or hide the apps (data) and choose whether the display is motion-sensor activated or permanently on (during the day). I can also reset the config, just in case it gets out of sync, but that also handles a day-of-week app that only needs updating daily, along with the colour of the calendar for work days, weekends and holidays.

Anyway - I've run it by powering up and then letting it just run on a normal cycle. The roughly 4 hours before crashing is repeatable. Uptimes in seconds before each crash: 14454, 14442, 14461, 14485, 14486, 14474, 14463, 14452, 14466, 14475, 14477, 14465.
I have had longer runs, e.g. 43397, which is almost 3 * 14465. I think there is a pattern! :-)
Those uptimes account for ca. 75% of ALL the runs since I got it. That is far too consistent to be random.
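As a quick check of those numbers, here is a small Python sketch that summarises the listed uptimes and compares the longer run against their mean. The variable names are only for illustration, and it is a rough sanity check, not a formal statistical test.

```python
from statistics import mean, pstdev

# Uptimes in seconds before each crash, as listed above.
uptimes = [14454, 14442, 14461, 14485, 14486, 14474,
           14463, 14452, 14466, 14475, 14477, 14465]
long_run = 43397  # the longer run mentioned above

avg = mean(uptimes)
spread = max(uptimes) - min(uptimes)
ratio = long_run / avg  # how many "typical" cycles fit in the long run

print(f"mean uptime: {avg:.0f} s ({avg / 3600:.2f} h)")
print(f"range: {spread} s, std dev: {pstdev(uptimes):.1f} s")
print(f"long run / mean: {ratio:.3f}")
```

A spread of well under a minute on a cycle of about four hours, plus a long run that is close to a whole multiple of that cycle, is what suggests something periodic.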

I will try to run this with a subset of the apps to see if it runs longer.

Hoping this might trigger a thought...

Lübbe Onken · Answer 4 · Wed Jun 05 2024 21:32:57 GMT+0800 (China Standard Time)

@wolfspirituk your dashboard looks fancy. How did you set this up?

wolfspirituk · Answer 5 · Thu Jun 06 2024 03:16:21 GMT+0800 (China Standard Time)

"@wolfspirituk your dashboard looks fancy. How did you set this up?"

Thanks. It is two Grid Cards. A 3 column for the top row and 4 column underneath for most of the toggles. Then populated with Custom:button-cards.

I'm quite pleased as it works well for what I do with it. I'm still learning and amazed how much people can squeeze out of HA - got to keep trying to get to their level! :-)

wolfspirituk · Answer 6 · Thu Jun 06 2024 03:19:08 GMT+0800 (China Standard Time)

@Blueforcer - This has now been stable for a couple of days. I guess that some messages, when corrupted, cause issues. When I say "corrupted" I mean the rubbish payloads that my buggy code threw out (Jinja and YAML are new to me).
I had assumed that invalid structures would be filtered out to avoid issues.
I'll update if I ever work out what is causing it.
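One way to guard against this on the sending side is to check each payload before it is published. Below is a minimal Python sketch, assuming each app payload is meant to be either an empty string (delete) or a single JSON object. The helper name `is_valid_app_payload` is hypothetical, and the check does not validate individual fields.

```python
import json


def is_valid_app_payload(raw):
    """Return True if `raw` is "" (delete) or parses as a JSON object.

    Assumption for this sketch: a well-formed app payload is a single
    JSON object; anything else is treated as corrupted.
    """
    if raw == "":
        return True
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    return isinstance(parsed, dict)


# Example: a well-formed payload, a delete, and two broken template outputs.
for candidate in ['{"text": "hi"}', "", "{text: hi", "None"]:
    print(repr(candidate), is_valid_app_payload(candidate))
```

Dropping or logging anything that fails the check would at least show which automation produces the bad payloads.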

Stephan · Answer 7 · Wed Jun 12 2024 04:25:20 GMT+0800 (China Standard Time)

Ok, please feel free to reopen this issue