Single ESP box appears to be getting duplicated

Question

nikito opened this issue a year ago · comments

When flashing the WAS willow build onto the ESP box, I am seeing that it is getting duplicated in WAS:

Additionally, there seem to be issues in the log, and the unit is stuck in a loop. I'm attaching a log export from the ESP box with full debug enabled for reference. 😃
willow-was-log.txt

Note if this is due to things not yet being ready for WAS please let me know, wasn't sure if I was testing prematurely 😆

Nick Bento · Answer 1 · Mon Jun 26 2023 23:29:02 GMT+0800 (China Standard Time)

Looking closer at the logs, I'm thinking something goes wrong when it attempts to write the config? The Core Panic seems to happen right after it reads the config from WAS, so seems likely?

Nick Bento · Answer 2 · Tue Jun 27 2023 00:13:25 GMT+0800 (China Standard Time)

I did some debugging and testing. I think the issue is that when write_config is called, it calls deinit_audio, which in turn calls audio_thread_cleanup(hdl_at). At this point the audio thread hasn't been initialized yet, so audio_thread_cleanup receives a null pointer for hdl_at. As a test I commented out the deinit_audio call in config_write, and everything works fine now (well, relatively speaking, as I think we still want to deinit audio on updates 😆 ). This also got rid of the duplicates, though I'm not sure whether we still need to look into why they happened in the first place.

I'm thinking we'd want to add a check here to see whether the audio pipeline has been initialized before attempting to deinit it?

On an unrelated note seems my replies are truncated again, but I assume that is just because issue 159 from the willow repo isn't merged yet 😄

Kristian Kielhofner · Answer 3 · Tue Jun 27 2023 00:21:27 GMT+0800 (China Standard Time)

I was just debugging this issue myself!

Using addr2line (see NOTES.md) you have this exactly right. I've also noticed that if I include a willow.json in the SPIFFS user partition prior to flash this issue is avoided (makes sense).

I've also noticed occasional duplicates. While it figures itself out eventually it's also "not great" and potentially confusing to users.

From what I remember, checking whether an audio pipeline has been started isn't well supported via ESP-ADF, so we'd have to get creative to address it via that route.

I've also noted (with all debug logging enabled) that deinit loops aggressively and spams the console, which is another "not great" thing that would likely be fixed by whatever we come up with to address this issue.

Nick Bento · Answer 4 · Tue Jun 27 2023 00:24:41 GMT+0800 (China Standard Time)

Gotcha. I was going to add a simple null check on hdl_at and hdl_ap in the deinit method, not sure that would be effective though? (I don't have any knowledge of the ESP-ADF framework yet, so I'm not sure if that would cause some other sort of issue 😆 )
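
For illustration only, a minimal sketch of the kind of guard being described. Everything here is a stand-in: fake_handle_t and fake_cleanup are hypothetical placeholders rather than the real ESP-ADF types or calls, deinit_audio_guarded is an invented name, and this is not necessarily how the actual fix works.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real handle types; not an ESP-ADF definition. */
typedef struct {
    int running;
} fake_handle_t;

/* Both handles stay NULL until the audio side has been initialized. */
static fake_handle_t *hdl_at = NULL; /* audio thread handle named in the thread */
static fake_handle_t *hdl_ap = NULL; /* second handle named in the thread */

static int cleanup_calls = 0;

/* Stand-in for the real cleanup call. Like the crash described above,
 * it dereferences the handle, so calling it with NULL would fault. */
static void fake_cleanup(fake_handle_t **h)
{
    (*h)->running = 0;
    *h = NULL;
    cleanup_calls++;
}

/* Tears down only what was actually initialized.
 * Returns true if anything was cleaned up, false if there was nothing to do. */
bool deinit_audio_guarded(void)
{
    bool did_work = false;

    if (hdl_at != NULL) {
        fake_cleanup(&hdl_at);
        did_work = true;
    }
    if (hdl_ap != NULL) {
        fake_cleanup(&hdl_ap);
        did_work = true;
    }
    return did_work;
}
```

With this shape, a config write that arrives before audio init becomes a no-op on the audio side instead of a null dereference, and a second deinit call is also harmless because each handle is reset to NULL after cleanup.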

Kristian Kielhofner · Answer 5 · Tue Jun 27 2023 00:27:33 GMT+0800 (China Standard Time)

I would want to consult with @stintel on how to best approach this - he's currently in really deep on a bunch of other WAS stuff but hopefully we can get to this later today.

We'll be deep in Willow to support dynamic configuration of wake word anyway so this lines up pretty well.

Nick Bento · Answer 6 · Tue Jun 27 2023 00:28:49 GMT+0800 (China Standard Time)

Gotcha, no problem! I have a temp workaround in place for now so I'll continue poking around other areas. I also tested OTA and that worked perfectly so far! 😃

Kristian Kielhofner · Answer 7 · Tue Jun 27 2023 00:31:06 GMT+0800 (China Standard Time)

Speaking of OTA - we've been really impressed with how FAST it is. Whether it's a dynamic config update or OTA, the vast majority of the time is spent on the reboot and reconnecting to Wi-Fi - especially if you have KRACK mitigation active on the AP(s).

In any case 10-15 seconds to update the config or firmware on all of your devices is very promising!

Nick Bento · Answer 8 · Tue Jun 27 2023 00:33:14 GMT+0800 (China Standard Time)

Indeed I noticed the same, downtime for an OTA/Config push is very minimal!

Stijn Tintel · Answer 9 · Tue Jun 27 2023 01:03:47 GMT+0800 (China Standard Time)

Can you test if toverainc/willow@27e6130 fixes the deinit_audio crash?

Kristian Kielhofner · Answer 10 · Tue Jun 27 2023 01:29:57 GMT+0800 (China Standard Time)

Confirmed it fixes the boot loop but initial apply looks kind of rough (not sure it matters):

W (17:25:44.706) WILLOW/CONFIG: key audio_response_type not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.753) WILLOW/CONFIG: key speech_rec_mode not found in config, use bogus value to avoid NULL pointer dereference
I (17:25:44.777) WILLOW/HASS: stopping WebSocket client
I (17:25:44.777) WILLOW/WAS: stopping WebSocket client
E (17:25:44.783) WILLOW/HASS: failed to stop WebSocket client: ESP_ERR_INVALID_ARG
W (17:25:44.797) WILLOW/CONFIG: key audio_response_type not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.809) WILLOW/CONFIG: key speech_rec_mode not found in config, use bogus value to avoid NULL pointer dereference

----------------------------- ESP Audio Platform -----------------------------
|                                                                            |
|                       ESP_AUDIO-v1.7.2-20e6bd0-b92a149                     |
|                     Compile date: Nov 30 2022-07:50:12                     |
------------------------------------------------------------------------------
E (8454) ESP_AUDIO_CTRL: Error input parameter. line:1163
I (17:25:44.865) WILLOW/AUDIO: audio player initialized
E (17:25:44.868) I2S: register I2S object to platform failed
W (17:25:44.880) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.886) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.897) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.908) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.922) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.930) WILLOW/CONFIG: key wake_mode not found in config, use bogus value to avoid NULL pointer dereference
I (17:25:44.942) WILLOW/AUDIO: Using record buffer '-1'
W (17:25:44.949) WILLOW/CONFIG: key speech_rec_mode not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.960) WILLOW/CONFIG: key audio_codec not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:44.970) WILLOW/CONFIG: key audio_codec not found in config, use bogus value to avoid NULL pointer dereference
E (8579) AFE_SR: vad_mode is error, please modify it!

E (8580) AFE_SR: AFE config error!

E (17:25:44.992) RECORDER_SR: recorder_sr.c:562 (recorder_sr_create): Got NULL Pointer
W (17:25:45.006) WILLOW/CONFIG: key audio_codec not found in config, use bogus value to avoid NULL pointer dereference
W (17:25:45.012) WILLOW/CONFIG: key audio_codec not found in config, use bogus value to avoid NULL pointer dereference
I (17:25:45.044) WILLOW/AUDIO: app_main() - start_rec() finished
E (17:25:45.049) lcd_panel.io.i2c: panel_io_i2c_rx_buffer(128): i2c transaction failed
E (17:25:45.052) TT21100: esp_lcd_touch_tt21100_read_data(173): I2C read error!
E (17:25:45.057) TT21100: esp_lcd_touch_new_i2c_tt21100(103): TT21100 init failed
E (17:25:45.065) TT21100: Error (0xffffffff)! Touch controller TT21100 initialization failed!
E (17:25:45.076) WILLOW/LVGL: failed to initialize touch screen: ESP_FAIL
I (17:25:45.082) WILLOW/NETWORK: MAC address: 7c:df:a1:e8:20:58
I (17:25:45.088) WILLOW/MAIN: Startup complete! Version: 27e6130. Waiting for wake word.
I (17:25:45.124) WILLOW/CONFIG: /spiffs/user/config/willow.json updated, restarting
I (17:25:45.125) WILLOW/SYSTEM: restarting after 6 seconds

Coming back (ESP BOX Lite - so expected touch errors):

I (17:25:58.399) WILLOW/WAS: initializing WebSocket client
I (17:25:58.400) WILLOW/NETWORK: initializing SNTP client
I (12:25:58.403) WILLOW/NETWORK: Using DHCP SNTP server
I (12:25:58.408) WILLOW/HASS: HASS URL: http://hass:8123/api/components
I (12:25:58.432) WILLOW/WAS: WebSocket connected
I (12:25:58.440) WILLOW/HTTP: HTTP status='200' content_length='2279'
I (12:25:58.444) WILLOW/HASS: Home Assistant has Assist Pipeline support
I (12:25:58.445) WILLOW/HASS: HASS URL: ws://hass:8123/api/websocket
I (12:25:58.464) WILLOW/HASS: WebSocket connected
I (12:25:58.497) WILLOW/AUDIO: audio_hal_ctrl_codec: ESP_OK
I (12:25:58.500) WILLOW/AUDIO: audio_element_getinfo(hdl_ae_hs): sample_rate='44100' channels='2' bits='16' bps = '0'

----------------------------- ESP Audio Platform -----------------------------
|                                                                            |
|                       ESP_AUDIO-v1.7.2-20e6bd0-b92a149                     |
|                     Compile date: Nov 30 2022-07:50:12                     |
------------------------------------------------------------------------------
I (12:25:58.543) WILLOW/AUDIO: audio player initialized
E (12:25:58.545) I2S: register I2S object to platform failed
I (12:25:58.553) WILLOW/AUDIO: Using record buffer '6'
MC Quantized wakenet9: wakenet9l_v3h24_alexa_3_0.625_0.645, tigger:v3, mode:3, p:0, (Jun 14 2023 11:15:21)
I (12:25:58.731) WILLOW/AUDIO: app_main() - start_rec() finished
E (12:25:58.734) lcd_panel.io.i2c: panel_io_i2c_rx_buffer(128): i2c transaction failed
E (12:25:58.736) TT21100: esp_lcd_touch_tt21100_read_data(173): I2C read error!
E (12:25:58.744) TT21100: esp_lcd_touch_new_i2c_tt21100(103): TT21100 init failed
E (12:25:58.752) TT21100: Error (0xffffffff)! Touch controller TT21100 initialization failed!
E (12:25:58.762) WILLOW/LVGL: failed to initialize touch screen: ESP_FAIL
I (12:25:58.768) WILLOW/NETWORK: MAC address: 7c:df:a1:e8:20:58
I (12:25:58.775) WILLOW/MAIN: Startup complete! Version: 27e6130. Waiting for wake word.
I (12:26:08.784) WILLOW/TIMER: Wake LCD timeout, turning off LCD
I (12:26:23.326) WILLOW/AUDIO: AUDIO_REC_WAKEUP_START
I (12:26:23.759) WILLOW/AUDIO: AUDIO_REC_VAD_START
I (12:26:23.762) WILLOW/AUDIO: Using WIS URL 'http://wis:20001/api/willow?model=base'
I (12:26:23.764) WILLOW/AUDIO: WIS HTTP client starting stream, waiting for end of speech
I (12:26:25.445) WILLOW/AUDIO: AUDIO_REC_VAD_END
I (12:26:25.446) WILLOW/AUDIO: AUDIO_REC_WAKEUP_END
I (12:26:25.498) WILLOW/AUDIO: WIS HTTP client HTTP_STREAM_POST_REQUEST, write end chunked marker
I (12:26:25.561) WILLOW/AUDIO: WIS HTTP client HTTP_STREAM_FINISH_REQUEST
I (12:26:25.561) WILLOW/AUDIO: WIS HTTP Response = {"language":"en","text":"Turn off dining room."}
I (12:26:25.569) WILLOW/HASS: sending command to Home Assistant via WebSocket: {
	"end_stage":	"intent",
	"id":	1687800385,
	"input":	{
		"text":	"Turn off dining room."
	},
	"start_stage":	"intent",
	"type":	"assist_pipeline/run"
}
I (12:26:25.689) WILLOW/HASS: home assistant response_type: action_done
I (12:26:25.694) WILLOW/HASS: received run-end event on WebSocket: {
	"id":	1687800385,
	"type":	"event",
	"event":	{
		"type":	"run-end",
		"data":	null,
		"timestamp":	"2023-06-26T17:27:08.146712+00:00"
	}
}
I (12:26:25.707) WILLOW/AUDIO: Using WIS TTS URL 'http://wis:20001/api/tts?speaker=CLB&text=Turned off light'
I (12:26:26.826) WILLOW/AUDIO: WIS TTS playback finished

Nick Bento · Answer 11 · Tue Jun 27 2023 01:38:26 GMT+0800 (China Standard Time)

Also confirm the core panic/bootloop is fixed with this commit!

Stijn Tintel · Answer 12 · Tue Jun 27 2023 01:43:23 GMT+0800 (China Standard Time)

> Confirmed it fixes the boot loop but initial apply looks kind of rough

Try toverainc/willow@9eba78e?

Kristian Kielhofner · Answer 13 · Tue Jun 27 2023 01:51:55 GMT+0800 (China Standard Time)

NICE - about as clean as it's going to get!

Stijn Tintel · Answer 14 · Tue Jun 27 2023 01:53:59 GMT+0800 (China Standard Time)

> When flashing the WAS willow build onto the ESP box, I am seeing that it is getting duplicated in WAS:

I don't think this is something we can avoid. When Willow crashes, there is no clean close/disconnect of the WebSocket. WAS will notice this after a while and clean up the dead client from ConnMgr.

Kristian Kielhofner · Answer 15 · Tue Jun 27 2023 01:58:57 GMT+0800 (China Standard Time)

I think this is minor enough to be (potentially) addressed after the initial stable release.

As one idea, in WAS when OTA update is initiated could we remove the connection (device) from WAS? It won't appear until it comes back.

Ideally when we have a better frontend UI we could also use websockets (or something) so the Clients "page" doesn't need to be manually refreshed - which I think itself would address most of this.

Stijn Tintel · Answer 16 · Tue Jun 27 2023 02:00:21 GMT+0800 (China Standard Time)

> As one idea, in WAS when OTA update is initiated could we remove the connection (device) from WAS? It won't appear until it comes back.

We already do that: https://github.com/toverainc/willow/blob/feature/was/main/ota.c#L176

Nick Bento · Answer 17 · Tue Jun 27 2023 02:03:44 GMT+0800 (China Standard Time)

I noticed as well that it clears up after a bit, and OTA also makes them disappear if invalid, as @stintel said. I agree this isn't really a showstopper, but it may be nice to work on later. 🙂

Kristian Kielhofner · Answer 18 · Tue Jun 27 2023 02:40:35 GMT+0800 (China Standard Time)

I was suggesting removing them from WAS (in WAS) as soon as the OTA command is sent.

In any case I should probably familiarize myself with the WAS code a bit!

Kristian Kielhofner · Answer 19 · Mon Sep 25 2023 20:21:36 GMT+0800 (China Standard Time)

In the wasng branch WAS makes sure the client devices returned via API are unique by MAC.
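
As a rough illustration of "unique by MAC", here is a small sketch. It is not the wasng code: client_t, its fields, and dedupe_by_mac are hypothetical names, keeping the most recent entry per MAC is an assumption, and C is used only to keep the examples in this thread in one language.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical client record; the field names are illustrative only. */
typedef struct {
    char mac[18];      /* e.g. "7c:df:a1:e8:20:58" plus the terminating NUL */
    long connected_at; /* larger value means a more recent connection */
} client_t;

/* Compacts clients[] in place so that each MAC appears exactly once.
 * When a MAC repeats, the entry with the larger connected_at is kept
 * (an assumption; the thread does not say which entry should win).
 * Returns the number of unique entries now at the front of the array. */
size_t dedupe_by_mac(client_t *clients, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Look for this MAC among the entries already kept. */
        for (j = 0; j < out; j++) {
            if (strcmp(clients[j].mac, clients[i].mac) == 0) {
                if (clients[i].connected_at > clients[j].connected_at) {
                    clients[j] = clients[i];
                }
                break;
            }
        }

        /* Not seen before: keep it. */
        if (j == out) {
            clients[out++] = clients[i];
        }
    }
    return out;
}
```

The comparison is a case-sensitive strcmp, so this assumes MAC strings are already normalized (for example, all lowercase as in the logs above). A stale connection left behind by a crash and the fresh connection after reboot share a MAC, so only one of them would be listed.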