Do not destroy mutexes in `coap_cleanup()`
hasheddan opened this issue · comments
Environment
- Build System: CMake
- Operating System: Ubuntu
- Operating System Version: 20.04
- Hosted Environment: ESP-IDF
libcoap Configuration Summary
Configuration can be found here: https://github.com/golioth/golioth-firmware-sdk/pull/163/files#diff-152d8027819ee12f017ab0edd01f2b2f218d24b7f751070b7d130842be298a3a
Problem Description
Tests run here reveal a null queue when acquiring a semaphore (see assert here). The decoded stack trace looks as follows:
0x4037b36f: xt_utils_compare_and_set at /home/hasheddan/code/github.com/espressif/esp-idf/components/xtensa/include/xt_utils.h:215
(inlined by) esp_cpu_compare_and_set at /home/hasheddan/code/github.com/espressif/esp-idf/components/esp_hw_support/cpu.c:483
0x40380bf9: spinlock_acquire at /home/hasheddan/code/github.com/espressif/esp-idf/components/esp_hw_support/include/spinlock.h:121
(inlined by) xPortEnterCriticalTimeout at /home/hasheddan/code/github.com/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:501
0x4037e4b4: vPortEnterCritical at /home/hasheddan/code/github.com/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos/portmacro.h:584
(inlined by) xQueueSemaphoreTake at /home/hasheddan/code/github.com/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/queue.c:1671
0x4200390d: pthread_mutex_lock_internal at /home/hasheddan/code/github.com/espressif/esp-idf/components/pthread/pthread.c:614
0x42003a9a: pthread_mutex_lock at /home/hasheddan/code/github.com/espressif/esp-idf/components/pthread/pthread.c:644
0x420189c6: coap_log_impl at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_debug.c:1199
0x4201b653: setup_client_ssl_session at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_mbedtls.c:1151 (discriminator 1)
0x4201b7e5: coap_dtls_new_mbedtls_env at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_mbedtls.c:1518
0x4201bb97: coap_dtls_new_client_session at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_mbedtls.c:1830
0x42022aaa: coap_dtls_establish at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_dtls.c:25
0x42020b3b: coap_session_check_connect at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_session.c:1248
0x4202202a: coap_new_client_session_psk2 at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/external/libcoap/src/coap_session.c:1370 (discriminator 3)
0x42010921: create_session at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/src/golioth_coap_client.c:601 (discriminator 85)
0x4201223a: golioth_coap_client_thread at /home/hasheddan/code/github.com/golioth/golioth-firmware-sdk/src/golioth_coap_client.c:894
0x403809d1: vPortTaskWrapper at /home/hasheddan/code/github.com/espressif/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
Expected Behavior
To not have the assert fail.
Actual Behavior
The failing assert.
Steps to reproduce
See previously linked source code and test run here: https://github.com/golioth/golioth-firmware-sdk/actions/runs/6115076185/job/16598003472#step:9:443
Code to reproduce this issue
I believe the issue here is that we are calling `coap_cleanup()`, then attempting to create a new session. In cd7b5de, `coap_cleanup()` was modified to destroy mutexes. Unfortunately, we cannot re-initialize these mutexes because calling `coap_startup()` again will short-circuit due to `coap_started` having been set to `1` the first time it was called.
I have tested a patch where the mutex destroy block is eliminated and the tests are passing correctly again.
I have opened a fix in #1226.
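The sequence described above can be reproduced in a small model. This is only a sketch of the reported behavior, not libcoap's actual code: `model_mutex_t`, `model_startup()`, `model_cleanup()`, and `model_log()` are hypothetical stand-ins, and the `started` flag plays the role of `coap_started`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a library-wide mutex; not libcoap's real type. */
typedef struct {
    bool initialized;
} model_mutex_t;

model_mutex_t log_mutex;   /* models the mutex taken by the logging path */
bool started = false;      /* models the coap_started flag */

/* Models a startup routine that short-circuits once the flag is set. */
void model_startup(void) {
    if (started) {
        return;            /* second call does nothing */
    }
    started = true;
    log_mutex.initialized = true;
}

/* Models a cleanup that destroys the mutex but leaves the flag set,
 * as described in the report (an assumption about the real code). */
void model_cleanup(void) {
    log_mutex.initialized = false;
}

/* Returns true if the mutex is usable; false models the failed assert. */
bool model_log(const char *msg) {
    assert(msg != NULL);
    return log_mutex.initialized;
}
```

Calling `model_startup()`, `model_cleanup()`, and `model_startup()` in that order leaves `model_log()` returning `false`, which corresponds to the null-queue assert in the stack trace.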
#1226 is not the answer here.
`coap_startup()` should be called at the start of `golioth_coap_client_thread()`, before the `while(1)`. For example, if any `coap_log*()` function is called before `coap_startup()` is called, the lock on the mutex it needs will fail. See coap_startup(3).
Note: `coap_startup()` is called in `coap_new_context()` 'just in case `coap_startup()` is not explicitly called'.
I'm not sure you need to call `coap_cleanup()` at all in your code, other than if `create_context()` or `create_session()` fail, at which point this Golioth client thread is useless and everything should be cleaned up.
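The suggested structure might look roughly like the following. It is a sketch with hypothetical stubs (`stub_startup()`, `stub_cleanup()`, `stub_create_session()`) in place of the real libcoap and client calls, and it uses a bounded loop instead of `while(1)` so that the control flow can be exercised.

```c
#include <assert.h>
#include <stdbool.h>

/* Counters so the call pattern can be observed; illustration only. */
int startup_calls = 0;
int cleanup_calls = 0;

/* Hypothetical stubs standing in for coap_startup() / coap_cleanup(). */
void stub_startup(void) { startup_calls++; }
void stub_cleanup(void) { cleanup_calls++; }

/* Pretend session setup: fails on attempt number fail_on (0 = never). */
bool stub_create_session(int attempt, int fail_on) {
    return attempt != fail_on;
}

/* Models the client thread; returns how many sessions completed. */
int client_thread(int max_sessions, int fail_on) {
    int completed = 0;

    assert(max_sessions >= 0);
    stub_startup();                  /* once, before the loop */

    for (int attempt = 1; attempt <= max_sessions; attempt++) {
        if (!stub_create_session(attempt, fail_on)) {
            stub_cleanup();          /* fatal: tear everything down */
            return completed;
        }
        /* Session work goes here; on disconnect, release only the
         * session and loop again without calling cleanup. */
        completed++;
    }
    return completed;
}
```

With this shape, the cleanup stub is reached only on the failure path, so a reconnect inside the loop never runs against destroyed mutexes.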