ARMmbed / mbed-os-example-for-azure

Mbed OS example to connect to Azure IoT Hub

Repository from GitHub: https://github.com/ARMmbed/mbed-os-example-for-azure

Long term connection causes hard fault

AGlass0fMilk opened this issue

Description of defect

Hi, we are developing an extended example application based on this example. Basically, instead of connecting for a few seconds and sending 10 messages, we are connecting and sending a single message every 10 minutes.

Based on traces, it appears the Azure client periodically pings the IoTHub instance. After about 40 minutes, the example crashes with a hard fault.

Based on traces and halting program execution, the connection to the IoTHub is being disconnected. This occurs because the IoTHub connection is based on a shared access signature (SAS) token that has an expiration of less than one hour -- this explains the repeatable delay before the fault occurs.

I halted the program during the connection status callback and the following reason code was being passed in by the Azure IoT SDK: IOTHUB_CLIENT_CONNECTION_EXPIRED_SAS_TOKEN

After disconnection, another component of the application attempts to access the (now closed) IoTHub client connection and the hard fault occurs.
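
One way to keep other components from touching a client that has already gone away is to record the state reported by the connection status callback and check it before every use. The sketch below is only an illustration of that guard: `conn_status_t`, `conn_reason_t`, `hub_link_t`, `on_connection_status` and `hub_link_try_send` are hypothetical stand-ins written for this example, not the SDK's own types or functions, and the real send call is left as a comment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SDK's connection status and reason enums. */
typedef enum { CONN_AUTHENTICATED, CONN_UNAUTHENTICATED } conn_status_t;
typedef enum { REASON_OK, REASON_EXPIRED_SAS_TOKEN, REASON_NO_NETWORK } conn_reason_t;

/* Shared state describing the link to the hub (hypothetical). */
typedef struct {
    void *client;              /* opaque client handle; NULL once destroyed */
    bool connected;            /* updated only by the status callback */
    conn_reason_t last_reason; /* last reason code seen by the callback */
} hub_link_t;

/* Callback body: remember whether the connection is usable and why it changed. */
void on_connection_status(conn_status_t status, conn_reason_t reason, void *ctx)
{
    hub_link_t *link = ctx;
    link->connected = (status == CONN_AUTHENTICATED);
    link->last_reason = reason;
}

/* Guarded send: refuses to touch the client unless the link is known good. */
bool hub_link_try_send(hub_link_t *link, const char *msg)
{
    if (link == NULL || msg == NULL || link->client == NULL || !link->connected) {
        return false;
    }
    /* A real application would hand msg to the SDK here. */
    return true;
}
```

If the callback and the sender run on different threads, the shared state would also need a mutex or atomics; that is omitted here for brevity.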

This is related to Azure/azure-iot-sdk-c#1625

We are working on modifying the code to destroy the IoTHub client before the SAS token expires and to reconnect periodically so that the token is renewed.
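
The timing side of that workaround can be sketched as a small policy check: given the token lifetime and a safety margin, decide whether it is time to tear the client down and reconnect. This is a minimal sketch under assumptions; `renew_policy_t` and `renew_due` are hypothetical names, and the lifetime and margin values are whatever the application chooses, not values taken from the SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical renewal policy; all times are in seconds. */
typedef struct {
    uint32_t lifetime_s;     /* SAS token lifetime configured by the application */
    uint32_t margin_s;       /* how long before expiry to reconnect */
    uint32_t connected_at_s; /* uptime counter value when the client connected */
} renew_policy_t;

/* Returns true once the client should be destroyed and recreated. */
bool renew_due(const renew_policy_t *p, uint32_t now_s)
{
    /* Time allowed on one token; zero if the margin swallows the lifetime. */
    uint32_t budget = (p->margin_s < p->lifetime_s) ? p->lifetime_s - p->margin_s : 0;
    /* Unsigned subtraction stays correct if the uptime counter wraps. */
    uint32_t elapsed = now_s - p->connected_at_s;
    return elapsed >= budget;
}
```

The application's main loop would call `renew_due` with its uptime counter and, when it returns true, destroy the client, create a new one, and store the new connect time.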

See attached crash log capture.

crash-1.log

Target(s) affected by this defect ?

All targets

Toolchain(s) (name and version) displaying this defect ?

All

What version of Mbed-os are you using (tag or sha) ?

master

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

(python3-mbed) gdbeckstein@magic-man:~/Documents/embeddedplanet$ pip list
Package             Version     Location
------------------- ----------- -------------------------------------------------------------------
aenum               3.0.0
appdirs             1.4.3
asn1ate             0.6.0
attrs               19.3.0
Automat             20.2.0
beautifulsoup4      4.6.3
ble-serial          2.0.0
bleak               0.12.1
bluepy              1.3.0
cbor                1.0.0
certifi             2019.11.28
cffi                1.14.1
chardet             3.0.4
click               7.1
cmsis-pack-manager  0.2.10
cobs                1.1.4
cogapp              3.0.0
colorama            0.3.9
coloredlogs         15.0
constantly          15.1.0
crc16               0.1.1
crccheck            0.6
cryptography        2.9.2
cycler              0.10.0
dataclasses         0.8
dbus-next           0.2.2
distlib             0.3.1
docopt              0.6.2
docutils            0.17.1
ecdsa               0.15
elftools            0.1.0.dev0
ep                  0.0.1       /home/gdbeckstein/Documents/embeddedplanet/ep-app-generator
fasteners           0.15
filelock            3.0.12
Flask               1.1.2
future              0.16.0
futures             3.1.1
fuzzywuzzy          0.18.0
gitdb               4.0.5
GitPython           3.1.13
grip                4.5.2
hidapi              0.9.0.post2
humanfriendly       9.1
hyperlink           20.0.1
icetea              1.2.4
idna                2.7
imgtool             1.7.0rc1
importlib-metadata  1.6.0
importlib-resources 5.1.2
incremental         17.5.0
inflection          0.5.1
iniconfig           1.1.1
intelhex            2.2.1
intervaltree        3.1.0
itsdangerous        1.1.0
Jinja2              2.10.3
jsonmerge           1.7.0
jsonschema          2.6.0
junit-xml           1.8
keyrings.alt        4.0.2
kiwisolver          1.2.0
lockfile            0.12.2
Logbook             1.5.3
Mako                1.1.4
manifest-tool       1.5.2
Markdown            3.3.2
MarkupSafe          1.1.1
matplotlib          3.3.0
mbed-ble-test-suite 0.0.1       /home/gdbeckstein/Documents/mbed-os-bluetooth-integration-testsuite
mbed-cli            1.10.4
mbed-cloud-sdk      2.0.8
mbed-flasher        0.10.1
mbed-greentea       1.7.4
mbed-host-tests     1.5.10
mbed-ls             1.7.12
mbed-os-tools       0.0.15
mbed-tools          7.1.2
milksnake           0.1.5
monotonic           1.5
numpy               1.19.1
packaging           20.4
path-and-address    2.0.1
pc-ble-driver-py    0.14.2
pdoc3               0.9.2
Pillow              7.2.0
pip                 20.2
pkg-resources       0.0.0
pluggy              0.13.1
prettytable         0.7.2
protobuf            3.5.2.post1
psutil              5.6.6
py                  1.9.0
pyasn1              0.2.3
pycparser           2.20
pycryptodome        3.9.8
pyelftools          0.25
Pygments            2.7.1
PyHamcrest          2.0.2
pyparsing           2.4.7
pyrsistent          0.16.0
pyserial            3.4
pytest              6.1.1
python-can          3.3.4
python-dateutil     2.8.1
python-dotenv       0.14.0
pyudev              0.22.0
pyusb               1.0.2
PyYAML              4.2b1
requests            2.20.1
semver              2.10.2
setuptools          46.1.3
six                 1.12.0
smmap               3.0.5
sortedcontainers    2.4.0
soupsieve           2.0
tabulate            0.8.9
toml                0.10.1
tqdm                4.57.0
trollius            2.1.post2
Twisted             20.3.0
txdbus              1.1.2
typing-extensions   3.7.4.3
urllib3             1.24.2
virtualenv          20.4.3
Werkzeug            1.0.1
wheel               0.34.2
wrapt               1.12.1
yattag              1.13.2
zipp                3.1.0
zope.interface      5.2.0

How is this defect reproduced ?

Attempt to keep the client connected and send data for longer than 45 minutes.

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers.
Internal Jira reference: https://jira.arm.com/browse/IOTOSM-4232

Thanks for the analysis, that's really helpful!

@LDong-Arm Have you seen this kind of behavior in any applications ARM has helped develop using the Azure client port?

I've tried disconnecting after 30 minutes and reconnecting before the timeout expires, but there is still a crash happening related to this.

Perhaps there is a bug in the Azure SDK version this port uses.

@AGlass0fMilk
We haven't yet tried keeping a connection open for a long time. I don't think the original Mbed support inside the Azure SDK repo was developed by us; it came from the community. In bringing up the Azure example (this repository), we took the existing code from the Azure client and upgraded it from Mbed OS 5 to 6.

Thanks again for looking into this. Hopefully the linked issue in the SDK will be resolved.

@LDong-Arm I believe I have identified the source of the problem:

When disconnecting (due to SAS token expiry or otherwise), the Azure IoTHub client SDK frees the XIO transport driver at a layer above the MQTT client. The MQTT client is not notified of this and is left with a dangling pointer to the XIO transport. The next time the "do_work" function is called on the MQTT client, it dereferences this deleted XIO transport pointer and that causes a hard fault.
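
The failure mode described above, and one general way to avoid it, can be modeled in a few lines. This is a toy model, not the contents of the commit mentioned in this thread and not SDK code: `transport_t`, `mqtt_client_t`, `hub_transport_t` and the functions below are hypothetical stand-ins. The idea shown is that the layer that owns the transport tells the MQTT client to drop its pointer before freeing it, and that the work function checks for a missing transport.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the XIO transport. */
typedef struct { int open; } transport_t;

/* Hypothetical stand-in for the MQTT client, which borrows the transport. */
typedef struct { transport_t *xio; } mqtt_client_t;

/* Hypothetical upper layer that owns the transport and the MQTT client. */
typedef struct {
    transport_t *xio;
    mqtt_client_t *mqtt;
} hub_transport_t;

/* Notification hook: the MQTT client forgets its borrowed transport. */
void mqtt_clear_transport(mqtt_client_t *c)
{
    if (c != NULL) {
        c->xio = NULL;
    }
}

/* Work function: returns 0 when there is an open transport, -1 otherwise. */
int mqtt_do_work(mqtt_client_t *c)
{
    if (c == NULL || c->xio == NULL) {
        return -1; /* no transport, so nothing is dereferenced */
    }
    return c->xio->open ? 0 : -1;
}

/* Disconnect: notify the borrower first, then free the transport. */
void hub_disconnect(hub_transport_t *h)
{
    mqtt_clear_transport(h->mqtt);
    free(h->xio);
    h->xio = NULL;
}
```

Without the call to `mqtt_clear_transport`, the MQTT client would keep a pointer to freed memory and the next `mqtt_do_work` call would dereference it, which matches the hard fault described here.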

I have introduced a change in AGlass0fMilk/azure-umqtt-c@7e12aad and it seems to have fixed the issue.

I have my SAS token lifetime set to 1 minute to make the issue occur faster. So far it's been running for over an hour and disconnecting/reconnecting without crashing, even sending messages in between!

I'm not sure if this is resolved in the latest version of the Azure IoTHub SDK.

Is there any plan to update this port with the latest version of the Azure IoTHub C SDK?

For now, this is the fix...

Thanks @AGlass0fMilk, good to know! Is it sensible to upstream the patch?

We can update https://github.com/ARMmbed/mbed-client-for-azure to use the latest release of the SDK (azure-iot-sdk-c.lib) and its dependencies (dependencies/*.lib), or accept PRs that update them at any time, as I expect their interfaces to be fairly stable. Pointing at a specific commit on the master branch of the SDK or of a dependency should also be possible if we want a fix without waiting for the next release.

@LDong-Arm We are currently using the LTS_07_2020_Ref01 release of the Azure IoT C SDK. Support for that LTS version ended last month, so I don't think they'd be interested in upstreaming this patch... I'll mention it on their GitHub issues and see if they have any info (e.g., has it been fixed?).

I'll try updating the version here along with its dependencies and rerun my tests.

@AGlass0fMilk Thanks for letting us know, it'd be good to see if this patch is still needed or works with the latest SDK.