gotthardp / python-mercuryapi

Python wrapper for the ThingMagic Mercury API

Program gets stuck in reader.stop_reading()

HanYangZhao opened this issue · comments

I'm testing the program by repeatedly calling reader.start_reading() and reader.stop_reading(). At some point (after a few minutes or a few hours), reader.stop_reading() does not return and the program gets stuck (a sketch of the test loop is at the end of this post). I was unable to replicate the problem with a pure C implementation, so something is going on with the way Python threads are mixed with C threads. Using gdb I was able to confirm that one of the C threads gets stuck:

30 Jun 06:45:18 2020 - initializeReader
30 Jun 06:45:18 2020 - stopping current reader
stopping_read
read_callback_null
stats_callback_null
stopping_read_cs
read_callback
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) info thread
  Id   Target Id         Frame 
* 1    Thread 0x76fee210 (LWP 3019) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  2    Thread 0x769b3460 (LWP 3025) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0x75fff460 (LWP 3026) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0x757fe460 (LWP 3027) "python3" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
(gdb) bt 
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec3072 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3f22b4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3f21e0, cond=0x3f2288) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3f2288, mutex=0x3f21e0) at pthread_cond_wait.c:655
#4  0x76b6757a in TMR_stopReading (reader=0x3f10e0) at tm_reader_async.c:387
#5  0x76b5fab0 in Reader_stop_reading (self=0x3f10d8) at mercury.c:976
#6  0x0009ffb2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) thread 4
[Switching to thread 4 (Thread 0x757fe460 (LWP 3027))]
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      in ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec5194 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=1, futex_word=0x3f21a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#2  do_futex_wait (sem=sem@entry=0x3f21a8, abstime=0x0) at sem_waitcommon.c:115
#3  0x76ec5274 in __new_sem_wait_slow (sem=0x3f21a8, abstime=0x0) at sem_waitcommon.c:282
#4  0x76b6839e in process_async_response (reader=0x3f10e0) at tm_reader_async.c:977
#5  0x76b689b2 in do_background_reads (arg=0x3f10e0) at tm_reader_async.c:1218

tm_reader_async.c:1218  -> process_async_response(reader);
tm_reader_async.c:977 -> sem_wait(&reader->queue_slots);
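
For reference, the loop I'm using to reproduce this looks roughly like the following (a minimal sketch; the connection URI, region, and delays are placeholders for my actual setup):

import time
import mercury

# Placeholder URI and region; my real program uses its own settings.
reader = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=115200)
reader.set_region("NA")
reader.set_read_plan([1], "GEN2")

def on_tag(tag):
    print(tag.epc)

while True:
    reader.start_reading(on_tag)
    time.sleep(4)          # read for a few seconds
    reader.stop_reading()  # eventually this call never returns
    time.sleep(1)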

I'm experiencing a similar situation on Windows; the program sometimes hangs when stop_reading() is called. The reader stops, but the call never returns. It's easy to reproduce with a repeated sequence of start_reading(), a delay of a few seconds, and then stop_reading(). I built the python-mercuryapi module from source using Mercury API 1.35.

I ran into this issue as well. After much work with gdb, I tracked down the cause and implemented a fix (see the PR I just opened). Here is my commit message explaining the issue:

As of mercuryapi-1.31.4.35, when tm_reader_async.c's TMR_stopReading()
is called by Reader_stop_reading(), it ultimately ends up waiting for
reader->readState to become TMR_READ_STATE_DONE while holding the GIL.
This happens in tm_reader_async.c's do_background_reads(), but not until
the tag queue is emptied by read callbacks. These are triggered by
invoke_read_callback(), but this function blocks until it can acquire
the GIL. This frequently results in deadlock upon attempting to stop an
asynchronous read.

Previously, this was partially addressed by temporarily setting
self->readCallback (and self->statsCallback) to NULL in
Reader_stop_reading(), but this causes a SEGV in invoke_read_callback()
if it occurs after self->readCallback is verified to not be NULL but
before it is called. Instead of temporarily setting these function
pointers to NULL, release the GIL in Reader_stop_reading(), allowing
invoke_read_callback() to empty the tag queue and thereby avoiding
deadlock. Then, use a mutex to prevent concurrent calls to
TMR_stopReading(), since this was previously prevented by the GIL.

Temporarily setting self->statsCallback to NULL in Reader_stop_reading()
also prevented stats callbacks from occurring in subsequent asynchronous
reads. This functionality is now fixed.

Very nice! We've been waiting for a fix for this for a long time. We'll try it out; hope this gets merged!
Thanks!

Thanks a lot for your effort, @charlescochran!

@natcl, please let me know whether the fix works for you as well. I will merge it then.

Thanks, it might take a little while before we can test, but I'll let you know!

@charlescochran I tried your pull request and I'm still getting a freeze on stopReading after a couple of minutes of calling it in a loop (every 4 seconds).
Were you able to test it over a long period of time on your side?

Hmm. I have been regularly testing it over long periods of time and it has not been freezing.

Perhaps you are encountering a separate bug. Can you run your program using gdb? If so, when the hang happens, please hit Ctrl-C, do a bt, and paste the result here.

Thanks, will do!

itp@itp-master:/usr/tpm/bin$ gdb python3
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
(gdb) run uhfReader.py
Starting program: /usr/bin/python3 uhfReader.py
Cannot parse expression `.L1207 4@r4'.
warning: Probes-based dynamic linker interface failed.
Reverting to original interface.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
03 Jun 23:47:11 2022 - Starting ThingMagic reader
[New Thread 0x7683c460 (LWP 21173)]
03 Jun 23:47:11 2022 - Heartbeat started
03 Jun 23:47:11 2022 - initializeReader
03 Jun 23:47:11 2022 - stopReading: Reading was already stopped
03 Jun 23:47:11 2022 - Creating new reader on port tmr:///dev/ttymxc0
03 Jun 23:47:11 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_IDLE"}
03 Jun 23:47:11 2022 - startReading at 50 50 duty cycle
[New Thread 0x75eff460 (LWP 21181)]
[New Thread 0x756fe460 (LWP 21182)]
03 Jun 23:48:17 2022 - stopReading
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76f770d4 in __libc_recvfrom (fd=4, buf=0x7601b018, len=131072, flags=0, addr=..., addrlen=0x7effee38) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
#2  0x0002d2f2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I think I might be missing something needed to get the full stack trace?

Ah, I installed ionotify and it seems better; new log coming soon.

That seems better:

04 Jun 00:05:17 2022 - initializeReader
04 Jun 00:05:17 2022 - stopReading: Reading was already stopped
04 Jun 00:05:17 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:17 2022 - startReading at 50 50 duty cycle
04 Jun 00:05:21 2022 - stopReading
04 Jun 00:05:25 2022 - initializeReader
04 Jun 00:05:25 2022 - stopReading: Reading was already stopped
04 Jun 00:05:25 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:25 2022 - startReading at 50 50 duty cycle
04 Jun 00:05:29 2022 - stopReading
04 Jun 00:05:33 2022 - initializeReader
04 Jun 00:05:33 2022 - stopReading: Reading was already stopped
04 Jun 00:05:33 2022 - {"command": "initializeReader", "antennaPort": 1, "readPower": [[1, 2000], [2, 2000]], "region": "JP", "readerInitialized": true, "fwVersion": "01.0B.04.04-20.20.02.13-BL12.12.13.00", "serialNb": "3618527017600UM+", "readState": "TMR_READ_STATE_DONE"}
04 Jun 00:05:33 2022 - startReading at 50 50 duty cycle
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec308e in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3a382c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3a3758, cond=0x3a3800) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3a3800, mutex=0x3a3758) at pthread_cond_wait.c:655
#4  0x769f8682 in TMR_stopReading () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
#5  0x769f371c in ?? () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Interesting. Please run git log --oneline and paste the result.

Oh, you want to verify that I'm running the correct code?
I'm not actually running this from a Git repo; it's packaged as a Debian package along with other things.
Let me check on my side to confirm I'm running your branch.

I want to verify that, and also that you have the other latest commits on master. There have been 6 commits since the latest release, and I seem to remember the latest release having bugs that they fixed. My branch is based on the latest master rather than the latest release, so as long as you are not cherry-picking it, you should have those bug fixes as well (assuming you are running my branch, of course).

So here's the commit that was included on our side:
I basically merged your master branch.
[screenshot of the commit log]

That looks correct, in the sense that those are all of my changes. Can you verify that you have these commits as well?
v0.5.3...master
Also, when you say you "merged my master branch", do you mean you merged it into a clean copy of the upstream code, or into code that includes your own modifications? If the latter, maybe that's the reason for our different results.
In terms of your error: it does look similar to the deadlock I got (and thought I fixed), which is interesting. From your traces, it looks like you are calling stopReading twice for every call to startReading, once after startReading and once after initializeReader. The second one seems to give an error ("Reading was already stopped"). If you are actually calling the python wrapper's stop_reading() function twice, I would not be surprised if this results in a deadlock/error, even with my commit. I do not think the library supports multiple consecutive calls to stop_reading(), and it was not the problem I was attempting to fix.
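
If there is any chance that stop_reading() is being reached twice, a simple Python-side guard would rule that out. A minimal sketch, assuming you track the read state yourself (GuardedReader is just an illustrative name):

class GuardedReader:
    """Illustrative wrapper that remembers whether an asynchronous read is
    active, so stop_reading() is never called on an already-stopped reader."""

    def __init__(self, reader):
        self.reader = reader
        self.reading = False

    def start(self, callback):
        if not self.reading:
            self.reader.start_reading(callback)
            self.reading = True

    def stop(self):
        if self.reading:
            self.reader.stop_reading()
            self.reading = False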

I do have those commits as well. I merged your branch, but on top of a more recent local version (one that has the get_read_state() method from PR #117).
So perhaps that could be interfering?
I'll try your branch without any other modifications.

We're only calling stopReading once. The reason it says "Reading was already stopped" is that initializeReader stops the reader if it is still reading when called; in this case it wasn't, so the stop isn't actually performed.

I checked my program again, and we are calling other methods before start_reading; notably, we're calling set_region, set_read_plan and set_read_powers. This may be the cause of the lockup.
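The per-cycle sequence looks roughly like this (a sketch; the port, region, and power values are taken from our logs, the baud rate is illustrative, and I'm treating set_read_powers as taking (antenna, power) pairs, which is how we call it):

import time
import mercury

reader = mercury.Reader("tmr:///dev/ttymxc0", baudrate=115200)
reader.set_region("JP")
reader.set_read_plan([1, 2], "GEN2")
reader.set_read_powers([(1, 2000), (2, 2000)])  # (antenna, power) pairs

reader.start_reading(lambda tag: print(tag.epc))
time.sleep(4)
reader.stop_reading()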
I restarted the test calling only start_reading and stop_reading, and so far no crash (it's been running for 3 hours).
I'll leave it running overnight, but I think the fix for stop_reading does indeed work!
I think the other lockup I'm seeing is not related to that specific fix.
Will report tomorrow!

I may have spoken too soon: just after hitting send, it crashed again. I wasn't running gdb this time, so I'll run it again with gdb.
I have the impression that it got stuck in start_reading rather than stop_reading...

Gotcha. I have never seen it get stuck in start_reading, so I did not attempt to fix that. In my experience, the deadlock in stop_reading happens quite quickly (after a few seconds/minutes), especially if you increase the rate of starting and stopping (try every 0.25 seconds instead of every 4). Perhaps you could test again without my commit and see if it deadlocks quickly in stop_reading. Then test with my fix, and hopefully it will only crash occasionally in start_reading.

Here's the stack I got; it seems to still be stuck in stop_reading:

04 Jun 06:57:17 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:21 2022 - stopReading
04 Jun 06:57:25 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:29 2022 - stopReading
04 Jun 06:57:33 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:37 2022 - stopReading
04 Jun 06:57:41 2022 - startReading at 50 50 duty cycle
04 Jun 06:57:45 2022 - stopReading
04 Jun 06:57:49 2022 - startReading at 50 50 duty cycle
^C
Thread 1 "python3" received signal SIGINT, Interrupt.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
46      ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0x76ec308e in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x3a382c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3a3758, cond=0x3a3800) at pthread_cond_wait.c:502
#3  __pthread_cond_wait (cond=0x3a3800, mutex=0x3a3758) at pthread_cond_wait.c:655
#4  0x769f8682 in TMR_stopReading () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
#5  0x769f371c in ?? () from /usr/lib/python3/dist-packages/mercury.cpython-36m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I tried the official build and it indeed crashes sooner (with the same stack trace as above).
The newer code also crashes, but it takes more time. So it's definitely an improvement, although not 100% stable yet.

Interesting. If I run into a crash with the new code, I will let you know.

Also, if I may ask, what is your reason for starting/stopping the radio on a regular interval? I know this should be fixable, but one reason for wanting to do this is to allow periodic calls to get_temperature(). After my change, this should no longer be necessary, as the stats callback now works for all asynchronous reads, not just the first one.

On our side it's because we also need to write to the tag, so before writing we need to stop reading.

By the way, on our side we're getting the temperature by enabling stats (the reader.enable_stats method).
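
Roughly like this (a sketch; the reader setup is illustrative, and we just print the whole stats object, which on our build includes the module temperature):

import mercury

reader = mercury.Reader("tmr:///dev/ttymxc0", baudrate=115200)
reader.set_region("JP")
reader.set_read_plan([1, 2], "GEN2")

def on_stats(stats):
    # Printing the whole stats payload shows the available fields,
    # including the temperature on our build.
    print(stats)

reader.enable_stats(on_stats)
reader.start_reading(lambda tag: print(tag.epc))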

With the original code (without my changes), do stats work for you after the first time you restart reading? For me, with the original code, the stats callbacks would simply stop getting called after restarting reading for the first time. My code fixes this for me.

It does seem to work on our side.
Do you mean that it works the first time you start the reader, but that if you then call stop and start it doesn't work anymore?

Ok that's strange, never had that issue here.

@natcl I may have found the same deadlock you are experiencing with my code. This deadlock is directly related to an exception called "Buffer overflow", which occurs when the asynchronous tag callbacks do not process tags fast enough, leading to an excess (100+) of tag reads waiting to be processed.

If you don't mind, try enabling exception handling (if you haven't already) with Reader.enable_exception_handler() and print the exception in your exception callback. Then, see if a "Buffer overflow" (and/or "Timeout") is printed right before you deadlock. I would love to know if we are experiencing the same issue!

By the way, you can easily trigger a "Buffer overflow" by adding a sleep (0.1 seconds should do it; 0.5 definitely will) to your read callback function.
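
Something along these lines (a sketch; the reader setup is illustrative, and the sleep is only there to slow the callback down artificially):

import time
import mercury

reader = mercury.Reader("tmr:///dev/ttyUSB0", baudrate=115200)
reader.set_region("NA")
reader.set_read_plan([1], "GEN2")

def on_exception(e):
    # Watch for "Buffer overflow" (and/or "Timeout") here right before the hang.
    print("reader exception:", e)

def on_tag(tag):
    time.sleep(0.5)  # artificial delay: reliably provokes a "Buffer overflow"
    print(tag.epc)

reader.enable_exception_handler(on_exception)
reader.start_reading(on_tag)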

Thanks, will give it a try soon!

@natcl @gotthardp
I've investigated the buffer overflow error (caused by delays in the tag read callback). Ideally, the error would result in either (a) the reader simply dropping tag reads until the buffer is no longer full (which is, I believe, the intended behavior) or (b) the reader stopping its read and restarting it once the buffer is no longer full. Unfortunately, neither of these occurs. Due to a bug in the underlying MercuryAPI, a buffer overflow leaves the reader in a broken state, recoverable only by reinitialization. Manually stopping the read after a buffer overflow and before the "Timeout" error which occurs ~11 seconds later results in a deadlock. Nathanaël, I believe this is responsible for the deadlock you are still experiencing.
After determining that the failure to properly handle buffer overflows lies in the underlying MercuryAPI rather than in python-mercuryapi, I realized that python-mercuryapi is using MercuryAPI v1.31.4.35, released in Jan 2020. Since then, there have been three releases. The latest, v1.35.2.72, was released earlier this month. Its release notes mentioned something that sounded like a fix (the addition of the TMR_flush method), so I upgraded versions. In v1.35.2.72, a buffer overflow notably results in an exception ("CRC error") being raised repeatedly, and a "Timeout" error no longer occurs after ~11 seconds. Stopping the read, however, still results in a deadlock, and I am still unable to find a way of recovering from a buffer overflow without reinitializing the reader. The TMR_flush method does not help.
A buffer overflow can potentially be addressed by waiting for the "Timeout" exception to occur after ~11 seconds, stopping the read, reinitializing the reader, and starting the read, but this is not an ideal solution. The work-around I am adopting for my application is to attempt to prevent buffer overflows from occurring by no longer doing any I/O in my registered tag read callback function. Instead, I am simply storing the TagReadData object in a queue.Queue object, which is, in turn, processed and emptied by a different thread. Nathanaël, once you verify that your deadlock is preceded by a buffer overflow, taking a similar approach may allow you to proceed. This is, of course, also not an ideal solution. I may contact Jadak and explain the bug in hopes that it may be fixed in a subsequent version.
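In code, the work-around looks roughly like this (a sketch; handle_tag is a stand-in for whatever slow processing your application does, and reader is an already-initialized mercury.Reader):

import queue
import threading

tag_queue = queue.Queue()

def handle_tag(tag):
    print(tag.epc)  # stand-in for the real (slow) processing

def on_tag(tag):
    # Keep the registered callback as cheap as possible: just enqueue the read.
    tag_queue.put(tag)

def drain_tags():
    while True:
        tag = tag_queue.get()
        # All slow work (I/O, logging, database writes, ...) happens here,
        # outside the MercuryAPI callback thread.
        handle_tag(tag)

threading.Thread(target=drain_tags, daemon=True).start()
reader.start_reading(on_tag)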
As the buffer overflow deadlock is, in fact, unrelated to the original SEGV and deadlock I fixed, I believe we should still merge my prior commit. At this time, I will not commit the upgrade to MercuryAPI v1.35.2.72, as this upgrade did not fix the buffer overflow issue.
NOTE for anyone interested in eventually merging an upgrade to MercuryAPI v1.35.2.72: this version adds a preprocessor macro called TMR_ENABLE_CRC, which is defined to be 0 in tm_config.h. Several #if TMR_ENABLE_CRC statements are added around existing sections of code. Since TMR_ENABLE_CRC is 0 rather than 1 by default, this has the effect of changing the default behavior. In fact, it prevents me, using an M6e (serial) reader, from initializing my reader in the first place (trying to do so results in a "Timeout" exception). Redefining TMR_ENABLE_CRC to be 1 (which would require a patch to make it the default in python-mercuryapi) allows me to initialize a mercury.Reader object. Note, however, that even after this is done, initialization still fails with a "Timeout" exception if the previous run experienced a buffer overflow error. This last behavior does not occur in v1.31.4.35.

@natcl @gotthardp
Update: I submitted a bug report to Jadak. They told me that they were able to reproduce the issue and will let me know if there are any updates regarding a solution.

@charlescochran thanks for the detailed testing!
That's interesting. On the site where we are using the readers (around 60 of them), some devices will sometimes have timeout errors. That seems to happen only on specific devices, and it doesn't seem to be related to whether tags are present. It's probably not the same issue, though. I assume that in the case you describe, the buffer overflow happens only when tags are being read?
Let me know if it's still useful for me to test!

Thanks!

Yes, I've found that timeout errors can occur for other reasons, for example in reader initialization. A buffer overflow seems to cause a timeout error (in v1.31.4.35, anyway), but other things can cause them too. I believe that the timeout error itself is not causing the deadlock.
I would still love for you to check whether a buffer overflow is occurring prior to your deadlock. It is true that this should only occur when tags are being read. Just make sure exception callbacks are enabled so you will actually see the error. As I mentioned earlier, you can confirm that you will see it by reproducing the overflow, i.e. by adding a sleep() to your read callback function.

All right, the fix for the local issue (but not the upstream issue) has been merged to master. If you still encounter this bug, I recommend trying the latest master rather than the latest tagged release.

I just ran into this, I think for the first time (I'm on my fourth project using Jadak's hardware). I'm using MercuryAPI 1.31.4.35 and python-mercuryapi 0.5.4 (without the recent fixes). Both start_reading and particularly stop_reading would randomly hang, usually (though not always) forever, taking the main application with them. The behaviour seemed to be related to the number of tags in range. Adding an exception handler showed that the error was indeed a buffer overflow.

I solved it by running my readers as separate threads (I'm using two M6E Micros for this project) with all the heavy lifting in the main Python application thread. It seems to work OK, and the refactoring was an opportunity to improve the structure of my project, but I'm still interested to hear if/when there's an official upstream fix. Big thanks to everyone involved in this project!

Edit: It seems I spoke too soon; the overflow still happens, only less often now that the buffer is being emptied faster. But given enough visible tags (a dozen will do) and enough time (a minute or so), the buffer will still fill up. I don't see any way I could possibly make the buffer clear faster, now that the readers have their own threads. It looks like I'll have to pull the latest and rebuild, in the hope that 325efef will mitigate the issue.

I downloaded MercuryAPI 1.35.2.72 and used it to build a fresh pull of the latest master (including 325efef), and I'm happy to report that this seems to have fixed the issue; no hangs on either start_reading or stop_reading over several long read sessions with many visible tags. Looking good, thank you!

@clickworkorange
Happy to hear of your success! I'm also glad you were able to get it working with 1.35.2.72. Hopefully my note above was helpful in that task! I think it would be great for this project to migrate to using version 1.35.2.72 by default, but I didn't want to take that on.

@charlescochran This whole thread was very helpful - one of those rare cases where there is an exact and detailed bug report of the issue I'm having, including a solution. But I have to admit I did not read your comments regarding 1.35.2.72; I just went with it and it seems to work without any modifications. My modules run ancient firmware though (01.09.01), which might have something to do with it. I don't have access to a Win64 machine so I can't run Universal Reader Assistant (on my Win10 VM it sees the modules but cannot connect) to update the firmware.

@charlescochran Ha! I think I spoke too soon again - I'm now getting extremely weird segfaults and even a kernel panic. It seems libc gets its knickers in a twist:

python3: relocation error: python3: symbol __uflow version GLIBC_2.4 not defined in file libc.so.6 with link time reference
XXX lineno: 855, opcode: 0
Fatal Python error: initimport: can't import _frozen_importlib
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 851, in <module>
  File "<frozen importlib._bootstrap>", line 855, in _ImportLockContext
SystemError: unknown opcode
Aborted

This foe is beyond me, I fear.

@clickworkorange Yikes!

I'm not quite sure if this is related, but as mentioned in that same comment of mine above, there is still an upstream buffer overflow issue that I wasn't able to solve in this Python wrapper. I told Jadak about it and they said they were able to reproduce it, so I expect them to release a fix for it eventually, but until that happens, I haven't been able to figure out a way to make this library deadlock-free.

You seemed to be experiencing the buffer overflow issue earlier, which wasn't surprising to me. If this new problem is something different, that's unfortunate. You could also try changing the TMR_ENABLE_CRC preprocessor macro from 0 to 1, as mentioned above, to see if that changes things. It seems to affect a lot.

Old firmware could also play a role; I feel your pain about needing access to Windows!

I finally managed to catch a segfault while running my Python app under gdb. It doesn't happen often, maybe once in every few hundred calls to stop_reading(). I'm far less familiar with analysing crashes in gdb than I should be, so I only made a rudimentary trip up the trace:

Stopping RFID reader on tmr:///dev/ttyACM0

Thread 10 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x6e4ff440 (LWP 3302)]
0x001c2860 in PyObject_CallObject ()
(gdb) up
#1  0x72f3cdb8 in invoke_read_callback (reader=<optimized out>, cookie=<optimized out>, pdata=<optimized out>) at mercury.c:879
879	mercury.c: No such file or directory.
(gdb) up
#2  invoke_read_callback (reader=<optimized out>, pdata=<optimized out>, cookie=0x5ed3b8) at mercury.c:858
858	in mercury.c
(gdb) up
#3  0x72f481d8 in notify_read_listeners (reader=0x5ed3c0, trd=0x6d3005c0) at tm_reader_async.c:452
452	tm_reader_async.c: No such file or directory.
(gdb) up
#4  0x72f488dc in parse_tag_reads (arg=0x5ed3c0) at tm_reader_async.c:719
719	in tm_reader_async.c
(gdb) up
#5  0x76f49494 in start_thread (arg=0x6e4ff440) at pthread_create.c:486
486	pthread_create.c: No such file or directory.
(gdb) up
#6  0x76dbbdd8 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
73	../sysdeps/unix/sysv/linux/arm/clone.S: No such file or directory.
(gdb) up
Initial frame selected; you cannot go up.

Does this provide any clues?

  • Chance that this happens during development: ~1%
  • Chance that this happens during first client demo: 200%