On an M1 on macOS, when profiling with Tracy, ReaderWriterQueue's try_enqueue / try_dequeue sometimes seem to cause spikes of several milliseconds

Question

IliasBergstrom opened this issue 2 years ago

Context:
I am working on an audio application where the real-time requirement is that the "audio callback" has to finish executing within e.g. 1.33 ms (depending on settings), or there will be audible clicks due to "buffer underruns". So naturally, ReaderWriterQueue is immensely useful for communicating with the audio thread!

I am running the application on an M1 Mac, built with Xcode (through CMake) as a Release build with debug symbols and these flags set:
-DCMAKE_OSX_DEPLOYMENT_TARGET=11.0 -DCMAKE_OSX_ARCHITECTURES=arm64

Audible clicks appear - a symptom of "buffer underruns".

Issue:
I profiled the application using Tracy (which coincidentally also uses readerwriterqueue), and I see there that calls to ReaderWriterQueue's try_enqueue / try_dequeue sometimes seem to cause spikes of several milliseconds.

I've measured this by placing Tracy's "ZoneScoped" entries inside each of these methods, and then watching in Tracy how long these calls take. While they usually take a few microseconds, I've occasionally measured up to 8 milliseconds, e.g. for try_dequeue [screenshot not included].
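
A Tracy-independent way to cross-check such numbers is to time each call with std::chrono::steady_clock and keep the worst case. The sketch below is only an illustration, not the setup used in this issue: `LatencyStats` and `timed_call` are hypothetical names, and the callable passed in would wrap whichever queue call is being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of readerwriterqueue or Tracy):
// tracks the worst-case duration and how many calls exceeded a budget.
struct LatencyStats {
    std::int64_t budget_ns;        // e.g. 1330000 for a 1.33 ms audio callback
    std::int64_t max_ns = 0;       // slowest call seen so far
    std::uint64_t calls = 0;       // total number of recorded calls
    std::uint64_t over_budget = 0; // calls slower than budget_ns

    explicit LatencyStats(std::int64_t budget) : budget_ns(budget) {
        assert(budget > 0);
    }

    void record(std::int64_t ns) {
        ++calls;
        if (ns > max_ns) max_ns = ns;
        if (ns > budget_ns) ++over_budget;
    }
};

// Runs `fn`, records its wall-clock duration in `stats`, and returns
// fn's result. `fn` must return a value (e.g. the bool from a try_* call).
template <typename Fn>
auto timed_call(LatencyStats& stats, Fn&& fn) -> decltype(fn()) {
    const auto start = std::chrono::steady_clock::now();
    auto result = fn();
    const auto stop = std::chrono::steady_clock::now();
    stats.record(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    return result;
}
```

A call site might look like `timed_call(stats, [&] { return queue.try_dequeue(item); })`, where `queue` and `item` are assumed to exist. Note that a wall-clock measurement like this also counts any time the thread spends descheduled, so a large value does not by itself show that the measured code is slow.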

I will now look for an Intel mac, to run the same test and see if there's any difference.

Meanwhile, given the new M1 ARM processors are known to require "heavier" memory barriers, I wanted to ask:

Has anyone confirmed that ReaderWriterQueue can perform under such tight timing constraints?
Or, alternatively, has anyone experienced the same thing I am describing here?

I am asking since on the one hand ReaderWriterQueue is widely used, but on the other hand, the README.md still states:

"Note that it's only been tested on x86(-64); if someone has access to other processors I'd love to run some tests on anything that's not x86-based."

Thank you!

Cameron · Answer 1 · Tue Nov 29 2022 21:59:16 GMT+0800 (China Standard Time)

There's no loop in try_dequeue, so whatever is causing the 1 ms delay is not executing code. Perhaps the thread was pre-empted?

Ilias Bergström · Answer 2 · Tue Nov 29 2022 22:58:10 GMT+0800 (China Standard Time)

Thank you Cameron for the quick reply!

That's a very good suggestion - I was assuming that these lock-free calls would always be fast irrespective of the thread they're invoked in (real-time audio thread or "worker"), but I think I'm setting myself up for false positives there. I will make the needed changes to my test setup so that only invocations from the real-time thread are registered!

Ilias Bergström · Answer 3 · Tue Nov 29 2022 23:35:04 GMT+0800 (China Standard Time)

I have now ensured that I only log from within the real-time (audio) thread, and indeed there are far fewer spikes!

I still see some spikes though, so I will test tomorrow on an Intel Mac. I could also set up Tracy directly with your benchmark, to separate testing the queue code from testing my own code on the M1.
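
One way to restrict logging to the real-time thread, as described above, is a thread_local flag that the audio thread sets once and that the recording code checks. This is a minimal sketch under that assumption; every name in it is hypothetical and none of it is Tracy or readerwriterqueue API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical names throughout.
// Each thread gets its own copy of this flag; it defaults to false.
thread_local bool t_is_realtime_thread = false;

std::atomic<std::uint64_t> g_recorded_samples{0}; // samples accepted so far
std::atomic<std::int64_t> g_max_ns{0};            // largest accepted duration

// Call once from the audio thread, e.g. on its first callback.
void mark_current_thread_realtime() { t_is_realtime_thread = true; }

// Records a duration only when called from a thread that marked itself
// as real-time. Returns true if the sample was recorded.
bool record_if_realtime(std::int64_t duration_ns) {
    assert(duration_ns >= 0);
    if (!t_is_realtime_thread) return false; // ignore worker threads

    g_recorded_samples.fetch_add(1, std::memory_order_relaxed);

    // Lock-free "store if larger" update of the running maximum.
    std::int64_t prev = g_max_ns.load(std::memory_order_relaxed);
    while (duration_ns > prev &&
           !g_max_ns.compare_exchange_weak(prev, duration_ns,
                                           std::memory_order_relaxed)) {
        // prev was refreshed by compare_exchange_weak; retry.
    }
    return true;
}
```

With this in place, measurements taken on worker threads are dropped, so only the durations that matter for the audio deadline show up in the statistics.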

Cameron · Answer 4 · Tue Nov 29 2022 23:37:47 GMT+0800 (China Standard Time)

The microsecond range can be explained by uncached (or cached but requiring cross-core synchronization) memory accesses, but the millisecond range has to be something else.

Ilias Bergström · Answer 5 · Thu Dec 01 2022 18:45:33 GMT+0800 (China Standard Time)

I have now run the same code on an Intel Mac. While I haven't yet managed to restrict the Tracy logging to the "real-time" audio thread only, I can already confirm that the behaviour is vastly different: the calls are indeed in the microsecond or nanosecond range.

I still need to implement a way to log only the real-time thread on the Intel Mac as well, so that the comparison is direct. I also want to repeat the test by tracing your benchmark code, to rule out user error, which is of course still the most likely cause.

Ilias Bergström · Answer 6 · Mon Dec 05 2022 16:25:15 GMT+0800 (China Standard Time)

The cause was indeed that some threads were being aggressively pre-empted on the M1. By invoking the following in each such thread:
pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
I got performance comparable to the Intel Mac I tested on.
Some performance issues still remain, but I would be very surprised if they were caused by ReaderWriterQueue. If they seem to be, I'll re-open this, but for now it's best to close. Thank you!
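
For readers who want to try the same fix, the call above could be wrapped roughly as follows. This is a sketch, not the code from the application discussed here: `request_user_interactive_qos` and `start_interactive_worker` are hypothetical names, and on non-Apple platforms the wrapper is assumed to simply do nothing.

```cpp
#include <cassert>
#include <thread>
#if defined(__APPLE__)
#include <pthread/qos.h> // declares pthread_set_qos_class_self_np
#endif

// Requests the user-interactive QoS class for the calling thread.
// Returns true if the request succeeded; on non-Apple platforms this
// is a no-op that returns false.
bool request_user_interactive_qos() {
#if defined(__APPLE__)
    // The call returns 0 on success.
    return pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0) == 0;
#else
    return false;
#endif
}

// Hypothetical helper: starts a thread that raises its own QoS class
// before running `work`. The outcome is written to *qos_applied, which
// must stay valid until the thread has been joined.
template <typename Work>
std::thread start_interactive_worker(Work work, bool* qos_applied) {
    assert(qos_applied != nullptr);
    return std::thread([work, qos_applied]() mutable {
        *qos_applied = request_user_interactive_qos();
        work();
    });
}
```

The QoS call affects only the thread that makes it, which is why the helper issues it from inside the new thread rather than from the thread that creates it.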