max0x7ba / atomic_queue

C++ lockless queue.

spin-locking

Andriy06 opened this issue · comments

I'd like to raise a concern about spin-locking in the atomic_queue implementation. I read the README section about this, but I'm still not convinced. Let me describe an issue I saw:
The setting was an NDA platform with a non-fair OS scheduler, running on millions of consumer devices. We had a spin-lock that was using _mm_pause/__yield instructions, same as atomic_queue, which resulted in a low but statistically significant number of deadlocks on that platform only. All cores were occupied by spinning threads, while the thread that was supposed to unlock them was never re-scheduled. The culprit here is a non-fair scheduler. It was giving a CPU core to a thread until it got blocked (e.g. on a mutex) or a higher-priority thread got ready for execution.
The fix was to add std::this_thread::yield() (not exactly but doing the same) into spin-locking. This solved the problem because, while it wasn't communicating to the OS which thread it was waiting for, it freed a CPU core, and that was enough to eventually schedule the unlocking thread.
Now, std::this_thread::yield() is orders of magnitude slower than _mm_pause, at least on the multiple platforms where I had a chance to benchmark it. A typical solution is a hybrid approach: spin on _mm_pause/__yield for a while (hundreds of iterations) for low-latency response, and only then switch to spinning on std::this_thread::yield() as the last resort. That way it doesn't matter how slow it is.
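
A minimal sketch of that hybrid approach, assuming x86 and a simple test-and-set lock (the class name, the spin budget and the lock layout are illustrative, not atomic_queue's code):

```cpp
#include <atomic>
#include <thread>
#include <immintrin.h>   // _mm_pause on x86; ARM builds would use __yield()

class HybridSpinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        constexpr int kPauseSpins = 512;            // tuning constant, an assumption
        int spins = 0;
        while (locked_.exchange(true, std::memory_order_acquire)) {
            if (spins++ < kPauseSpins)
                _mm_pause();                        // low-latency on-core spin
            else
                std::this_thread::yield();          // free the core so the lock
                                                    // holder can be rescheduled
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```
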
To be clear, I don't think it's a fully reliable solution. I suspect it can still be vulnerable to the priority inversion problem, though I don't have data to back this claim. To avoid priority inversion, proper blocking is required. So the most versatile solution I can think of would be:

  • spin on _mm_pause for N iterations;
  • spin on std::this_thread::yield() for another M iterations;
  • block the thread, e.g. via std::atomic<T>::wait(); though, as I recall, this library targets C++14, so it would rather be std::mutex.
    The cost is that the unlocking thread would need to check every time whether there's a waiting thread that needs to be notified (a relaxed load), plus 8 bytes of memory, and the implementation would be a bit convoluted (a rough sketch follows this list). So, from a practical point of view, I'd go with just adding std::this_thread::yield() to the spinning: it's simple enough, costs nothing and solves a real problem in many real cases.
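
A rough sketch of that three-stage wait, kept C++14-compatible by blocking on a mutex/condition_variable instead of std::atomic::wait (the names, spin budgets and the one-shot flag are assumptions for illustration, not atomic_queue's code):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <immintrin.h>

struct Gate {
    std::atomic<bool> open_{false};
    std::atomic<int> waiters_{0};   // extra state the unlocking side must check
    std::mutex m_;
    std::condition_variable cv_;

    void wait_open() {
        constexpr int kPauseSpins = 512;  // N, a tuning value
        constexpr int kYieldSpins = 64;   // M, a tuning value
        for (int i = 0; i < kPauseSpins; ++i) {
            if (open_.load(std::memory_order_acquire)) return;
            _mm_pause();
        }
        for (int i = 0; i < kYieldSpins; ++i) {
            if (open_.load(std::memory_order_acquire)) return;
            std::this_thread::yield();
        }
        std::unique_lock<std::mutex> lock(m_);
        waiters_.fetch_add(1);            // seq_cst, see note below
        cv_.wait(lock, [this] { return open_.load(); });
        waiters_.fetch_sub(1);
    }

    void set_open() {
        open_.store(true);                          // seq_cst, see note below
        if (waiters_.load() > 0) {                  // cheap check on the fast path
            std::lock_guard<std::mutex> lock(m_);   // serialize with a registering waiter
            cv_.notify_all();
        }
    }
};
```

The sketch uses the default seq_cst ordering for open_ and waiters_ rather than a relaxed check: with weaker ordering, a waiter registering concurrently with set_open() could miss its wakeup.
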

We had a spin-lock ... All cores were occupied by spinning threads, while the thread that was supposed to unlock them was never re-scheduled.

atomic_queue has a spin-lock per queue element, each shared between only the one producer thread and the one consumer thread operating on that element. That seems to differ from your scenario, where all threads contend on one and the same spinlock: 2 threads modifying 1 queue element is the minimum possible contention, while all threads modifying 1 spinlock is the maximum possible contention.
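
A simplified illustration of that per-element pattern (not atomic_queue's actual code; as far as I understand the design, the slot index comes from atomic head/tail counters in the real library, while here it is simply passed in to keep the sketch short):

```cpp
#include <atomic>
#include <utility>
#include <immintrin.h>

template<class T, unsigned Size>
struct SlotQueue {
    struct Slot {
        std::atomic<bool> full{false};
        T value{};
    };
    Slot slots[Size];

    void push(unsigned index, T v) {
        Slot& s = slots[index % Size];
        while (s.full.load(std::memory_order_acquire))   // wait only for the
            _mm_pause();                                  // consumer of this slot
        s.value = std::move(v);
        s.full.store(true, std::memory_order_release);
    }

    T pop(unsigned index) {
        Slot& s = slots[index % Size];
        while (!s.full.load(std::memory_order_acquire))  // wait only for the
            _mm_pause();                                  // producer of this slot
        T v = std::move(s.value);
        s.full.store(false, std::memory_order_release);
        return v;
    }
};
```
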

The culprit here is a non-fair scheduler. It was giving a CPU core to a thread until it got blocked (e.g. on a mutex) or a higher-priority thread got ready for execution.

That sounds similar to Linux real-time FIFO thread scheduling.

Linux does real-time thread throttling to prevent exactly this problem of real-time FIFO threads never descheduling themselves off the CPUs and, in the worst case, hogging all CPUs and preventing all other threads/processes from making any progress.

The fix was to add std::this_thread::yield() (not exactly but doing the same) into spin-locking.

The root cause of the deadlocks you encountered, if I understand you correctly, is the indefinite suspension of the thread holding the spinlock by a higher-priority real-time FIFO thread, which then busy-waits on that spinlock and never deschedules off the CPU, aka priority inversion.

Calling std::this_thread::yield() doesn't fix the priority inversion problem; the problem still persists. Rather, it performs a corrective action only after priority inversion has wasted enough CPU cycles blocking the forward progress of at least 2 threads that it becomes apparent something has to be done.

On Linux, std::this_thread::yield/sched_yield suspends the calling thread to run another thread of the same priority only, if any; it doesn't resume lower-priority threads and, hence, cannot help with priority inversion. See the notes section in man sched_yield. You may also like to peruse the Linux kernel mailing-list thread about sched_yield not being the right solution for such problems, which the README defers to.

Coping with priority inversion is outside the scope of lockless algorithms, which only guarantee forward progress of all or some of the other threads modifying the same queue object when one thread is suspended in the middle of a queue modification operation.

Fair scheduling without priority inversion / indefinite thread suspension is a prerequisite for atomic_queue.

IMO, the right fix for your deadlocks is any of:

  • Eliminate priority inversion by using the same fixed scheduling priority for all threads modifying the queue. The Linux real-time scheduling classes FIFO and RR have strictly fixed priorities; the RR class also has a time quantum, which prevents indefinite suspension of threads of the same priority when there are no other available CPUs (a minimal sketch follows this list).
  • Use a wait-free queue with forward progress guarantees for all other threads when one thread was suspended in the middle of the queue modification operation.
  • Use platform-specific synchronization primitives designed to cope and not deadlock when priority inversion is in effect.
  • Use OS-specific features to avoid thread interruption and suspension while inside critical sections in user-space. Solaris did that.
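
For the first option, a minimal Linux sketch of assigning the same fixed SCHED_RR priority to every thread that touches the queue (the helper name and the priority value are illustrative):

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

int set_same_rt_priority(std::thread& t, int priority = 10) {
    sched_param sp{};
    sp.sched_priority = priority;  // identical value for all queue threads
    // SCHED_RR adds a time quantum, so equal-priority threads still round-robin
    // even when there are more runnable threads than CPUs.
    return pthread_setschedparam(t.native_handle(), SCHED_RR, &sp);  // 0 on success
}
```
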