Possible spurious wakeup with Condvar
hniksic opened this issue · comments
The documentation of `Condvar` says that there are no spurious wakeups. While working on a custom SPSC channel, I seem to have discovered that spurious wakeups occur. I believe I have managed to extract the issue into a minimal example which should (I hope) be easy to follow.
On this example, `cargo test --release` triggers the assertion in `wait_send` most of the time:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use parking_lot::{Condvar, Mutex};

#[derive(Debug)]
struct Shared {
    write_pos: AtomicU64,
    // bool - whether sender is still live
    send_update: (Mutex<bool>, Condvar),
}

#[derive(Debug)]
pub struct Sender {
    shared: Arc<Shared>,
}

#[derive(Debug)]
pub struct Receiver {
    read_pos: u64,
    shared: Arc<Shared>,
}

pub fn channel() -> (Sender, Receiver) {
    let shared = Arc::new(Shared {
        write_pos: AtomicU64::new(0),
        send_update: (Mutex::new(true), Condvar::new()),
    });
    let sender = Sender {
        shared: Arc::clone(&shared),
    };
    let receiver = Receiver {
        read_pos: 0,
        shared,
    };
    (sender, receiver)
}

impl Sender {
    pub fn send(&mut self, _item: u32) {
        let write_pos = self.shared.write_pos.load(Ordering::SeqCst);
        // here we'd write the item
        let locked = self.shared.send_update.0.lock();
        self.shared.write_pos.store(write_pos + 1, Ordering::SeqCst);
        drop(locked);
        self.shared.send_update.1.notify_one();
    }
}

impl Drop for Sender {
    fn drop(&mut self) {
        println!("Sender::drop");
        let mut sender_live = self.shared.send_update.0.lock();
        *sender_live = false;
        drop(sender_live);
        self.shared.send_update.1.notify_one();
    }
}

impl Receiver {
    fn wait_send(&self) -> bool {
        let mut sender_live = self.shared.send_update.0.lock();
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
        if self.read_pos == self.shared.write_pos.load(Ordering::SeqCst) {
            assert!(!*sender_live);
            return false;
        }
        true
    }
}

impl Iterator for Receiver {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.read_pos == self.shared.write_pos.load(Ordering::SeqCst) {
            if !self.wait_send() {
                return None;
            }
        }
        self.read_pos += 1;
        Some(42) // here we'd read the item
    }
}

#[cfg(test)]
mod tests {
    use super::channel;

    #[test]
    fn test_many() {
        let cnt = 500_000;
        let (mut tx, rx) = channel();
        std::thread::spawn(move || {
            for i in 0..cnt {
                tx.send(i as u32);
            }
        });
        assert_eq!(rx.count(), cnt);
    }
}

fn main() {}
```
The idea is that (in real code) the receiver checks whether it has caught up with the sender. If so, it calls `wait_send()` to wait either for new items to be sent, or for the sender to be dropped. `notify_one()` is invoked in only those two situations. So, barring spurious wakeups, after the call to `Condvar::wait()` either `write_pos` should have been incremented (and therefore differ from `self.read_pos`, which can't have changed), or the `sender_live` bool should have been updated to `false`. However, the assert triggers, indicating that neither has occurred.
If I change the implementation to use a loop that allows for spurious wakeups, everything works fine:
```rust
fn wait_send(&self) -> bool {
    let mut sender_live = self.shared.send_update.0.lock();
    loop {
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
    }
}
```
Of course, you could say that this works because it doesn't contain an assertion in the first place. But we can add the equivalent assertion by asserting that the loop is executed only once - and if we add it, it triggers:
```rust
fn wait_send(&self) -> bool {
    let mut sender_live = self.shared.send_update.0.lock();
    let mut iters = 0;
    loop {
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
        iters += 1;
        assert!(iters != 2);
    }
}
```
Am I using `Condvar` the wrong way, or is there a flaw in my reasoning? Or is a spurious wakeup possible after all?

Tested with Rust 1.58.0 and parking-lot 0.12.0.
One interesting data point: I cannot reproduce the panic if I remove `drop(locked)` and `drop(sender_live)`, i.e. if I notify the condvar with the mutex held. So maybe that is the source of the issue?
I consider it good practice to notify after unlocking the mutex to avoid the "pessimization" mentioned here. (And I've never had a problem with it - until now.)
While the documentation doesn't mention it either way, I now see that all the examples in the documentation of both `parking_lot::Condvar` and `std::sync::Condvar` show `notify_one` and `notify_all` called with the lock held. This is in contrast to cppreference, whose example shows `notify_one()` invoked after the lock has been released.
I still don't understand how (not) holding the lock can cause what looks like a spurious wakeup, but maybe it will help someone else understand the issue.
The spurious wakeup comes from your code because you don't hold the lock while calling `notify_one`. Consider this ordering of operations:

1. The Sender grabs the lock, increments `write_pos` and then releases the lock.
2. `Receiver::next` sees that `write_pos` has moved. It increments `read_pos` and then returns.
3. `Receiver::next` is called again. It sees `read_pos == write_pos` and blocks on the Condvar.
4. The Sender from step 1 finally gets around to calling `notify_one`.
@Amanieu Thanks for the analysis, that settles the issue.
What do you think in general about the optimization of notifying the condvar after releasing the mutex? Is that something that you'd recommend for `parking_lot::Condvar`?
parking_lot doesn't wake up the thread if the mutex associated with the Condvar is locked. The thread is instead requeued to wait on the mutex without being woken.