Possible spurious wakeup with Condvar
hniksic opened this issue · comments
The documentation of `Condvar` says that there are no spurious wakeups. While working on a custom SPSC channel, I seem to have discovered that spurious wakeups occur. I believe I have managed to extract the issue into a minimal example which should (I hope) be easy to follow.
On this example, `cargo test --release` triggers the assertion in `wait_send` most of the time:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

use parking_lot::{Condvar, Mutex};

#[derive(Debug)]
struct Shared {
    write_pos: AtomicU64,
    // bool - whether sender is still live
    send_update: (Mutex<bool>, Condvar),
}

#[derive(Debug)]
pub struct Sender {
    shared: Arc<Shared>,
}

#[derive(Debug)]
pub struct Receiver {
    read_pos: u64,
    shared: Arc<Shared>,
}

pub fn channel() -> (Sender, Receiver) {
    let shared = Arc::new(Shared {
        write_pos: AtomicU64::new(0),
        send_update: (Mutex::new(true), Condvar::new()),
    });
    let sender = Sender {
        shared: Arc::clone(&shared),
    };
    let receiver = Receiver {
        read_pos: 0,
        shared,
    };
    (sender, receiver)
}

impl Sender {
    pub fn send(&mut self, _item: u32) {
        let write_pos = self.shared.write_pos.load(Ordering::SeqCst);
        // here we'd write the item
        let locked = self.shared.send_update.0.lock();
        self.shared.write_pos.store(write_pos + 1, Ordering::SeqCst);
        drop(locked);
        self.shared.send_update.1.notify_one();
    }
}

impl Drop for Sender {
    fn drop(&mut self) {
        println!("Sender::drop");
        let mut sender_live = self.shared.send_update.0.lock();
        *sender_live = false;
        drop(sender_live);
        self.shared.send_update.1.notify_one();
    }
}

impl Receiver {
    fn wait_send(&self) -> bool {
        let mut sender_live = self.shared.send_update.0.lock();
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
        if self.read_pos == self.shared.write_pos.load(Ordering::SeqCst) {
            assert!(!*sender_live);
            return false;
        }
        true
    }
}

impl Iterator for Receiver {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.read_pos == self.shared.write_pos.load(Ordering::SeqCst) {
            if !self.wait_send() {
                return None;
            }
        }
        self.read_pos += 1;
        Some(42) // here we'd read the item
    }
}

#[cfg(test)]
mod tests {
    use super::channel;

    #[test]
    fn test_many() {
        let cnt = 500_000;
        let (mut tx, rx) = channel();
        std::thread::spawn(move || {
            for i in 0..cnt {
                tx.send(i as u32);
            }
        });
        assert_eq!(rx.count(), cnt);
    }
}

fn main() {}
```
The idea is that (in real code) the receiver checks whether it has caught up with the sender. If so, it calls `wait_send()` to wait either for new items to be sent, or for the sender to be dropped. `notify_one()` is invoked in only those two situations. So, barring spurious wakeups, after the call to `Condvar::wait()` either `write_pos` should have been incremented (and therefore differ from `self.read_pos`, which can't have changed), or the `sender_live` bool should have been updated to `false`. However, the assert triggers, indicating that neither has occurred.
If I change the implementation to use a loop that allows for spurious wakeups, everything works fine:
```rust
fn wait_send(&self) -> bool {
    let mut sender_live = self.shared.send_update.0.lock();
    loop {
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
    }
}
```
Of course, you could say that this works because it doesn't contain an assertion in the first place. But we can add the equivalent assertion by asserting that the loop is executed only once - and if we add it, it triggers:
```rust
fn wait_send(&self) -> bool {
    let mut sender_live = self.shared.send_update.0.lock();
    let mut iters = 0;
    loop {
        if self.read_pos != self.shared.write_pos.load(Ordering::SeqCst) {
            return true;
        }
        if !*sender_live {
            return false;
        }
        self.shared.send_update.1.wait(&mut sender_live);
        iters += 1;
        assert!(iters != 2);
    }
}
```
Am I using `Condvar` the wrong way, or is there a flaw in my reasoning? Or is a spurious wakeup possible after all?

Tested with Rust 1.58.0 and parking-lot 0.12.0.
One interesting data point: I cannot reproduce the panic if I remove `drop(locked)` and `drop(sender_live)`, i.e. if I notify the condvar with the mutex held. So maybe that is the source of the issue?
I consider it good practice to notify after unlocking the mutex to avoid the "pessimization" mentioned here. (And I've never had a problem with it - until now.)
While the documentation doesn't mention it either way, I now see that all the examples in the documentation of both `parking_lot::Condvar` and `std::sync::Condvar` show `notify_one` and `notify_all` called with the lock held. This is in contrast to cppreference, whose example shows `notify_one()` invoked after the lock has been released.
I still don't understand how (not) holding the lock can cause what looks like a spurious wakeup, but maybe it will help someone else understand the issue.
The spurious wakeup comes from your code because you don't hold the lock while calling `notify_one`. Consider this ordering of operations:

1. The Sender grabs the lock, increments `write_pos` and then releases the lock.
2. `Receiver::next` sees that `write_pos` has moved. It increments `read_pos` and then returns.
3. `Receiver::next` is called again. It sees `read_pos == write_pos` and blocks on the Condvar.
4. The Sender from step 1 finally gets around to calling `notify_one`.
@Amanieu Thanks for the analysis, that settles the issue.
What do you think in general about the optimization of notifying the condvar after releasing the mutex? Is that something that you'd recommend for `parking_lot::Condvar`?
parking_lot doesn't wake up the thread if the mutex associated with the Condvar is locked. The thread is instead requeued to wait on the mutex without being woken.