No reload after cas fail in check_count_threshold causing infinite loop
TroyNeubauer opened this issue · comments
I have been benchmarking haphazard to obtain data comparing multiple shard hash functions for a paper I'm writing for my statistics class, and I think I found a bug inside `check_count_threshold` in `domain.rs` that can trigger if the stars align in a concurrent program:

Lines 319 to 333 in a06cf1a
If the first CAS fails because the value was changed by another thread before being swapped, `count` is never re-read, so all future CASes will fail too, causing the current thread to loop until other threads retire enough objects for `count` to climb back to the original value loaded at the beginning of the function.
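To make the failure mode concrete, here is a minimal, self-contained sketch of the non-reloading loop. This is not haphazard's actual code; the free function and the standalone `count`/`threshold` parameters are my simplification:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Sketch of the buggy shape: `rcount` is loaded once, before the loop,
// and is never refreshed from the CAS failure.
fn check_count_threshold_buggy(count: &AtomicIsize, threshold: isize) -> isize {
    let rcount = count.load(Ordering::Acquire);
    while rcount > threshold {
        // If another thread changes `count` first, this CAS fails...
        if count
            .compare_exchange_weak(rcount, 0, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return rcount;
        }
        // ...and since `rcount` is stale, every retry compares against the
        // same dead value: the loop only exits if other threads push `count`
        // all the way back up to `rcount`.
    }
    0
}

fn main() {
    let count = AtomicIsize::new(5);
    // Uncontended, the CAS succeeds and the old count is returned.
    assert_eq!(check_count_threshold_buggy(&count, 3), 5);
    assert_eq!(count.load(Ordering::Acquire), 0);
}
```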
This matches what the folly code does, and I think it makes sense in most scenarios. Say we have two threads A and B that have called `check_count_threshold` concurrently, and A successfully CASes `count` to 0. Thread A will immediately enter `do_reclamation` with the old (probably large) value of `rcount`. The CAS on thread B will then fail, because the CAS succeeded for thread A. If thread B simply re-read the updated value from the `Err` variant and tried again, `count` would most likely be 0 or a very small number, as thread A just set it to 0. Thread B would then run `do_reclamation` too, trying to reclaim a small number of objects concurrently with A, likely leading to contention on other CASes.

I believe the folly developers did this so that thread B will spin until more objects have been retired before its CAS is allowed to succeed, and this works fine as long as no threads stop running.
My problem is that for my stats project I run many iterations of: creating a new domain, starting threads to do work, measuring the performance, and stopping the threads. So if threads A and B race to CAS `count` to 0, and A wins and then immediately exits, thread B spins forever. I'm still not completely sure whether this is a misuse of the library on my part, or a real use case for this library that should be fixed.
The folly code for `check_threshold_and_reclaim`: https://github.com/facebook/folly/blob/08d98365b5abe207b879f1369a05bd0fd67acd85/folly/synchronization/HazptrDomain.h#L390
My benchmark: https://github.com/TroyNeubauer/haphazard/blob/9f505167de65dd46114a9c080625ab8af71172ca/stats_bench/src/main.rs
This is what I have changed `check_count_threshold` in my fork to be while I write this stats paper 😅:
```rust
fn check_count_threshold(&self) -> isize {
    let mut rcount = self.count.load(Ordering::Acquire);
    while rcount > self.threshold() {
        match self
            .count
            .compare_exchange_weak(rcount, 0, Ordering::AcqRel, Ordering::Relaxed)
        {
            Ok(_) => {
                #[cfg(feature = "std")]
                self.due_time
                    .store(Self::now() + SYNC_TIME_PERIOD, Ordering::Release);
                return rcount;
            }
            Err(rcount_now) => {
                rcount = rcount_now;
            }
        }
    }
    0
}
```
C++'s `compare_exchange` takes a reference to the expected value and, on failure, stores the actual value into that reference: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

So folly's implementation matches your version.
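In Rust, by contrast, the actual value comes back in the `Err` variant rather than through an out-parameter, so the re-read has to be written explicitly. A tiny standalone illustration of the difference (the variable names are mine, not the library's):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

fn main() {
    let count = AtomicIsize::new(5);
    let mut expected = 10; // stale: another thread "changed" count under us

    // In C++, `count.compare_exchange_weak(expected, 0)` would write the
    // actual value (5) back into `expected` on failure. In Rust, the actual
    // value is returned in the Err variant and must be re-assigned by hand:
    if let Err(actual) = count.compare_exchange(expected, 0, Ordering::AcqRel, Ordering::Acquire) {
        expected = actual;
    }
    assert_eq!(expected, 5);

    // With the refreshed expectation, the retry succeeds:
    assert!(count
        .compare_exchange(expected, 0, Ordering::AcqRel, Ordering::Acquire)
        .is_ok());
    assert_eq!(count.load(Ordering::Acquire), 0);
}
```

Forgetting that re-assignment is exactly the bug reported above: the loop keeps comparing against a value that no longer matches.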
Good catch, and thanks for the clarification @tomtomjhj — I had totally missed that! It may be worth checking over our other `compare_exchange` invocations for the same issue elsewhere.
@TroyNeubauer would you mind submitting a PR with your change?
Will do! I'll look out for the other uses of `compare_exchange` too.
Fixed in #14