jonhoo / haphazard

Hazard pointers in Rust.

No reload after cas fail in check_count_threshold causing infinite loop

TroyNeubauer opened this issue

I have been benchmarking haphazard to gather data comparing several shard hash functions for a paper I'm writing for my statistics class, and I think I've found a bug in check_count_threshold in domain.rs that can trigger when the stars align in a concurrent program:

haphazard/src/domain.rs

Lines 319 to 333 in a06cf1a

    fn check_count_threshold(&self) -> isize {
        let rcount = self.count.load(Ordering::Acquire);
        while rcount > self.threshold() {
            if self
                .count
                .compare_exchange_weak(rcount, 0, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                self.due_time
                    .store(Self::now() + SYNC_TIME_PERIOD, Ordering::Release);
                return rcount;
            }
        }
        0
    }

If the first CAS fails because another thread changed the value before the swap, count is never re-read, so every subsequent CAS fails as well, and the current thread loops until other threads retire enough objects for count to climb back to the value loaded at the beginning of the function.
This matches what the folly code does, and I think it makes sense in most scenarios. Say two threads, A and B, call check_count_threshold concurrently, and A successfully CASes count to 0. Thread A immediately enters do_reclamation with the old (probably large) value of rcount, and the CAS on thread B fails because A's CAS won. If thread B simply re-read the updated value from the Err variant and tried again, count would most likely be 0 or very small, since thread A just set it to 0. Thread B would then run do_reclamation as well, trying to reclaim a small number of objects concurrently with A, likely leading to contention on other CASes.

I believe the folly developers did this so that thread B spin-loops until more objects have been retired before its CAS is allowed to succeed, and that works fine as long as no threads stop running.
My problem is that my stats project runs many iterations of: create a new domain, start threads to do work, measure the performance, and stop the threads. So if threads A and B race to CAS count to 0, A wins and then immediately exits, thread B spins forever. I'm still not completely sure whether this is a misuse of the library on my part or a real use case for this library that should be fixed.
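
To make the failure mode concrete, here is a minimal standalone sketch. It does not use haphazard itself; the counter, the THRESHOLD constant, and the bounded retry loop are stand-ins for illustration. It shows a CAS loop that keeps retrying with a stale snapshot after another thread has already zeroed the counter:

    use std::sync::atomic::{AtomicIsize, Ordering};
    use std::sync::Arc;
    use std::thread;

    // Stand-in for self.threshold().
    const THRESHOLD: isize = 10;

    fn main() {
        // Stand-in for Domain::count after many retires.
        let count = Arc::new(AtomicIsize::new(1000));

        // Thread B's snapshot, like the single load in check_count_threshold.
        let rcount = count.load(Ordering::Acquire);

        // Thread A wins the race, zeroes the counter, and exits
        // (like a benchmark thread being torn down).
        let a = {
            let count = Arc::clone(&count);
            thread::spawn(move || {
                let _ = count.compare_exchange(1000, 0, Ordering::AcqRel, Ordering::Relaxed);
            })
        };
        a.join().unwrap();

        // Thread B keeps CASing with the stale snapshot. Bounded here so the
        // example terminates; the real loop has no bound and spins until other
        // threads retire enough objects to bring count back up to rcount.
        let mut attempts = 0;
        while rcount > THRESHOLD && attempts < 5 {
            if count
                .compare_exchange_weak(rcount, 0, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            attempts += 1;
        }
        println!(
            "gave up after {attempts} failed CASes; count is {}",
            count.load(Ordering::Acquire)
        );
    }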

The folly code for check_threshold_and_reclaim: https://github.com/facebook/folly/blob/08d98365b5abe207b879f1369a05bd0fd67acd85/folly/synchronization/HazptrDomain.h#L390
My benchmark: https://github.com/TroyNeubauer/haphazard/blob/9f505167de65dd46114a9c080625ab8af71172ca/stats_bench/src/main.rs
This is what I've changed check_count_threshold to in my fork while I write this stats paper 😅:

    fn check_count_threshold(&self) -> isize {
        let mut rcount = self.count.load(Ordering::Acquire);
        while rcount > self.threshold() {
            match self
                .count
                .compare_exchange_weak(rcount, 0, Ordering::AcqRel, Ordering::Relaxed)
            {
                Ok(_) => {
                    #[cfg(feature = "std")]
                    self.due_time
                        .store(Self::now() + SYNC_TIME_PERIOD, Ordering::Release);
                    return rcount;
                }
                Err(rcount_now) => {
                    // The failed CAS reports the value it actually observed;
                    // retry with that instead of the stale snapshot.
                    rcount = rcount_now;
                }
            }
        }
        0
    }
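
For what it's worth, the same re-read-before-retry behavior can also be written with AtomicIsize::fetch_update, which reloads the current value on every attempt. This is only a sketch of that alternative shape, reusing the fields and orderings from the snippet above; it's not what I'm actually proposing:

    fn check_count_threshold(&self) -> isize {
        // fetch_update re-reads the current count before every retry, so a
        // failed CAS can never get stuck on a stale snapshot.
        let swap = self.count.fetch_update(Ordering::AcqRel, Ordering::Acquire, |rcount| {
            (rcount > self.threshold()).then_some(0)
        });
        match swap {
            Ok(rcount) => {
                #[cfg(feature = "std")]
                self.due_time
                    .store(Self::now() + SYNC_TIME_PERIOD, Ordering::Release);
                rcount
            }
            // The freshly read count was at or below the threshold.
            Err(_) => 0,
        }
    }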

C++'s compare_exchange takes a reference to the expected value and, on failure, stores the actually observed value through that reference: https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange

So folly's implementation matches your version.
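
For reference, a minimal standalone snippet (not from the library) showing how Rust surfaces that same information through the Err variant instead of writing it back through the expected reference:

    use std::sync::atomic::{AtomicIsize, Ordering};

    fn main() {
        let count = AtomicIsize::new(5);
        // We expect 10, so the exchange fails; the Err variant carries the
        // value that was actually observed (5), playing the same role as the
        // updated `expected` reference in C++.
        match count.compare_exchange(10, 0, Ordering::AcqRel, Ordering::Acquire) {
            Ok(prev) => println!("swapped; previous value was {prev}"),
            Err(actual) => println!("CAS failed; count is actually {actual}"),
        }
    }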

Good catch, and thanks for the clarification @tomtomjhj — I had totally missed that! It may be worth checking our other compare_exchange invocations for the same issue elsewhere.

@TroyNeubauer would you mind submitting a PR with your change?

Will do! I'll look out for the other uses of compare_exchange too

Fixed in #14