thread barriers need backoff
jeffhammond opened this issue · comments
This code leads to serious problems when hardware threads are oversubscribed. Adding sched_yield() to the loop body reduces the problem by one to two orders of magnitude, because it prompts the kernel to swap threads so that progress happens more quickly.
// If the current thread is NOT the last thread to have arrived, then
// it spins on the sense variable until that sense variable changes at
// which time these threads will exit the barrier.
while ( __atomic_load_n( &comm->barrier_sense, __ATOMIC_ACQUIRE ) == orig_sense )
; // Empty loop body.
I tried no-op instructions, but those do not help in the oversubscribed case because they do not trigger a context switch. Such backoffs are appropriate when memory-access contention, not oversubscription, is the issue.
This is related but complementary to #603. This is a new version of #82.
My proposed fix will allow a user to disable sched_yield(), but I assert we need it enabled in the distribution builds of BLIS, because quality of service is more important than the last bit of performance in the general case. Benchmarking use cases, where that last bit is expected to matter, can disable it.
@jeffhammond sched_yield is too heavyweight and not portable enough to be used all the time. I will update #82 with a general framework for config-specific behavior, and then we can start filling in the actual implementation.
@jeffhammond suggestions for any specific architectures?