kinghajj / deque

A (mostly) lock-free concurrent work-stealing deque in Rust.

Reduce usage of SeqCst ordering

lhecker opened this issue

An initial benchmark using coio-rs, which uses this crate to implement work-stealing coroutine scheduling, suggests that using acquire/release ordering instead of sequentially consistent ordering improves the performance of this deque by up to 700%, and the overall performance of coio by about 3%, even though the deque should play only a small part in the rest of the program (measured on OS X 10.11, i7-3770).
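For illustration, the kind of change in question looks roughly like this (a minimal sketch with made-up type and field names, not this crate's actual code):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Made-up type for illustration only.
struct Indices {
    bottom: AtomicIsize,
}

impl Indices {
    // Before: every access is sequentially consistent.
    fn load_seqcst(&self) -> isize {
        self.bottom.load(Ordering::SeqCst)
    }

    // After: a stealer only needs acquire semantics here, pairing with
    // the owner's release store of `bottom`. This avoids the full
    // memory barrier that SeqCst implies on most architectures.
    fn load_acquire(&self) -> isize {
        self.bottom.load(Ordering::Acquire)
    }
}
```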

I'd be willing to create a PR based on this gist: https://gist.github.com/Amanieu/7347121 which in turn is based on this paper: http://www.di.ens.fr/~zappa/readings/ppopp13.pdf, if you'd be OK with merging my changes and publishing a new version to crates.io. 😊

I suggest you base your work on the latest version of my code, which you can find here: https://github.com/Amanieu/asyncplusplus/blob/master/src/work_steal_queue.h

The only tricky part is that my code, which is based on the PPoPP paper, does not handle shrinking. The addition of shrinking may affect the memory ordering constraints needed in the code.

Thanks @Amanieu! Your code is quite a bit simpler than the one in this project... I hope I'll be capable of making correct changes to it. 😐

I do think that shrinking does not affect the memory ordering, since, if I understand it correctly, the swap_buffer() method in this code does not directly free the memory but instead marks it for deletion in the buffer pool, just as described in the original Chase-Lev paper. Unfortunately I cannot prove this assumption...

The problem isn't freeing the memory; it's that once a buffer is returned to the pool, it can be modified by other threads. If this happens in the middle of a steal, the steal could return an incorrect value.
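One possible interleaving of the hazard described above (all names made up for illustration):

```
stealer                         owner
-------                         -----
t   = top.load()
buf = buffer.load()
                                shrink(): pool.put(buf)
                                grow():   pool.get() returns buf again
                                buf[t] = <unrelated new value>
x = buf[t]          // reads the unrelated value
CAS(top, t, t+1)    // succeeds, so steal() returns a wrong element
```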

The original Chase-Lev paper solves this by messing with the offsets during a shrink, but the given code only works with SeqCst ordering. The PPoPP paper uses more relaxed orderings but doesn't support using a buffer pool.

I think the performance boost from avoiding SeqCst atomics would be worth removing the buffer pool and simply keeping old buffers around in a linked list, like my code does. Since the size of a buffer doubles each time it is grown, the unused buffers have a total size that is always less than the size of the current buffer. This means that memory usage is at most 2x, which I think is acceptable since the deque doesn't tend to grow very much in practice.
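For concreteness, a minimal sketch of that linked-list idea (field names and types are illustrative, not taken from any actual implementation):

```rust
// Sketch only: when the deque grows, the new buffer takes ownership of
// the old one instead of freeing it, so a stealer that still holds a
// pointer into the old buffer keeps reading valid, unchanged memory.
struct Buffer<T> {
    storage: Box<[T]>,
    // The previous, smaller buffer. It is dropped only when the deque
    // itself is dropped, never while stealers may still reference it.
    prev: Option<Box<Buffer<T>>>,
}
```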

Could you elaborate on the reason why the PPoPP paper does not support buffer pools, apart from not explicitly mentioning them (but only if it's not too time-consuming for you)? Oh, and what do you mean by "at most 2x"? Wouldn't the overhead be more like start_value / 2 * (2^n − 2) bytes after n allocations? That would be quite a bit more than just 2x, more like 2^n. 😅

Also: I saw that the PPoPP paper uses a resize(). Is there any reason why it's not also used to shrink the buffer? I mean, if it's possible to increase the buffer size in the push method, why is it not possible to shrink it there?

If you compare the code in the original paper (section 4, figures 7 & 8) with mine, you'll notice that it has extra atomic operations when shrinking and stealing. The original paper assumes that all atomic operations are SeqCst (Java volatile semantics). The PPoPP paper does not include those extra instructions, and therefore does not support reusing buffers through a pool: a buffer that is "free" must retain the same contents, since a stealer might still be accessing it.
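For reference, here is the PPoPP-paper steal transcribed to Rust atomics as a simplified, self-contained sketch (isize payload, fixed capacity, made-up field names; not this crate's code, and it collapses the paper's EMPTY/ABORT distinction into None):

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

struct Deque {
    top: AtomicIsize,
    bottom: AtomicIsize,
    buffer: Vec<AtomicIsize>, // capacity fixed here; growing is elided
}

impl Deque {
    fn steal(&self) -> Option<isize> {
        let t = self.top.load(Ordering::Acquire);
        fence(Ordering::SeqCst); // the single SeqCst fence, per the paper
        let b = self.bottom.load(Ordering::Acquire);
        if t < b {
            // Read the element *before* the CAS. The slot must not be
            // recycled in the meantime, which is exactly why reusing
            // buffers through a pool is unsafe under these orderings.
            let x = self.buffer[(t as usize) % self.buffer.len()]
                .load(Ordering::Relaxed);
            if self
                .top
                .compare_exchange(t, t + 1, Ordering::SeqCst, Ordering::Relaxed)
                .is_err()
            {
                return None; // lost the race to another stealer or the owner
            }
            return Some(x);
        }
        None // deque observed empty
    }
}
```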

Now, we could try to derive the required memory orderings for those extra instructions ourselves, but I don't feel very comfortable doing that: last time I tried, I ended up with a broken mess.

Regarding the memory usage, keeping old buffers around in a linked list requires at most 2x the memory of an "ideal" system which could free old buffers immediately. This is easy to see if you look at the allocation sizes:

If you start with a buffer of 16 entries, this is what you get:

| Current buffer size | Total size of all buffers |
| ------------------- | ------------------------- |
| 16                  | 16                        |
| 32                  | 48                        |
| 64                  | 112                       |
| 128                 | 240                       |
| 256                 | 496                       |
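Spelled out as a geometric series (with initial size s = 16, matching the table), after the n-th buffer:

```latex
\text{current}_n = s \cdot 2^{\,n-1}, \qquad
\text{total}_n = \sum_{k=0}^{n-1} s \cdot 2^{k} = s\,(2^{n} - 1)
             < 2 \cdot s \cdot 2^{\,n-1} = 2 \cdot \text{current}_n
```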

The buffers are kept in a linked list, with each buffer pointing to the smaller buffer before it. They are all freed once the deque is destroyed, so the memory isn't leaked; it just can't be reused for anything else, since another thread may still be reading data from it.

Oh, you're right... Since "current" is 16·2^(n−1) and "total" is 16·(2^n − 1), the ratio converges to (2x − 1)/x = 2 (with x = 2^(n−1)) as n → ∞. Sorry about that.

EDIT: I realized just now that freeing buffers in algorithms like this is actually quite a big problem... And why it's such a problem. So I guess that answers most of my questions.

I still wonder why reference counting buffers cannot be used though...

Reference counting would significantly hurt performance because it would require an atomic increment and an atomic decrement for each steal operation. This would be needed to indicate that you are currently accessing a buffer.

Memory reclamation for lock-free algorithms is a huge area of research. Just search for "lock-free memory reclamation" and you will get a ton of results.

I can start working on a pull request to implement all these changes if you haven't done so already.

Well, it would be only a relaxed increment, a release decrement, and an acquire fence before freeing the buffer. I think the cost of this should be significantly lower on most systems compared to "SeqCst everywhere". But after everything I learned while thinking about possible solutions, I must say: your solution of using a linked list for the buffers is really nice, and I like it for its simplicity.
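A minimal sketch of the counting scheme described above (types and names are illustrative, not a concrete proposal for this crate):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Illustration only: pin a buffer with a relaxed increment, unpin with
// a release decrement, and fence with acquire before actually freeing.
struct CountedBuffer {
    refs: AtomicUsize,
    // ... element storage elided ...
}

impl CountedBuffer {
    fn pin(&self) {
        // Relaxed is enough: only the count itself must be atomic.
        self.refs.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns true if the caller may now free the buffer.
    fn unpin(&self) -> bool {
        // Release orders this thread's buffer reads before the decrement
        // as observed by whichever thread ends up freeing the buffer.
        if self.refs.fetch_sub(1, Ordering::Release) == 1 {
            // Acquire fence pairs with the release decrements of all
            // other threads before the memory is reclaimed.
            fence(Ordering::Acquire);
            return true;
        }
        false
    }
}
```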

On x86, atomic add/sub is always sequentially consistent because it uses the lock xadd instruction. On other platforms, such as ARM, atomic read-modify-write instructions are significantly more expensive than simple atomic loads/stores, which are implemented as a normal load/store plus an optional memory barrier.

I see… I didn't know about those differences and simply assumed that the memory ordering settings of the atomic APIs provide only the chosen guarantees, especially since the performance characteristics of loads/stores seemingly matched my assumptions. Again: thanks for enlightening me! 🙂
I think I'll really start reading more deeply into the characteristics of atomics as soon as possible… 😅