Optimization proposal
OlivierSohn opened this issue
Hello,
I have a suggestion to improve the performance of the queues that are dynamically sized:
Each time an element is pushed or popped, there is a call to the modulo operator to keep head and tail in the [0, size) range.
This modulo can be very fast when the size is a power of 2 known at compile time (the compiler converts it to a bitwise operation), but when the size is set at run time, the modulo operator can be expensive.
Instead of using modulo, we could simply keep head and tail in the range [0, size), i.e., reset them to 0 when they become equal to size.
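The proposed wrap-around can be sketched as follows. This is a minimal single-threaded illustration with hypothetical `head_`/`size_` names, not the library's actual members; in the concurrent queues the indexes are atomic counters, so applying the idea there needs more care.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the proposal: reset the index to 0 when it reaches
// size_ instead of computing "% size_" on every operation.
struct RingIndex {
    explicit RingIndex(uint32_t size) : size_(size) {}

    uint32_t next() {
        uint32_t i = head_;
        if (++head_ == size_) // conditional reset instead of modulo
            head_ = 0;
        return i;
    }

    uint32_t size_;
    uint32_t head_ = 0;
};
```

For a size of 3, successive calls yield 0, 1, 2, 0, 1, 2, ... without any division.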
You are right. Would you like to submit a pull request?
I'm afraid I won't have the time to do it.
There is another subject where I might be able to do a pull request but I'll open a separate issue.
Currently the queues that dynamically allocate the buffer always round the size to the next power of 2. I guess the easiest way is to replace `% size` with `& (size - 1)`.
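The equivalence behind this replacement can be demonstrated with a small standalone helper (illustrative name, not from the library):

```cpp
#include <cassert>

// For unsigned i and a power-of-2 size, "i % size" equals
// "i & (size - 1)", so a division becomes a cheap bit mask.
unsigned wrap_mask(unsigned i, unsigned size) {
    return i & (size - 1); // valid only for power-of-2 sizes
}
```

For non-power-of-2 sizes the mask gives wrong results, which is why the trick relies on the buffer size always being rounded to a power of 2.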
Yes that’s a smart way of doing it!
But actually I noticed that for a queue of size 7 the size is rounded to 4096 (or at least 1024, I don't remember exactly). I wonder whether this is expected (to occupy a full page and avoid other objects being on the same page) or not.
The `B` variants of the queue use dynamic memory allocation and their size is always rounded to the next power of 2. The non-`B` variants have a `MINIMIZE_CONTENTION` template parameter, which rounds the size to the next power of 2 when set to `true`.
I am thinking of unifying all the queue variants into one queue with a template policy argument to make the interface more straightforward. But for now, dynamic allocation is done by the `B` variants, and non-atomic and move-only types are supported by the `2` variants.
For example, with:
`AtomicQueueB2<int> q(7);`
we have `q.size_ == 4096` (it looks like the size is rounded up to a power of 2 no smaller than 4096).
You are quite right.
This currently is expected behaviour, because the `B` variants always use a power-of-2 size and remap subsequent indexes to different cache lines to minimize contention. The latter causes the minimum size of `AtomicQueueB2<int>` to be 4096 elements. For simplicity, the `B` variants do not have the `MINIMIZE_CONTENTION` template parameter that controls this behaviour.
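A size computation along these lines (purely illustrative, not the library's actual code) would explain the observed value of `q.size_`:

```cpp
#include <cassert>

// Illustrative sketch: round the requested capacity up to a power
// of 2, with a 4096-element floor matching the observed
// AtomicQueueB2<int> minimum size described above.
unsigned round_up_capacity(unsigned n, unsigned min_size = 4096) {
    unsigned p = min_size;
    while (p < n)
        p <<= 1; // keep doubling until the request fits
    return p;
}
```

With this sketch, a requested capacity of 7 yields 4096, and a request just above 4096 yields 8192.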
However, you don't want to use `AtomicQueueB2` with lock-free types, such as `int`, because there is extra overhead for handling non-atomic types. Use `AtomicQueueB` with lock-free types, unless you cannot reserve the special `NIL` value in your atomic `T`'s range that the queues for atomic types require.
Currently the queues that dynamically allocate the buffer always round the size to the next power of 2. I guess the easiest way is to replace `% size` with `& (size - 1)`.
Yes that's a smart way of doing it!
Just double-checked (I forget details rather soon). The `B` versions do exactly that, because their size is always a power of 2.
The non-`B` versions use plain `%` because `size_` is known at compile time, so the compiler can optimise the modulo into `&` for power-of-2 sizes or, otherwise, do fancier tricks to optimise the modulo by a constant.
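The compile-time case can be sketched with a template parameter standing in for the compile-time `size_` (the name `wrap` is illustrative):

```cpp
#include <cassert>

// When the divisor is a compile-time constant, the compiler can turn
// "% SIZE" into a mask for powers of 2, or strength-reduce the
// division into multiplies and shifts for other constants.
template<unsigned SIZE>
unsigned wrap(unsigned i) {
    return i % SIZE; // constant divisor: optimised by the compiler
}
```

The source stays a plain `%`; the optimisation happens entirely in code generation, which is why the non-`B` variants don't need the mask trick spelled out.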
Oh, OK, so I misread the code; I didn't see that the code I was reading was for a specialized type...
I didn't know about the remapping technique; I'll try to understand how it works from the code. I wonder whether successive elements land on successive cache lines, for example.
Well, the code grew organically, and there are a few queue versions, so it is easy to get off track. That is entirely the fault of my current API.
Have a look at `benchmarks.cc`, `struct QueueTypes`.
The queue versions are also documented in the README; see "Available containers are:".
This is mentioned in the README section NOTES; however, I do see now that it should be better documented in the code, and/or I should provide better docs with examples.
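As a rough illustration of the remapping idea (not atomic_queue's exact bit layout): with 64-byte cache lines and 4-byte `int` elements, 16 elements share a line, so swapping the low index bits with higher bits spreads consecutive logical indexes onto different cache lines.

```cpp
#include <cassert>

// Hypothetical remapping for a 4096-element buffer of 4-byte ints:
// swap the low 4 index bits with the upper 8 bits. Consecutive
// logical indexes then land 256 slots (1024 bytes) apart, i.e. on
// different cache lines, reducing contention between producers.
unsigned remap_index(unsigned i) {
    return ((i & 15u) << 8) | (i >> 4); // bijection on [0, 4096)
}
```

Under this sketch, logical indexes 0, 1, 2 map to buffer slots 0, 256, 512; the mapping is a bijection, so every slot is still used exactly once per cycle.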
Very interesting!
Actually, the way the benchmarks are done made me think of an issue I encountered once: some CPUs reduce their clock when they get hot, so I guess the first queues in the benchmark could benefit from a cooler CPU :)
I'm closing this, since it's already covered.
I run benchmarks in the following way:
[max@supernova:~/src/atomic_queue] $ sudo cpupower frequency-set --related --governor performance
[max@supernova:~/src/atomic_queue] $ ./scripts/run-benchmarks.sh
On modern Intel CPUs (Haswell and above), with adequate cooling (I use liquid) and an adequate power supply, that fixes the CPU frequency at its max turbo.
I read that AMD Zen 2 is quite different with regard to its turbo behaviour, but I don't have access to that CPU to play with it.