Optimization proposal
OlivierSohn opened this issue
Hello,
I have a suggestion to improve the performance of the queues that are dynamically sized:
Each time an element is pushed or popped, there is a call to the modulo operator to keep head and tail in the [0, size) range.
This modulo can be very fast when the size is a power of 2 known at compile time (the compiler converts it to a bitwise operation), but when the size is set at run time, the modulo operator can be expensive.
Instead of using modulo, we could simply keep head and tail in the range [0, size), i.e., reset them to 0 when they become equal to size.
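The proposed wrap-around can be sketched as follows. This is a minimal single-threaded illustration with hypothetical `head_`/`size_` names, not the library's actual members; in the concurrent queues the indexes are atomic counters, so applying the idea there needs more care.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the proposal: reset the index to 0 when it reaches
// size_ instead of computing "% size_" on every operation.
struct RingIndex {
    explicit RingIndex(uint32_t size) : size_(size) {}

    uint32_t next() {
        uint32_t i = head_;
        if (++head_ == size_) // conditional reset instead of modulo
            head_ = 0;
        return i;
    }

    uint32_t size_;
    uint32_t head_ = 0;
};
```

For a size of 3, successive calls yield 0, 1, 2, 0, 1, 2, ... without any division.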
You are right. Would you like to submit a pull request?
I'm afraid I won't have the time to do it.
There is another subject where I might be able to do a pull request but I'll open a separate issue.
Currently the queues that dynamically allocate the buffer always round the size to the next power of 2. I guess the easiest way is to replace `% size` with `& (size - 1)`.
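The equivalence behind this replacement can be demonstrated with a small standalone helper (illustrative name, not from the library):

```cpp
#include <cassert>

// For unsigned i and a power-of-2 size, "i % size" equals
// "i & (size - 1)", so a division becomes a cheap bit mask.
unsigned wrap_mask(unsigned i, unsigned size) {
    return i & (size - 1); // valid only for power-of-2 sizes
}
```

For non-power-of-2 sizes the mask gives wrong results, which is why the trick relies on the buffer size always being rounded to a power of 2.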
Yes that’s a smart way of doing it!
But actually I noticed that for a queue of size 7 the size is rounded to 4096 (or at least 1024, I don't remember exactly). I wonder whether this is expected (to occupy a full page and avoid other objects being on the same page) or not.
The `B` variants of the queue use dynamic memory allocation and their size is always rounded to the next power of 2. The non-`B` variants have a `MINIMIZE_CONTENTION` template parameter, which rounds the size to the next power of 2 when set to `true`.
I am thinking of unifying all the queue variants into one queue with a template policy argument to make the interface more straightforward. But for now, dynamic allocation is done by the `B` variants, and non-atomic and move-only types are supported by the `2` variants.
For example, with:
`AtomicQueueB2<int> q(7);`
we have `q.size_ == 4096` (it looks like the size is rounded up to a power of 2 no smaller than 4096).
You are quite right.
This currently is expected behaviour, because the `B` variants always use a power-of-2 size and remap subsequent indexes to different cache lines to minimize contention. The latter causes the minimum size of `AtomicQueueB2<int>` to be 4096 elements. For simplicity, the `B` variants do not have the `MINIMIZE_CONTENTION` template parameter that controls this behaviour.
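A size computation along these lines (purely illustrative, not the library's actual code) would explain the observed value of `q.size_`:

```cpp
#include <cassert>

// Illustrative sketch: round the requested capacity up to a power
// of 2, with a 4096-element floor matching the observed
// AtomicQueueB2<int> minimum size described above.
unsigned round_up_capacity(unsigned n, unsigned min_size = 4096) {
    unsigned p = min_size;
    while (p < n)
        p <<= 1; // keep doubling until the request fits
    return p;
}
```

With this sketch, a requested capacity of 7 yields 4096, and a request just above 4096 yields 8192.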
However, you don't want to use `AtomicQueueB2` with lock-free types, such as `int`, because there is extra overhead for handling non-atomic types. Use `AtomicQueueB` with lock-free types, unless you cannot reserve the special `NIL` value in your atomic `T`'s range that the queues for atomic types require.
Currently the queues that dynamically allocate the buffer always round the size to the next power of 2. I guess the easiest way is to replace `% size` with `& (size - 1)`.
Yes that's a smart way of doing it!
Just double-checked (I forget details rather soon). The `B` versions do exactly that, because their size is always a power of 2.
The non-`B` versions use plain `%` because `size_` is known at compile time, so the compiler can optimise the modulo into `&` for power-of-2 sizes or, otherwise, do fancier tricks to optimise the modulo by a constant.
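The compile-time case can be sketched with a template parameter standing in for the compile-time `size_` (the name `wrap` is illustrative):

```cpp
#include <cassert>

// When the divisor is a compile-time constant, the compiler can turn
// "% SIZE" into a mask for powers of 2, or strength-reduce the
// division into multiplies and shifts for other constants.
template<unsigned SIZE>
unsigned wrap(unsigned i) {
    return i % SIZE; // constant divisor: optimised by the compiler
}
```

The source stays a plain `%`; the optimisation happens entirely in code generation, which is why the non-`B` variants don't need the mask trick spelled out.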
Oh, OK, so I misread the code; I didn't see that the code I was reading was for a specialized type...
I didn't know about the remapping technique; I'll try to understand how it works from the code. I wonder whether successive elements land on successive cache lines, for example.
Well, the code grew organically, and there are a few queue versions, so it is easy to get off track. That is entirely the fault of my current API.
Have a look at `benchmarks.cc`, `struct QueueTypes`.
The queue versions are also documented in the README; see "Available containers are:".
This is mentioned in the README section NOTES; however, I do see now that it should be better documented in the code, and/or I should provide better docs with examples.
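As a rough illustration of the remapping idea (not atomic_queue's exact bit layout): with 64-byte cache lines and 4-byte `int` elements, 16 elements share a line, so swapping the low index bits with higher bits spreads consecutive logical indexes onto different cache lines.

```cpp
#include <cassert>

// Hypothetical remapping for a 4096-element buffer of 4-byte ints:
// swap the low 4 index bits with the upper 8 bits. Consecutive
// logical indexes then land 256 slots (1024 bytes) apart, i.e. on
// different cache lines, reducing contention between producers.
unsigned remap_index(unsigned i) {
    return ((i & 15u) << 8) | (i >> 4); // bijection on [0, 4096)
}
```

Under this sketch, logical indexes 0, 1, 2 map to buffer slots 0, 256, 512; the mapping is a bijection, so every slot is still used exactly once per cycle.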
Very interesting!
Actually, the way the benchmarks are done made me think of an issue I encountered once: some CPUs reduce their clock when they get hot, so I guess the first queues in the benchmark could benefit from a cooler CPU :)
I'm closing this, since it's already covered.
I run benchmarks in the following way:
[max@supernova:~/src/atomic_queue] $ sudo cpupower frequency-set --related --governor performance
[max@supernova:~/src/atomic_queue] $ ./scripts/run-benchmarks.sh
On modern Intel CPUs (Haswell and above), with adequate cooling (I use liquid) and an adequate power supply, that fixes the CPU frequency at its max turbo.
I read that AMD Zen 2 is quite different with regard to its turbo behaviour, but I don't have access to that CPU to play with it.