edouarda / brigand

Instant compile time C++ 11 metaprogramming library

sort could be faster

odinthenerd opened this issue

Ok, I admit it's my pet peeve ;) I find it interesting how combining algorithms is not efficient in runtime code but is efficient in metaprogramming. Basically the current implementation is good to about 500 elements; however, even with the fast tracks, merging 16 elements into 500 elements will be slow.

Coming back to the original partition implementation: the problem was that a bad pivot element could cause a partition of everything only to eliminate one element from the input list. If we were to build up a large list using the merge/insertion sort, split it in the middle, and then use that as a pivot, we would be eliminating at least half of the output list each partition even if we are not eliminating anything from the input list.

Another approach would be to sort chunks of, let's say, 64 elements each; "partitioning" them would then essentially be a split_if rather than a normal partition. Since we need to presort the lists for merging anyway, we aren't really losing much, and splitting a sorted list should be more efficient than a classic partition on all of its elements. This would essentially be a pure merge sort combined with a split_if.

I would really like to get to 1000 elements for kvasir.

@jonathanpoelen OMG your sort works up to several thousand elements! You are my hero!

I think the algorithm is taking most of its time in merge. The problem is that merge is hopelessly recursive; I think we need to do some partitioning to shorten merge runs.

I added a partition here https://github.com/edouarda/brigand/blob/master/brigand/algorithms/sort.hpp#L191 by splitting our 256 element sorted list and using the middle as a pivot element to partition the rest of the input as described above. The result is a 28% speed up on my machine. I think this is the right direction.

Here is the fast-tracked version: https://github.com/porkybrain/brigand/tree/merge-partition-sort

On my machine, for a 2000 (!!!) element list in random order:
old: 12.81s
new: 6.67s

Not sure yet if I broke anything; there are a lot of moving parts in this algorithm.

Wow, I wasn't aware that you had improved it that much. I tested metal several months ago and it was much slower, but that sounds quite similar to brigand and, like most of your stuff, solved in a purist fashion. I think we can slim down the brigand algorithm quite a lot.

> I wasn't aware that you had improved it that much

Since our last discussions metal's API became more stable, so I decided to investigate which algorithms would benefit most from fast tracking and it basically boiled down to join, fold and reverse (brunocodutra/metal#41).

> sounds quite similar to brigand

Take it with a grain of salt until we compare results on the same machine, because we might be comparing apples to bananas depending on how different our hardware and setups are, but at first glance it looks like metal doesn't lag too far behind brigand, which, to be honest, makes me proud.

Either your dev system is awesome or I'm doing something wrong; here is my code:

```cpp
template<typename T, typename U>
using eager_less = metal::number<(T::value < U::value)>;

using a = typename metal::sort<my_list, metal::lambda<eager_less>>::type;
```

Using brigand I get 0.55s for 300 elements and 15.26s for metal. Metal crashes my laptop at 400 elements; brigand goes to something like 3000.

I'm thinking about making a generic "chunker" which splits a list into smaller lists of a fixed size (like chunker16 or chunker256), which would allow us to fast track and still use a generic fold, and would save us from writing so damn many typenames in every algorithm.

Ok, now I'm using a list of random metal::numbers and using a = metal::sort<my_list, metal::lambda<metal::less>>, and the numbers are slightly worse than the last run.

I have been blindly using a list ordered in reverse as a worst case test for sort, but that obviously doesn't apply to merge sort, now that I take a closer look at it. Silly me, looks like I have been inadvertently cheating :c

Curiously, that means merging is the culprit, so perhaps it deserves fast tracking. I'll see what I can learn from Brigand's.

Thanks for benchmarking btw!

No problem, @ldionne made the same benchmarking mistake a while back ;) Fast tracking merge is pretty hard because it is essentially a proper fold. @jonathanpoelen's idea of making a kind of partial merge seems to be the key behind the last speedup, and it's a genius idea. Splitting the merged list and using that element as a pivot to partition the rest of the input seems to bring another speedup (can be seen in my unmerged branch), but I think that can be improved on further.

Nice to see attention to sort; it is really something I need done well for kvasir. I would love to hear original ideas!

BTW, which compiler are you using for these benchmarks?

Clang 3.7

So I was able to pin join as the culprit for the memory overflow. I even managed to rewrite it so as to reduce memory consumption by about 20-30% on gcc, but it is still prohibitively high.

If only there were a way to implement take directly, just like one implements drop, divide-and-conquer algorithms wouldn't have to depend on join (nor on any linearly recursive algorithm, for that matter).

Finally, I decided to implement merge the dumbest way possible and, to my surprise, this is what happened:

[benchmark charts: sort compile times on gcc 6.1 and clang 4.0 (sort_gcc, sort_clang)]

I finally found time to look at your solution. I think one thing you're doing that is faster than us is your merge specialization `struct _merge<ret, list<xh, xt...>, list<yh, yt...>, lambda<expr>, if_<expr<xh, yh>, true_, false_>>`, where you know that the last parameter is going to be true_. The problem is that this restricts the concept of a predicate in that it must return a metal::number. Currently (as far as I know) brigand will accept anything that has a ::value as a predicate, which is far less restrictive. I wonder if our loose definition is causing a performance hit.

I have not been able to put a whole lot of time into this, but brigand::sort could benefit from chunking the input using a meta-monad pattern, as well as joining many lists at once rather than two at a time. Join can be implemented using more aliases and less complex types as well; see metal's implementation of join for reference. The nested-alias version of conditional should help too.

Just wanted to keep the thought process going.