haskell / containers

Assorted concrete container types

Home Page: https://hackage.haskell.org/package/containers

Improve powerSet performance

treeowl opened this issue

Obviously, there's only so much we can do, but we can do some. The most obvious optimization is to use a version of insertMin that takes a singleton tree as an argument instead of an element. But I can't help wondering if we can do better. The problem is obviously $\Omega(2^n)$, and our solution is $O(2^n\log n)$. Can we close that gap by either improving our solution or proving a tight(er) lower bound?
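
For reference, here is a naive specification of the operation using only the public Data.Set API (a sketch to pin down what we're computing, not the containers implementation; note it even needs an Ord constraint that the real powerSet avoids):

import Data.Set (Set)
import qualified Data.Set as S

-- Reference semantics only: for each element, take every subset built so far
-- and either add the element or don't. Asymptotically worse than the real
-- implementation because of the repeated maps and unions.
powerSetSpec :: Ord a => Set a -> Set (Set a)
powerSetSpec = S.foldr step (S.singleton S.empty)
  where
    step x pxs = S.map (S.insert x) pxs `S.union` pxs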

@meooow25 , I started rambling about this in comments on your PR. Feel very welcome to think about it.

We can certainly improve the space requirement for the result to $O(2^n)$, though I don't yet know whether or not we can do so without making the time blow up. In particular, each set in the result with $k \ge 3$ elements can be built by combining two sets in the result of equal or nearly equal size with an element. So the trick is to get the right sets and the right element where we need them when we need them.
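
Concretely, with the bin smart constructor from Data.Set.Internal (the same one the code later in this thread uses), that combining step is $O(1)$ once the two halves exist. A sketch:

import Data.Set.Internal (Set, bin)
import qualified Data.Set as S

-- {1,2,3,4,5} assembled from two already-built, (nearly) equal-sized members
-- of the result plus a root element; the point is to share those sub-results
-- rather than rebuild them.
example12345 :: Set Int
example12345 = bin 3 (S.fromList [1,2]) (S.fromList [4,5])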

Continuing to think out loud: imagine we build all the singletons first, then the 2-sets, then the 3-sets, etc. When we're working on the $k$-sets, we can hold on to the collections of $\lfloor k/2 \rfloor$-sets and $\lceil k/2 \rceil$-sets. The question is, how do we know which one to combine with which, and with which root element? I've drawn out the first few stages of this process, but it's hard to draw enough of it to look for patterns because the diagram grows exponentially! In case it helps anyone, here's a diagram for powerSet [1..5]. Each starting column represents a set size, and sets are lexicographically ordered from top to bottom.

[]
  [1]
    [1,2]
      [1,2,3]
        [1,2,3,4]
          [1,2,3,4,5]
        [1,2,3,5]
      [1,2,4]
        [1,2,4,5]
      [1,2,5]
    [1,3]
      [1,3,4]
        [1,3,4,5]
      [1,3,5]
    [1,4]
      [1,4,5]
    [1,5]
  [2]
    [2,3]
      [2,3,4]
        [2,3,4,5]
      [2,3,5]
    [2,4]
      [2,4,5]
    [2,5]
  [3]
    [3,4]
      [3,4,5]
    [3,5]
  [4]
    [4,5]
  [5]

For the next ... while ... I will assume that we're taking the power set of [1..n]. That can be generalized by tracking the position (in the original set) of the last element of a subset. Suppose we're building from columns j and k (which will either be identical or adjacent), so that each element we build will look like p ++ [e] ++ q, where p is from j and q is from k. We walk p through j from its beginning up until we reach a point where the largest element of p is too big to be able to choose e and still have enough room to take q. For each p, we need to be able to find the matching qs, and we have to do it efficiently. I don't think we can afford to walk k from the start for each p. So let's walk k once to break it up into lists by first element, and put those lists in an array indexed by first element. Now we can quickly find where the qs start for any given p.
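
A sketch of that bucketing step, with subsets represented as ascending lists of positions in [1..n] (bucketByHead is a made-up helper name, not proposed API):

import Data.Array (Array, accumArray)

-- Group a column of subsets by first element, preserving the lexicographic
-- order within each bucket, so that for a given p and root e we can jump
-- straight to the qs whose first element exceeds e.
bucketByHead :: Int -> [[Int]] -> Array Int [[Int]]
bucketByHead n col =
  accumArray (flip (:)) [] (1, n) [ (h, q) | q@(h : _) <- reverse col ]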

The hardest remaining part seems to be fitting the collections of elements of each size together into a single Set. There's that wobbling back and forth in size. We have to know when to zig and when to zag.

@ekmett Do you have any ideas, either in the direction I'm pointing or another one?

I don't fully understand your last comment, but I can follow the ones before it. The diagram is also interesting.

Here has been my line of thought:

The current algorithm isn't useful in getting to $O(2^n)$ because of insertMin. If that were $O(1)$, we would be good. So this works in optimal time for lists, for instance, but not sets.

To construct sets in $O(1)$, we need left and right subtrees that are balanced wrt each other and the root element.
We can have an array :: Array Int (Set a) where the index is a bitmask indicating which elements of the original set are in the subset. Then we can build the set for a mask by finding the middle set bit, taking that as the root, then indexing into the array with the lower and upper parts of the mask for the left and right subtrees.

If we find the middle bit by looping over the positions, we get $O(n 2^n)$ time but $O(2^n)$ memory which is nice.
If we instead binary search for the middle bit, given we have something like $O(1)$ popCount, we are back to $O(2^n \log n)$ time but still with $O(2^n)$ memory.
I don't know of a way to find the middle bit in $O(1)$, if we could do that we would be golden.

It's actually not difficult.

middleBit 1 = 0
middleBit n
  | even n    = middleBit (n `div` 2) + 1
  | otherwise = middleBit (n - 1)

We can memoize for $O(1)$.

So that's at least one way to build the power set in $O(2^n)$.
But I overlooked the fact that we need to convert from the mask order to the sorted order of sets. Trying to figure that out.

I don't understand what you mean about a middle bit.

A bit more explanation of where I was heading: For each suitable left child, we choose one or more suitable roots. For each such root, we choose one or more right children. For instance, suppose we're walking the collection of 2-sets and looking at the collection of 3-sets to form the collection of 6-sets (we could actually do it either way, depending on whether we want our trees to be heavier on the left or on the right). If we've walked up to [1,3] in the collection of 2-sets, then we need to consider, in turn, each root 4, 5, .... We can't use too big a root, because we need to have at least three greater elements in the right child, so we can stop considering roots once they get too big. Suppose we are at root 5, so we have a left subtree of [1,3] and a root of 5. We now need to use, as right children, all the 3-trees whose first element is at least 6. My array concept was intended to make it cheap to jump to those 3-trees, though I'm not sure if it's theoretically necessary.
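
Spelled out naively over subsets-as-lists, the enumeration being described is roughly the following (a reference sketch only; combineColumns is a hypothetical helper, and its inner filter over ks is exactly the re-scanning that the per-first-element array is meant to avoid):

-- All (j+1+k)-subsets of [1..n] in lexicographic order, built from the
-- column js of j-subsets and the column ks of k-subsets (both already in
-- lexicographic order): pick a left part p, then each permissible root e,
-- then each right part q starting after e. Deliberately inefficient.
combineColumns :: Int -> [[Int]] -> [[Int]] -> [[Int]]
combineColumns n js ks =
  [ p ++ [e] ++ q
  | p <- js
  , e <- [last (0 : p) + 1 .. n]
  , q <- ks
  , null q || head q > e
  ]

For example, with js and ks both set to the ten 2-subsets of [1..5], combineColumns 5 js ks returns just [[1,2,3,4,5]], matching the single 5-element row of the diagram above.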

Ah, the pattern is obvious now that I see it. Silly me! We always go up by one in length after a subset that does not include the maximum value. We always go down by one in length otherwise. So once we have all the subsets, getting them in order is actually easy. We can make them a set using fromDistinctAscList (or a version thereof that knows the size in advance). Is there a better way to form the set in the end? Maybe, but let's get to something that works first and think about constant factors (and GC effects) later.

Thanks for the explanation, I can see how you want to construct the sets, but I don't understand how it all ties together. Perhaps some code/pseudocode can make things clearer.

I don't understand what you mean about a middle bit.

The middle set bit of 0b11011001 is bit 4, for example (0-indexed). To construct the set for this mask, we take element 4 as the root, the set we constructed for 0b1001 as the left subtree and that for 0b11000000 as the right subtree.

Here is my idea implemented. Not benchmarked or optimized.

{-# LANGUAGE ScopedTypeVariables #-} -- for the inner "UA.Array Int a" annotation

import Data.Bits (bit, xor, (.&.), popCount, finiteBitSize, countLeadingZeros)
import qualified Data.Array.Unboxed as UA
import qualified Data.List as L
import Data.Set.Internal

-- O(2^n)
powerSet :: forall a. Set a -> Set (Set a)
powerSet xs0 = fromDistinctAscList (L.map (sets UA.!) sortedMasks)
  where
    n = size xs0
    xs = UA.listArray (0, n - 1) (toList xs0) :: UA.Array Int a

    -- TODO: Use unboxed array for better constant factor
    middleBit :: UA.Array Int Int
    middleBit = UA.listArray (1, bit n - 1) (L.map f [1 .. bit n - 1])
      where
        f 1 = 0
        f msk
          | even msk            = (middleBit UA.! (msk `quot` 2)) + 1
          | even (popCount msk) = middleBit UA.! (msk - 1)
          | otherwise           = let x = middleBit UA.! (msk - 1)
                                      -- return the next lowest set bit after x
                                      msk' = msk .&. (bit x - 1)
                                  in finiteBitSize msk' - 1 - countLeadingZeros msk'

    sets :: UA.Array Int (Set a)
    sets = UA.listArray (0, bit n - 1) (L.map f [0 .. bit n - 1])
      where
        f 0 = Tip
        f msk = bin x (sets UA.! mskl) (sets UA.! mskr)
          where
            m = middleBit UA.! msk
            x = xs UA.! m
            mskl = msk .&. (bit m - 1)
            mskr = msk `xor` bit m `xor` mskl

    -- TODO: Use an unboxed array for better constant factor
    sortedMasks :: [Int]
    sortedMasks = 0 : L.foldl' step [] [n-1, n-2 .. 0]
      where
        step msks i = bit i : L.map (bit i +) msks ++ msks
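
A quick way to sanity-check this against the existing implementation (a hypothetical test harness; the module name MaskPowerSet is made up here, and note that (==) on Set compares contents only, so it says nothing about how well the inner trees are balanced):

import qualified Data.Set as S
import MaskPowerSet (powerSet)  -- the implementation above

main :: IO ()
main = print (and checks)
  where
    checks = [ powerSet s == S.powerSet s
             | k <- [0 .. 12 :: Int]
             , let s = S.fromList [1 .. k]
             ]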

It's fascinating to follow your discussion, but, forgive me, where's the use case?

Quite unscientifically, I checked the (approx. 36,000) *.hs files that I got from unpacking $HOME/.cabal/packages/**/*.tar.gz (just what I happen to have lying around) - the only mention of powerSet is its implementation (and tests) in containers.

Another quick experiment shows

ghci> foldMap (Sum . length) $ S.powerSet $ S.fromList [1..20::Int]
Sum {getSum = 10485760}
(0.99 secs, 483,502,672 bytes)

ghci> foldMap (Sum . length) $ L.subsequences [1..20::Int]
Sum {getSum = 10485760}
(0.29 secs, 278,620,464 bytes)

I am reading this as: with the current implementation, tree balancing doubles space and triples time, w.r.t. no balancing (and no trees at all). And that's fine?

We really can't apply powerSet to anything substantially larger, because then it wouldn't fit in memory? (But it has to, since the outer tree, and all inner trees, are strict?)

Perhaps something Set a -> [Set a] (producing a lazy list) is more useful? The first candidate is map S.fromList . L.subsequences. (And their "maximally lazy" order is nice.)

Well, there's always the application as an exercise in a data structures course, so by all means, go ahead...

I agree that there's not much utility to this, personally I've just been treating this as a puzzle 😁

We really can't apply powerSet to anything substantially larger, because then it wouldn't fit in memory?

Probably so in practice. The formal limit for the input set size can be considered to be word_size - 1 since the output size must not exceed maxBound :: Int.

@jwaldmann This is absolutely just a puzzle. IIRC, I added powerSet to Data.Set because @ekmett thought it should be there, and offered the simple implementation we currently use. Something Set -> [Set] would certainly be accepted if there's a use for it and a good implementation. Since it's not a "fundamental operation on mathematical sets", I don't think it can slip in under the door like powerSet.

Going back to powerSet: Dropping from $O(2^n \log n)$ space to $O(2^n)$ space (with optimal sharing) should buy usability for a few extra input elements. On the time side, the logarithmic factor probably doesn't matter, so if someone has an $O(2^n)$ space/ $O(2^n \log n)$ time algorithm that actually runs faster than whatever $O(2^n)$ time algorithm we find, we can use the "theoretically slower" one instead.

@meooow25 Your implementation looks interesting, but I can't make head or tail of how it works. Could you explain? How does it achieve lexicographical order? I also noticed a type error: unboxed arrays can't hold Sets.

Wait, now I'm confused... How does that typecheck?

@meooow25 Your implementation doesn't pass the test suite, unfortunately. I'm still very confused about how it passes the type checker.

To clarify, it doesn't pass the test suite once the test has been changed as in #892. That's (at least partly) because the subsets you produce are totally unbalanced—just lists.

How does that typecheck?

I didn't actually use unboxed arrays here (Array vs UArray). Perhaps the comment was confusing; I meant it as a TODO: use an unboxed array later on.

the subsets you produce are totally unbalanced—just lists

Oops, the middle bit logic was at fault. Updated, but it's a little more complex now.

How does it achieve lexicographical order?

That's the sortedMasks part, which is the same as the current algorithm, only it constructs a list of bitmasks instead of a set of sets. Is there any other part you'd like me to explain?

Yes, the whole thing? I don't understand the shape of it. Could you maybe open a PR with lots of comments? I'm a comment maximalist.

I don't want to open a PR without actually testing that it is an improvement.

I'll try to explain with an example:
Consider that the input set is {0,1,2}.
Then we make an array sets to hold the power set, indexed by bitmasks 0b000 to 0b111. The bits in a mask indicate what elements are in the set.
sets = array (0b000,0b111) [(0b000,{}), (0b001,{0}), (0b010,{1}), (0b011,{0,1}), ... (0b111, {0,1,2})]
To build these sets in constant time for any mask, we find the middle set bit, take the corresponding element as root, and pick the left and right subtrees out of sets by indexing with the masks we want (as I described in the comment before).
For example,
sets ! 0b111 = bin 1 (sets ! 0b001) (sets ! 0b100) = bin 1 {0} {2} = {0,1,2}
sets ! 0b101 = bin 2 (sets ! 0b001) (sets ! 0b000) = bin 2 {0} {} = {0,2}
Now that we have all the sets, we need to convert them to lexicographical order. We use the current powerSet algorithm, which does build sets in the right order, to get the right order of bitmasks [0b000, 0b001, 0b011, 0b111, 0b101 ... 0b100]. Then we can pick the corresponding sets out of sets to get [{}, {0}, {0,1}, {0,1,2}, {0,2}, ... {2}], and build them into the output set with fromDistinctAscList.

@meooow25 FWIW, I would prefer to get a PR, even before perf testing. That'll sic CI on it, and make it easier for me (or others) to play with the implementation. I'm going to try to write an implementation based on my own idea to see how that compares, but I won't delay merging yours to do so. I suspect a lot of performance issues will come down to how badly we trash the generational hypothesis, which is mostly about how much long-lived intermediate structure we build. As I said, I don't understand your way yet. Mine (to the extent I've elaborated it so far) does not seem to have a good answer for that.

OK, puzzle time.

how much long-lived intermediate structure we build

As far as I understand, the only intermediate thing should be this "array of sets, indexed by bitmasks"? Just the array itself - its contents are permanent (trees that are reachable from the resulting `Set (Set a)`).

convert them to lexicographical order. ...
We use the current powerSet algorithm,

You mean, an equivalent function that directly works on bitmasks? There must be something in Hacker's Delight, HAKMEM, or https://graphics.stanford.edu/~seander/bithacks.html Then use that for backpermute or similar. But that's two extra arrays. But that does not matter much (w.r.t. total amount of output data)?

It's fascinating to follow your discussion, but, forgive me, where's the use case?

The primary use case I have (and what was the original motivation behind the instance) is that this algorithm is useful when constructing DFAs from NFAs. I have existing code for that, even if it's somewhat slow and off Hackage, buried in the ekmett/coda repo.

In my particular case I wound up using a Lazy version of the Data.Set library to achieve that, but it also led to the original code for computing sums and products of sets.

Hi,

constructing DFAs from NFAs

Sure - but one would want to build only the accessible part of the automaton? Something like https://gitlab.imn.htwk-leipzig.de/autotool/all0/-/blob/master/fa/src/Autolib/NFA/Det.hs#L30 (note: hull). Since you mention "lazy", perhaps that has the same effect.

Is there a (Haskell) high-performance automata library somewhere? Mine certainly is not. I think that LTL model checkers have such code (for omega automata), e.g., https://spot.lre.epita.fr/

[EDIT] this looks nice https://hackage.haskell.org/package/automata-0.1.0.0/docs/src/Automata.Internal.html#toDfsa (and it also traverses only reachable states, by my reading)

But anyway - building the powerset is a nice puzzle problem.

So I benchmarked my implementation, not very thoroughly, and it does worse than the current algorithm. Which is not too surprising.

As far as I understand, the only intermediate thing should be this "array of sets, indexed by bitmasks"?

I believe the middleBit array would also stick around as long as sets, since the sets in sets depend on it.

There must be something in Hacker's Delight, HAKMEM, or https://graphics.stanford.edu/~seander/bithacks.html

Thanks for the pointers, I just looked them up but they seem to have nothing about this. I did find a relevant post by searching: https://www.keithschwarz.com/binary-subsets/, but their algorithm is no better than ours.

I will hack on my algorithm's constant factor later and make a PR if it's promising.
I'm also interested to see @treeowl's approach realized.

@meooow25 That's disappointing; I'd have expected an automatic performance boost from just reducing the result size. Two things to consider:

  1. Make sortedMasks a right fold instead of a left fold. I'm not sure that'll help, but it probably won't hurt.
  2. More significantly, don't use fromDistinctAscList. You're probably better off leaving out the empty set (to add later) and writing a custom function to build a full tree of a known size.

One more question: aside from gauge benchmarks, did you try informal ones that calculate just one power set? That should let you use bigger inputs.

a custom function to build a full tree of a known size.

this sounds useful in general? (fromDistinctAscListN, like https://hackage.haskell.org/package/vector-0.13.0.0/docs/Data-Vector.html#v:fromListN)

I agree. How should that work? Make a canonical skew binary representation and then merge the digits?

Well, you said "writing a custom function" so you must have thought of something :-) But that might be easier for N = a power of two (what we need here) than for the general case. But then - we could just build a Braun tree? (Okasaki, JFP 1997, https://doi.org/10.1017/S0956796897002876). We couldn't be any more balanced than that?

@jwaldmann Good point. While the algorithm Okasaki gives is for the wrong order, the one we need is much simpler.
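
For concreteness, a minimal sketch of what such a function could look like, using the fromDistinctAscListN name floated above and assuming access to Data.Set.Internal (it builds a perfectly size-balanced tree from the first n elements of a strictly ascending list; not proposed as the final API):

import Data.Set.Internal (Set (Tip), bin)

fromDistinctAscListN :: Int -> [a] -> Set a
fromDistinctAscListN n xs0 = fst (go n xs0)
  where
    -- go k xs builds a tree from the first k elements of xs and returns the
    -- leftover list; left and right subtree sizes differ by at most one.
    go 0 xs = (Tip, xs)
    go k xs =
      let l               = (k - 1) `div` 2
          (left, x : xs') = go l xs
          (right, xs'')   = go (k - 1 - l) xs'
      in (bin x left right, xs'')

For the power-set use case one could, as suggested above, leave out the empty set, build the full tree of the remaining $2^n - 1$ subsets, and add the empty set back afterwards.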

@meooow25 I've been working on this yesterday and today. Aside from the parts that are tricky, there are also some subtleties I'm trying to work through regarding how to arrange things for best performance. One of the subtler ones is the order in which the final set should be constructed, which affects what arrays get constructed and freed when. Another tricky bit that may factor in: since the power set is (presumably) to be used as a Set, it's desirable for the subsets to get smaller the closer they are to the root of the power set. That way, doing something like deleting an element from the power set is as cheap as possible. Can that desire be reconciled with whatever is the most efficient construction order? Etc. I'm going to attempt to do whatever seems simplest for the moment and then revisit it. There are also a lot of fiddly bits for my limited brain to keep track of. If I don't get it all figured out tomorrow I'll put up the code I have so far with explanations and see if maybe you can take a crack at finishing it.


Regarding your version, I suspect the following issues are behind (at least some of) your performance problems:

  1. The msks you use in sortedMasks is going to end up being shared by its two use sites, which might sound good but is actually very bad—something like half this list will end up actually realized in memory. It would be much better to find a different way to calculate sortedMasks; is there some pattern you can exploit from a different direction? I suspect that fixing this could make your solution competitive with (and quite likely better than) the current version.
  2. sets is an utterly gargantuan array, which means that simply constructing it is quite expensive. The RTS will first allocate the array and fill it with undefineds; slow as that is, I think it's fine, and there's unlikely to be a good solution that doesn't pay that cost. The next step is another story: your program will allocate a thunk representing each subset and store it in the array. As usual, allocating an enormous number of long-lived objects (even rather small ones) is hard on the GC, so even this rather boring listArray will likely take quite a while for a "large" set. Once it's allocated, you'll walk the sortedMasks and thunks will get forced and so on. A (smaller?) problem happens on the other end; none of the array structure itself can be freed until you're all done.

OK. So ... here's the current general status of my thingy. I'll go roughly from the top down.

The final merge

Suppose we have a list of length $n+1$ of lists, where the $k$th list (starting at 0) contains the subsets of size $k$, in order. So for the set $\{1,2,3\}$, this list would be

biglists123 =
[ [[]]
, [[1], [2], [3]]
, [[1,2], [1,3], [2,3]]
, [[1,2,3]] ]

How do we merge them in order? Define the following:

data SmallerBigger
  = Smaller
  | Bigger
  deriving Show

enddir :: Int -> [SmallerBigger]
enddir n0 = Bigger : ends' n0 []
  where
    ends' 1 = (Smaller :)
    ends' n = (Bigger :) . ends' (n - 1) . ends' (n - 1)

Open up a zipper into the lists above (e.g., biglists123). enddir n will give us a list showing which direction to take to get from each subset to the next one (it will have an extra Smaller at the end that we don't need).

enddir 3 = [Bigger,Bigger,Bigger,Smaller,Smaller,Bigger,Smaller,Smaller]

So we can just move the zipper back and forth to grab the elements in order. Note that if it proves advantageous, we can instead work backwards, using

enddirRev :: Int -> [SmallerBigger]
enddirRev n0 = List.drop 1 $ ends' n0 [Smaller]
  where
    ends' 1 = (Bigger :)
    ends' n = ends' (n - 1) . ends' (n - 1) . (Smaller :)

That will tell us which way to move as we work through decreasing lists in reverse lexicographical order.
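
To make the zipper traversal concrete, here is one way enddir could drive the merge, shown on plain lists of lists like biglists123 above and reusing the SmallerBigger and enddir definitions (mergeSubsets is a made-up helper, not part of the proposed implementation):

-- Merge the size-indexed columns (each in lexicographic order) into one
-- lexicographically ordered stream, moving to a larger or smaller size class
-- as enddir directs.
mergeSubsets :: Int -> [[[a]]] -> [[a]]
mergeSubsets n ((x0 : cur0) : larger0) = x0 : go (enddir n) [] cur0 larger0
  where
    -- smaller: partially consumed columns of smaller sizes, nearest first
    -- cur    : what remains of the currently focused column
    -- larger : columns of larger sizes, nearest first
    go (Bigger : ds) smaller cur ((x : c) : larger) =
      x : go ds (cur : smaller) c larger
    go (Smaller : ds) ((x : s) : smaller) cur larger =
      x : go ds smaller s (cur : larger)
    go _ _ _ _ = []   -- the extra trailing Smaller lands here
mergeSubsets _ _ = []

mergeSubsets 3 biglists123 yields [[],[1],[1,2],[1,2,3],[1,3],[2],[2,3],[3]], matching the lexicographic order in the diagram near the top of the thread.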

Making the lists

I'll start with an approximation and refine it some. Represent each list as a sequence of arrays of subsets (that sequence type could be [], or Array Int, or Data.Sequence.Seq; I suspect Array Int will be best, but I'm not sure). Each inner array holds the subsets of the appropriate size that share a particular starting element.

Suppose we have the sequence of arrays of the subsets of size $\lfloor (k - 1)/2 \rfloor$ and a similar one of the ones of size $\lceil (k-1)/2 \rceil$. We need to get left subtrees from one, and right subtrees from the other; I'm not sure which should be which; for big-$O$, either will work. We'll use these to construct the subsets of size $k$. For each starting value (i.e., each element of the sequence of arrays of left subtrees), we produce a list of subsets by choosing each of the possible roots, and for each root choosing each of the possible starting values of right subtrees. This need to limit by starting values is the main reason why we have multiple arrays for each subset size. OK! So ... how do we know which roots are acceptable for a given left subset? We need to know which element (well, position in the underlying set) is the last one in that subset. We could store those separately, but that takes a fair bit of space (between $2^{n-2}$ and $2^{n-1}$ bytes using a UArray Int Word8 if we're very careful, I think). An alternative solution looks very much like enddir above:

zoop :: Int -> [[[Int]]]
zoop n = List.map (List.map ($ [])) . zap . List.map (:) $ [1..n]
  where
    zap :: [[a] -> [a]] -> [[[a] -> [a]]]
    zap [] = []
    zap s@(_ : xs) = s : zap (List.map (List.foldr (.) id) (List.init $ List.tails xs))

For each subset size and each starting position, this produces a list of the first permissible root positions. That is, the position after the last position in each left subset of that size and starting position. We convert that list to an array for compact storage.

Once we have all the arrays, we can convert them to lists, concatenate them, and merge as shown above. As I said, however, this is an approximation. We only actually need to store the subsets of sizes up to $\lfloor n/2\rfloor$. So there's no benefit in actually building arrays for subsets above that size, and doing so is not cheap. So ... let's not. Once we've built all the arrays we need, we can leave the rest as lists, which will be consumed as they are produced.

@meooow25 and @jwaldmann I'm feeling a bit stuck finishing my implementation and probably won't even be able to attempt it this week. Does either of you have the time/energy/brain space to take a shot at it? I'm happy to try to clear up any confusion in my description.

equivalent function that directly works on bitmasks? There must be something ...

[impractical remark of the day]

Knuth: Art of Computer Programming Vol 4A Part 1, Section 7.2.1.3, Exercise 19. We want (the reverse of) the postorder traversal of that tree. But - the given solution (on p. 727) uses a computation that's linear in the bit width, while I was hoping for a constant-time (bit-parallel) solution. Well, if Knuth does not have it - then there's a theorem that this cannot exist at all?

take a shot at it?

I cannot make any promise. I will try to read and understand your code and description, time permitting.

I am more interested in fromDistinctAscListN (#894). Your solution for powerSet would use such a function, in the end?

I've been a bit busy lately. I can get back to this at the end of the week, but unlikely before that.

(to increase the confusion ...) I made a proof-of-concept version of powerset
https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln/-/blob/master/Pow.hs#L46

where

  • all auxiliary computations (middle bit, mask <-> lex order) are on Ints (not Lists),
  • and are cached (but cache is not constructed left-to-right, so laziness is important)
  • the output tree is made from vector slices

It does not use the actual data type from Data.Set but a copy.

These measurements https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln/-/blob/master/3.LOG seem to indicate that computation time for whnf powerset [1..20] goes down by half.

 whnf
    ghc-810.7 (input size 20) Data.Set.powerSet:      OK (5.24s)
      706  ms ±  21 ms
    ghc-810.7 (input size 20) powerSet_set_order:     OK (2.27s)
      289  ms ± 3.0 ms

  whnf
    ghc-906.0 (input size 20) Data.Set.powerSet:      OK (11.68s)
      756  ms ±  26 ms
    ghc-906.0 (input size 20) powerSet_set_order:     OK (3.32s)
      426  ms ±  40 ms

I still am quite skeptical of my code, and my measurements.

I have absolutely no clue what your code is doing. Why calculate sums over the results? Could you test just calculating the results? And do it for inputs of size 20, 21, 22? The other test that would be interesting (I think) is to see how long it takes to delete each element from the power set; that tells us whether the set is arranged nicely.

what your code is doing

it puts the elements of the power set in a vector - in exactly the order that is needed for building the result tree, so we can use vector slices there.

the (hah, one) problem with the code is the global cache (forw_table, backw_table) that should probably go inside powerSet.

Why calculate sums over the results?

(it shows me that foldMap is very much inconsistent over GHC versions ...)

the whnf output I cited does not sum - it's just

bench (f "Data.Set.powerSet") $ whnf S.powerSet (S.fromList [1..w])

as you suggested.

And do it for inputs of size 20, 21, 22?

will do.

to see how long it takes to delete each element from the power set; that tells us whether the set is arranged nicely.

The set is arranged as a Braun tree. Deleting (or even just membership query) will invoke possibly expensive compare on the elements (that is, sets). (And re-balancing. Cost of re-balance will depend on order of deletion? Well perhaps it's amortized.)

It seems likely to me that we want to put the shortest subsets as close as possible to the root. I believe we can do that while maintaining strict (or at least nearly strict) balance. For example, half the subsets start with the first element, so I think the singleton of the second element should probably go at the root—it's $O(\log m)$ to compare that to a set of size $m$. It looks like that sort of trick should work recursively, getting progressively less effective going down and left.
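
To spell out the counting: exactly $2^{n-1}$ of the $2^n$ subsets contain the first element, and in lexicographic order they form one contiguous block immediately after the empty set. So the singleton of the second element lands at 0-based position $1 + 2^{n-1}$, one step past the exact middle; for $n = 5$ that is position $17$ of $32$, which lines up with the index-17-versus-middle-15 observation later in the thread.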

However, I'm more focused on construction time right now.

put the shortest subsets as close as possible to the root.

Ah now I see what you're getting at. At the moment, my tree looks like this

Just [1,4,5]
+- Just [1,2,4]
|  +- Just [1,2,3]
|  |  +- Just [1]
|  |  |  +- Just []
|  |  |  `- Just [1,2]
|  |  `- Just [1,2,3,4,5]
|  |     +- Just [1,2,3,4]
|  |     `- Just [1,2,3,5]
|  `- Just [1,3,4]
|     +- Just [1,2,5]
|     |  +- Just [1,2,4,5]
|     |  `- Just [1,3]
|     `- Just [1,3,5]
|        +- Just [1,3,4,5]
|        `- Just [1,4]
`- Just [2,4,5]
   +- Just [2,3,4]
   |  +- Just [2]
   |  |  +- Just [1,5]
   |  |  `- Just [2,3]
   |  `- Just [2,3,5]
   |     +- Just [2,3,4,5]
   |     `- Just [2,4]
   `- Just [3,4,5]
      +- Just [3]
      |  +- Just [2,5]
      |  `- Just [3,4]
      `- Just [4]
         +- Just [3,5]
         `- Just [4,5]
            `- Just [5]

while you want [2] at the top. OK, the set [2] is represented by the mask 01000 which has index backw 5 (2^3) = 17 which is not too far off from the middle (15).

[EDIT] But I don't see how to make this work in general. I do note the optimization problem here:

given an (ascending, distinct) sequence of elements with weights,
put these (in order) in a size-balanced tree that minimizes total weight, which is the sum, over all sub-trees s, of weight(root(s)) * size(s). In this application, weight is length (because it estimates cost of comparisons). Yeah, well, that looks pretty hopeless...

optimisation problem .. hopeless

maybe not - for this special case (powerset). We are talking about approx. 30 trees here (powerset [] .. powerset [1..30] - that last one has 10^9 elements, more won't fit in memory .. that I can buy ...) that can all be pre-computed.

[EDIT] the proper weight function should be succ . length because we always (could) need one extra comparison for end-of-list.

With that weight function, the (Braun) tree for powerset [1..5] shown above has total weight 509. The following tree has cost 409 and that seems to be optimal

(The first number in each node is the size of the subtree)

(32,[2])
+- (17,[1,3])
|  +- (10,[1,2,3])
|  |  +- (3,[1])
|  |  |  +- (1,[])
|  |  |  `- (1,[1,2])
|  |  `- (6,[1,2,4])
|  |     +- (3,[1,2,3,4,5])
|  |     |  +- (1,[1,2,3,4])
|  |     |  `- (1,[1,2,3,5])
|  |     `- (2,[1,2,5])
|  |        `- (1,[1,2,4,5])
|  `- (6,[1,4])
|     +- (3,[1,3,4,5])
|     |  +- (1,[1,3,4])
|     |  `- (1,[1,3,5])
|     `- (2,[1,5])
|        `- (1,[1,4,5])
`- (14,[3])
   +- (7,[2,4])
   |  +- (4,[2,3,4])
   |  |  +- (1,[2,3])
   |  |  `- (2,[2,3,5])
   |  |     `- (1,[2,3,4,5])
   |  `- (2,[2,5])
   |     `- (1,[2,4,5])
   `- (6,[4])
      +- (3,[3,4,5])
      |  +- (1,[3,4])
      |  `- (1,[3,5])
      `- (2,[5])

(more trees to look at: https://gitlab.imn.htwk-leipzig.de/waldmann/fdaln/-/blob/master/opt)

well this is an academic exercise ... because: for the full tree, each membership query will return True, and inserts are useless, and as the tree becomes less full (by deletions), the nice structure will probably be destroyed by rotations.

If we absolutely want, we could store the weight (expected cost of comparison) in each node, and use ratios of total weight (instead of total size) to guide balancing. But - if we have a search tree where comparisons have noticeable cost, then we're doing something quite questionable already.

possibly related: S.V.Nagaraj: Optimal binary search trees TCS 188 (1) 1997
https://doi.org/10.1016/S0304-3975(96)00320-9

@treeowl:
I understand your idea at a high level, but I'm not able to follow the code you posted above.

Is enddir supposed to give the comparison of adjacent sets in the flattened list of sets? This doesn't seem to be so because it says [1,3] is smaller than [1,2]. And even if we had that, I don't understand how it helps in merging.

I also don't understand the output of zoop. zoop 5 = [[[1],[2],[3],[4],[5]],[[2,3,4,5],[3,4,5],[4,5],[5]],[[3,4,5,4,5,5],[4,5,5],[5]],[[4,5,5,5],[5]],[[5]]]. What does this mean?


I played a bit with my solution and tried to optimize it. Your suggestions were helpful, especially about avoiding thunks in sets.
Here's what I got it down to: link, and I'm curious if it can be optimized further.

And although the decrease is nice and it has been fun to mess with, the code is waaay too complicated for little benefit, and I don't think it would be a good idea to replace the current code with this.

@meooow25 The poorly named enddir indicates whether the size of each subset is one greater or one less than the size of the last one. So we'd see Bigger going from [1] to [1,2] and Smaller going from [1,2,3] to [1,3].

zoop 7 !! 3 !! 2 relates to subsets of size 3 that begin with element number 2 (counting from 0) of an underlying set of size 7. Each element is the number of the first element that could be used as the root given that left subtree. This zoop list is meant to be zipped with the corresponding array of subsets. I don't want to calculate a real example right now, but if you had subsets of [0..20] of size 3 that (counterfactually) looked like

[0,2,9]
[2,5,7]
[3,4,5]

then the corresponding zoop values would be 10, 8, and 6. So for the first subset shown, the possible root positions would be [10] (anything bigger will run out of values for the right subtree). For the second, possible root values would be [8..10], and for the third, possible root values would be [6..10].

Question: is there an easy way to get each zoop value to be something more useful/meaningful than a number? Conceptually, it should be a list of (root, [Set a]) pairs, where each root is paired with all the subsets of the appropriate size that can be its right children.


Note: my intuition keeps suggesting that we should build the power set in reverse, but I don't have proper evidence for that intuition.