Better partitioning in the bulk loading algorithm

mourner opened this issue 7 years ago

Volodymyr Agafonkin commented 7 years ago

Currently, the bulk loading algorithm partitions each node into approximately sqrt(N) x sqrt(N) child nodes. This becomes a problem when a node's bounding box is not square: the child nodes get narrower the deeper you go. I noticed this when looking at the visualization for a rectangular data space:

Notice the very narrow rectangles at the bottom. We could fix this by designing an algorithm that picks a K x M partitioning based on the aspect ratio of the node, so that child nodes approach a square shape no matter how narrow the parent is. This should improve query performance on bulk-loaded trees.
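
To illustrate the K x M idea, here is a minimal sketch in Python. It is not code from this project. The function name `pick_grid` and its parameters are hypothetical, and it assumes the node's extent is known as a width and a height. It tries every column count and keeps the grid whose cells are closest to square.

```python
import math


def pick_grid(num_children, width, height):
    """Return (cols, rows) with cols * rows >= num_children.

    Hypothetical helper: picks the grid whose cells are closest to square
    for a node of the given width and height.
    """
    if num_children < 1 or width <= 0 or height <= 0:
        raise ValueError("num_children, width and height must be positive")

    best = None
    for cols in range(1, num_children + 1):
        # Fewest rows that still give at least num_children cells.
        rows = math.ceil(num_children / cols)
        cell_w = width / cols
        cell_h = height / rows
        # Log of the cell aspect ratio: 0 for a square cell, and the same
        # magnitude for a cell that is twice as wide or twice as tall.
        skew = abs(math.log(cell_w / cell_h))
        if best is None or skew < best[0]:
            best = (skew, cols, rows)
    return best[1], best[2]


print(pick_grid(9, 100, 100))  # square node
print(pick_grid(9, 900, 100))  # wide node
```

For a square node this falls back to the usual sqrt(N) x sqrt(N) split, while a node nine times wider than it is tall gets a single row of nine cells. The linear scan over column counts is chosen for clarity, not speed.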

cc @danpat

Volodymyr Agafonkin · Answer 1 · Wed Apr 26 2017 04:22:35 GMT+0800 (China Standard Time)

Making good progress on this. Before and after:

eric-corumdigital · Answer 2 · Wed Nov 27 2019 03:32:48 GMT+0800 (China Standard Time)

Were your improvements here merged into master?

Volodymyr Agafonkin · Answer 3 · Wed Nov 27 2019 17:21:16 GMT+0800 (China Standard Time)

No. The approach above was flawed (it made bulk-load performance worse), and I never figured out how to get around that. Maybe I'll try again some time.

Volodymyr Agafonkin · Answer 4 · Wed Nov 27 2019 17:32:31 GMT+0800 (China Standard Time)

Pushed the work-in-progress code I had to a7047e9. Feel free to poke around. As far as I recall, there were two issues:

Despite the tree looking much better visually, I couldn't measure a meaningful improvement in search query performance in benchmarks. Maybe I measured it wrong, though.
I didn't like having to recalculate the bounding box for all items on each iteration. This didn't feel right, although I never found an alternative.
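
To make the second issue concrete, here is a minimal sketch of such a recalculation in Python. It is not the code from a7047e9. The name `bounding_box` is hypothetical, and items are assumed to be `(min_x, min_y, max_x, max_y)` tuples. Each call scans every item it is given, so running it at every partitioning step adds a linear pass over the items per step.

```python
def bounding_box(items):
    """Return (min_x, min_y, max_x, max_y) enclosing every box in `items`.

    Hypothetical helper: each item is assumed to be a
    (min_x, min_y, max_x, max_y) tuple.
    """
    if not items:
        raise ValueError("items must not be empty")

    min_x = min_y = float("inf")
    max_x = max_y = float("-inf")
    # One pass over all items: the cost grows linearly with len(items).
    for x0, y0, x1, y1 in items:
        min_x = min(min_x, x0)
        min_y = min(min_y, y0)
        max_x = max(max_x, x1)
        max_y = max(max_y, y1)
    return (min_x, min_y, max_x, max_y)


print(bounding_box([(0, 0, 1, 1), (2, -1, 3, 4)]))
```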