attaswift / BTree

Fast sorted collections for Swift using in-memory B-trees

BTreeIterator is very slow

timvermeulen opened this issue

You mention that forEach is more efficient than a regular for loop for BTrees, which isn't really ideal. It's true that it's a lot easier to write an efficient forEach implementation than to write an efficient iterator, but it's possible to write an efficient iterator! As a proof of concept, here's an improved BTreeIterator:

public class BTreeIterator<Key: Comparable, Value>: IteratorProtocol {
    public typealias Element = (Key, Value)
    typealias Node = BTreeNode<Key, Value>

    let node: Node

    // Iterator over the child subtree preceding `node.elements[index]`;
    // nil while positioned on an element of `node` itself.
    var iterator: BTreeIterator?
    var index = -1

    init(root: Node) {
        node = root
        incrementChildIterator()
    }

    private func incrementChildIterator() {
        index += 1
        if !node.isLeaf && index < node.children.count {
            iterator = node.children[index].makeIterator()
        }
    }

    public func next() -> Element? {
        if let childIterator = iterator {
            if let nextElement = childIterator.next() {
                return nextElement
            }
            else {
                // Child subtree exhausted; fall through to this node's own element.
                iterator = nil
                return next()
            }
        }
        else {
            guard index < node.elements.count else { return nil }

            // Yield the current element, then step into the next child subtree.
            defer { incrementChildIterator() }
            return node.elements[index]
        }
    }
}

My benchmarks now show no significant performance difference between for-in and forEach. Let me know what you think!

This is neat! Reminds me a little of the segmented iterators article (http://lafstern.org/matt/segmented.pdf).

The iterator should be a COW value type, so it needs a simple wrapper struct; but other than that, this looks great.
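For concreteness, the wrapper could be shaped roughly like this (names are illustrative, and copy() stands in for a hypothetical deep-clone of the class-based iterator above; this is a sketch, not the package's actual API):

public struct BTreeIteratorWrapper<Key: Comparable, Value>: IteratorProtocol {
    public typealias Element = (Key, Value)

    // Reference-typed state: the class-based iterator sketched above.
    private var base: BTreeIterator<Key, Value>

    init(base: BTreeIterator<Key, Value>) {
        self.base = base
    }

    public mutating func next() -> Element? {
        // next() mutates the class, so clone it first if another copy of this
        // struct still shares it (copy() is a hypothetical clone helper).
        if !isKnownUniquelyReferenced(&base) {
            base = base.copy()
        }
        return base.next()
    }
}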

BTreePath shouldn't be too slow, though. Maybe it should be refactored to speed it up as well.

I think it's time for me to set up a benchmarking build target (issue #5) — it's hard to reason about performance without verifiable measurements.

Interesting article; it's indeed very similar.

When I implemented in-order iteration on a binary search tree, I first went with recursive iteration (like the snippet above), but using a stack turned out to be way more efficient, because it avoided allocating an iterator for every single node. This seems less of a problem for B-trees because of the relatively low number of nodes. Still, it might be worth exploring a stack here as well, since it would also allow using a struct rather than a class.
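Roughly, a stack-based version could look like this; it's only a sketch, assuming the same BTreeNode shape (isLeaf, elements, children) as the snippet above, with each level keeping its own element index so no per-node iterator objects are allocated:

struct BTreeStackIterator<Key: Comparable, Value>: IteratorProtocol {
    typealias Element = (Key, Value)
    typealias Node = BTreeNode<Key, Value>

    // Each entry is a node plus the index of the next element to yield in it.
    private var stack: [(node: Node, index: Int)]

    init(root: Node) {
        stack = [(node: root, index: 0)]
        descendToLeftmostLeaf()
    }

    private mutating func descendToLeftmostLeaf() {
        while let top = stack.last, !top.node.isLeaf {
            stack.append((node: top.node.children[top.index], index: 0))
        }
    }

    mutating func next() -> Element? {
        while let top = stack.last {
            if top.index < top.node.elements.count {
                let element = top.node.elements[top.index]
                stack[stack.count - 1].index += 1
                if !top.node.isLeaf {
                    // After yielding elements[i], visit the subtree at children[i + 1].
                    stack.append((node: top.node.children[top.index + 1], index: 0))
                    descendToLeftmostLeaf()
                }
                return element
            }
            // This node is exhausted; resume in its parent.
            stack.removeLast()
        }
        return nil
    }
}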

I was working on this some more, and it's looking really good. The only downside is that this approach without BTreeStrongPath doesn't really support makeIterator(from:), makeIterator(fromOffset:) and makeIterator(from:choosing:).

How big of a deal is this? I find it a bit counterintuitive to call makeIterator() manually, so I'm not sure what I would use these functions for. Wouldn't it make more sense for the user to first slice the tree and then iterate over it?
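For example, something along these lines (an illustrative sketch that relies only on List's integer-indexed Collection slicing, not on any iterator-seeking API):

let list = List(0 ..< 1_000)

var sum = 0
for element in list[100 ..< 200] {   // slice first, then use a plain for-in
    sum += element
}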

Adding those methods seemed like a good idea at the time, but I haven't actually found them very useful in practice. I don't think it'd be a big deal to remove them at all—the package's API surface is rather too large anyway. (Although if necessary I believe we could also write code to convert to/from your iterators and strong paths.) The same goes for the stoppable forEach variant: if iteration is fast, there is no need to include it.

I'm still surprised BTreeStrongPath turned out to be so slow — it doesn't really do anything that should cost much performance. It does keep track of the current offset, but it has the current node immediately available, which prevents the need for recursion on every iteration — so I'd've expected it to have comparable performance.

I had to work on SortedBag for another project, but now that's done I can hopefully start benchmarking soon — I really want to understand what's going on there.

I'll try to see if I can find out what's so slow about the current indexing as well. Meanwhile, maybe it's a good idea to add a 5.0 branch?

Ah, indeed, a new branch is a good idea.

OK, so I added a really simple iteration benchmark. It reproduces your results; iteration is indeed much slower than I expected:

[benchmark chart: screen shot 2016-11-08 at 23 54 47]

What's remarkable is that BTreeIterator is even slower than traversing the tree using cursors.

I did a bunch of tests, and found something really interesting. If I copy the entire library to my project (so no import), and I run the tests under release mode from the command line, the 4.0.1 iterator is ridiculously fast. Ten times as fast as forEach.

Strangely enough, running the tests in Xcode itself doesn't show anything of the sort, as 1. forEach beats the iterator there, and 2. both are way slower than in the command line (forEach is about 4 times as slow, the iterator is about 200 (!) times as slow). I'm doing all tests in the standard release configuration, so I don't understand where these differences come from.

Could you try to reproduce any of this?

In-module performance is expected to be ridiculously fast (see below), but wow, forEach shouldn’t ever be slower than BTreeIterator -- these surprises make benchmarking fun. :-)

Compiling BTree in the same module as your source code allows whole-module optimization to kick in, which enables generic specialization. 4x-200x speedups are right in the expected range. (See the light green vs. dark green lines in the README's benchmarks, for example; also see the section on imported generics linked below.)

https://github.com/lorentey/BTree#remark-on-performance-of-imported-generics

This is a serious limitation for collection packages like BTree; unfortunately, we don’t have a way around it at this time. I think there is a way to specify a list of type parameters for which specific methods should be pre-specialized during module compilation (with @_specialize attributes), but AFAICT it’s not public API, and it might not work correctly yet. I haven’t tried it yet, but it would be worth a look.
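For reference, the attribute is applied roughly like this. This is a sketch only: @_specialize is underscored and unofficial, its spelling has changed between Swift releases, and the function below is a made-up stand-in rather than BTree code.

@_specialize(where Element == Int)
@_specialize(where Element == String)
public func linearSearch<Element: Equatable>(for target: Element, in elements: [Element]) -> Int? {
    // The compiler emits pre-specialized entry points for the listed concrete
    // types, so cross-module callers using them avoid the generic slow path.
    for (offset, element) in elements.enumerated() where element == target {
        return offset
    }
    return nil
}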

You don’t see the same effect for stdlib types like Array because stdlib is essentially compiled anew with each module. This won’t remain the case forever: once we have stable ABI, stdlib might become a normal Swift library, and Swift will have to provide new solutions to keep Array/Dictionary/etc performant. Perhaps third-party libraries will get to use those as well. (And if not, we can just propose to add new collections to stdlib.)

Compiled as an external module, BTree's algorithmic improvements can still easily run laps around stdlib’s collection types -- but it takes quite a lot of elements for this to become measurable. For operations that don’t have algorithmic improvements (like iteration), a cross-module collection is unlikely to get close to what Array et al. provide, at least not in Swift 3 -- but we can often reduce the 200x slowdown to ~10x with carefully chosen API. (Cursor is a good example of this, although it’s nowhere near 10x.)

None of this is really worth worrying much about -- cross-module Swift collections should still be able to beat (or at least match) the performance of the best Objective-C collections; and we’ve been happily using those for decades.

However, optimizing for cross-module performance is very different from optimizing for the case where generic specialization is available. I’m not sure which kind we should optimize for in BTree’s case: I expect the vast majority of people will just use it via CocoaPods/Carthage/SPM, but there will definitely be some who’ll want to “inline” it into one of their modules for a speed boost.

Right, so I was aware of the optimisation limitations of imported generics. I think it's a shame that it probably won't be fixed very soon, but I don't doubt that it will be fixed eventually, so it would probably make the most sense to optimise for that (especially since people can already compile the library in their own module if they want to).

However, the 4x-200x differences I measured were between the command line and Xcode, not between importing and not importing. But I found out that I forgot to turn on whole-module optimisation, so that explains a lot!

Still, it's a big mystery why the iterator beats forEach (in my tests). Here's my benchmark:

import Foundation  // for Date
import BTree       // List comes from this package; omit if its sources are compiled into the same module

let list = List(0 ..< 100_000_000)
let result = 4999999950000000

let start = Date()

var sum = 0
for element in list {
    sum += element
}
print(sum == result) // true

let end = Date()
print(end.timeIntervalSince(start))

It runs in ~0.7 sec, while using forEach instead takes nearly 7 seconds. Do you see anything wrong with this particular benchmark?
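(For clarity, the forEach variant here is the same setup with just the loop replaced along these lines:)

var sum = 0
list.forEach { element in
    sum += element
}
print(sum == result) // true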


@lorentey are the recommendations under https://github.com/lorentey/BTree#remark-on-performance-of-imported-generics still valid for Swift 5? It's been 3+ years since the last comment on this thread. Curious if copying the BTree project into a module is still required for a speed boost.

The technical limitation was resolved by the introduction of @inlinable, but this package has not adopted it.
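As a sketch of what adoption would involve (SortedPairs below is a made-up stand-in, not this package's code): annotating hot generic entry points as @inlinable, and any internal state they touch as @usableFromInline, lets a client's optimizer specialize them across the module boundary.

public struct SortedPairs<Key: Comparable, Value> {
    @usableFromInline
    internal var storage: [(Key, Value)] = []

    public init() {}

    @inlinable
    public func forEachPair(_ body: ((Key, Value)) throws -> Void) rethrows {
        // The body is serialized into the module interface, so an optimized
        // client build can inline and specialize it for concrete Key/Value.
        for pair in storage {
            try body(pair)
        }
    }
}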

We're working on a new "official" implementation in apple/swift-collections#1 -- I expect that work will replace this package.