StreamMap slow?
lmcd opened this issue
This is a WIP issue as I plan to investigate further.
I'm working on a swift-nio project that has an emphasis on large file transfers. Getting a stream of bytes out of the server is very fast, but uploading a stream is around 3x slower, even when the incoming buffers aren't handled or stored at all.
I've profiled in Instruments and see a lot of traffic (around 15%) in StreamMap that's just trying to pick a struct out of a collection based on a stream ID. At first glance, it seems to me that we can probably do away with CircularBuffer in favour of something simpler. I think the fact that stream IDs start at zero and increment from there (but may arrive out of order) potentially opens up some optimisation opportunities.
Note that this performance bottleneck was noticed in debug builds. I'll report back with release build findings.
Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it provides extra layers of performance cost that are not optimised out in debug builds. In release builds, however, it vanishes.
I think the fact that stream IDs start at zero, and increment from there
Incidentally, this is why CircularBuffer is used: it allows the storage to be totally ordered, without needing to be frequently "compacted".
One note: we could replace CircularBuffer with Deque from Swift Collections.
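For reference, a rough sketch of what that swap could look like (assuming a dependency on the swift-collections package; the Int32 elements are just stand-ins for real per-stream state):

```swift
import DequeModule  // from the swift-collections package

// Hypothetical stand-in for per-stream state keyed by stream ID.
var openStreams: Deque<Int32> = []

openStreams.append(1)       // streams open in increasing ID order...
openStreams.append(3)
openStreams.removeFirst()   // ...and usually close oldest-first, so both ends stay cheap
```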
Just reading the header comment in StreamMap (which I should have done before posting this issue, sorry): it notes that a tradeoff was made resulting in slower lookups, but assumes the number of streams is low. I think if you're totally blasting a server with many gigs/s in uploads over a fast connection, which is our use case, stream IDs per connection could actually wind up being quite substantial. (I think I may have misunderstood how HTTP/2 streams work here.)
It also dismisses dictionaries, due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable, and simply provide the underlying Int32 as the hashValue?
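Something like this, as a rough sketch (using a simplified stand-in type rather than the real HTTP2StreamID):

```swift
// Rough sketch with a stand-in type, not the real HTTP2StreamID:
// hashing just forwards to the underlying Int32, so the hash is a single integer mix.
struct StreamID: Hashable {
    let rawValue: Int32

    func hash(into hasher: inout Hasher) {
        hasher.combine(rawValue)
    }
}

// A dictionary keyed by stream ID would then be a one-line lookup:
var streams: [StreamID: String] = [:]
streams[StreamID(rawValue: 1)] = "upload in progress"
```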
Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it provides extra layers of performance cost that are not optimised out in debug builds. In release builds, however, it vanishes.
I've noticed release builds being way faster too, so maybe all of this is a non-issue. I'll check when I have time 👍
It also dismisses dictionaries, due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable, and simply provide the underlying Int32 as the hashValue?
This totally works, but the tradeoff isn't straightforward. For small dictionaries (in this case, fewer than 100 elements), it's cheaper to do a linear array scan to find the element than it is to hash the key and do a jump. This is fairly non-intuitive but comes down to the way branch predictors and cache preloaders work: the memory subsystem can observe and optimise this access pattern, meaning that we're never waiting for memory, in a way that it cannot do with hashing. The data structure is therefore optimised for small size. For larger sizes, we do binary search instead, which is still extremely cheap.
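As a rough illustration of that strategy (a simplified sketch, not the actual StreamMap code): keep the backing array sorted by stream ID, scan linearly below a size threshold, and binary-search above it.

```swift
// Simplified sketch of the described lookup strategy; not the actual StreamMap code.
struct SortedStreamStore<Value> {
    // Backing array kept sorted by stream ID.
    private var storage: [(id: Int32, value: Value)] = []
    private static var linearScanLimit: Int { 100 }

    mutating func insert(_ value: Value, forID id: Int32) {
        // Keep the array sorted; new stream IDs usually land at (or near) the end.
        let index = storage.firstIndex { $0.id > id } ?? storage.count
        storage.insert((id, value), at: index)
    }

    func value(forID id: Int32) -> Value? {
        if storage.count < Self.linearScanLimit {
            // Small maps: a straight scan is cheap because the branch predictor
            // and cache preloader can see the sequential access pattern.
            return storage.first { $0.id == id }?.value
        }
        // Larger maps: binary search over the sorted array, still no hashing required.
        var low = 0
        var high = storage.count - 1
        while low <= high {
            let mid = (low + high) / 2
            if storage[mid].id == id { return storage[mid].value }
            if storage[mid].id < id { low = mid + 1 } else { high = mid - 1 }
        }
        return nil
    }
}
```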
Thanks for the technical explanation! I'm glad a lot of thought went into this.
Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!
Thanks for the technical explanation! I'm glad a lot of thought went into this.
Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!
The _backingCheck is just doing some (as you noticed) debug-build-only checking that may help to catch bugs early. It uses _UInt24 because we happened to have 24 spare bits in the data structure, so I thought: why not (in debug builds) make use of those bits and store some extra information that we can then verify all the time? If I had used a normal UInt32 then the data structure would've grown (in debug & release builds), and if I had used UInt16 then we'd still have wasted 8 bits.
Concretely, those 24 bits store the length of the buffer so that we can detect if some code illegally stores and reuses buffer offsets across resizes. That's illegal because (like many other data structures) the actual indexes get invalidated if you add/remove elements.
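Roughly, the idea looks like this (a simplified sketch, not the actual CircularBuffer code; it uses a full-width count field and assert instead of the packed _UInt24):

```swift
// Simplified sketch of the idea; the real CircularBuffer packs the count into
// 24 spare bits (_UInt24) so the index doesn't grow, and only checks in debug builds.
struct CheckedIndex {
    let offset: Int
    let countWhenCreated: Int  // stand-in for the packed _backingCheck value
}

struct CheckedBuffer<Element> {
    private var elements: [Element] = []

    func index(at offset: Int) -> CheckedIndex {
        CheckedIndex(offset: offset, countWhenCreated: elements.count)
    }

    subscript(index: CheckedIndex) -> Element {
        // `assert` is compiled out in release builds, so this costs nothing there.
        // If the buffer was resized after the index was created, trap early rather
        // than silently reading the wrong element.
        assert(index.countWhenCreated == elements.count,
               "index was created before a resize and is no longer valid")
        return elements[index.offset]
    }

    mutating func append(_ element: Element) {
        elements.append(element)
    }
}
```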