StreamMap slow?
lmcd opened this issue
This is a WIP issue as I plan to investigate further.
I'm working on a swift-nio project that has an emphasis on large file transfers. Getting a stream of bytes out of the server is very fast, but uploading a stream is around 3x slower, even when the incoming buffers aren't handled or stored at all.
I've profiled in Instruments and see a lot of traffic (around 15%) in StreamMap that's just trying to pick a struct out of a collection based on a stream ID. At first glance, it seems to me that we can probably do away with CircularBuffer in favour of something simpler. I think the fact that stream IDs start at zero and increment from there (but may arrive out of order) potentially opens up some optimisation opportunities.
Note that this performance bottleneck was noticed in debug builds. I'll report back with release build findings.
Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it provides extra layers of performance cost that are not optimised out in debug builds. In release builds, however, it vanishes.
I think the fact that stream IDs start at zero, and increment from there
Incidentally, this is why CircularBuffer is used: it allows the storage to be totally ordered, without needing to be frequently "compacted".
One note: we could replace CircularBuffer with Deque from Swift Collections.
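For reference, a rough sketch of what that swap could look like (assuming a dependency on the swift-collections package; the Int32 elements are just stand-ins for real per-stream state):

```swift
import DequeModule  // from the swift-collections package

// Hypothetical stand-in for per-stream state keyed by stream ID.
var openStreams: Deque<Int32> = []

openStreams.append(1)       // streams open in increasing ID order...
openStreams.append(3)
openStreams.removeFirst()   // ...and usually close oldest-first, so both ends stay cheap
```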
Just reading the header comment in StreamMap (which I should have done before posting this issue, sorry): it notes that a tradeoff was made resulting in slower lookups, but assumes the number of streams is low. I think if you're totally blasting a server with many gigs/s in uploads over a fast connection, which is our use case, stream IDs per connection could actually wind up being quite substantial. (I think I may have misunderstood how HTTP/2 streams work here.)
It also dismisses dictionaries, due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable, and simply provide the underlying Int32 as the hashValue?
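Something like this, as a rough sketch (using a simplified stand-in type rather than the real HTTP2StreamID):

```swift
// Rough sketch with a stand-in type, not the real HTTP2StreamID:
// hashing just forwards to the underlying Int32, so the hash is a single integer mix.
struct StreamID: Hashable {
    let rawValue: Int32

    func hash(into hasher: inout Hasher) {
        hasher.combine(rawValue)
    }
}

// A dictionary keyed by stream ID would then be a one-line lookup:
var streams: [StreamID: String] = [:]
streams[StreamID(rawValue: 1)] = "upload in progress"
```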
Thanks for filing this: I think the debug builds are vastly overestimating the cost of the StreamMap. CircularBuffer is an abstraction over Array, and so it provides extra layers of performance cost that are not optimised out in debug builds. In release builds, however, it vanishes.
I've noticed release builds being way faster too, so maybe all of this is a non-issue. I'll check when I have time 👍
It also dismisses dictionaries, due to the computational load of computing hashes. But couldn't we just make HTTP2StreamID Hashable, and simply provide the underlying Int32 as the hashValue?
This totally works, but the tradeoff isn't straightforward. For small dictionaries (in this case, fewer than 100 elements), it's cheaper to do a linear array scan to find the element than it is to hash the key and do a jump. This is fairly non-intuitive but comes down to the way branch predictors and cache preloaders work: the memory subsystem can observe and optimise this access pattern, meaning that we're never waiting for memory, in a way that it cannot do with hashing. The data structure is therefore optimised for small size. For larger sizes, we do binary search instead, which is still extremely cheap.
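As a rough illustration of that strategy (a simplified sketch, not the actual StreamMap code): keep the backing array sorted by stream ID, scan linearly below a size threshold, and binary-search above it.

```swift
// Simplified sketch of the described lookup strategy; not the actual StreamMap code.
struct SortedStreamStore<Value> {
    // Backing array kept sorted by stream ID.
    private var storage: [(id: Int32, value: Value)] = []
    private static var linearScanLimit: Int { 100 }

    mutating func insert(_ value: Value, forID id: Int32) {
        // Keep the array sorted; new stream IDs usually land at (or near) the end.
        let index = storage.firstIndex { $0.id > id } ?? storage.count
        storage.insert((id, value), at: index)
    }

    func value(forID id: Int32) -> Value? {
        if storage.count < Self.linearScanLimit {
            // Small maps: a straight scan is cheap because the branch predictor
            // and cache preloader can see the sequential access pattern.
            return storage.first { $0.id == id }?.value
        }
        // Larger maps: binary search over the sorted array, still no hashing required.
        var low = 0
        var high = storage.count - 1
        while low <= high {
            let mid = (low + high) / 2
            if storage[mid].id == id { return storage[mid].value }
            if storage[mid].id < id { low = mid + 1 } else { high = mid - 1 }
        }
        return nil
    }
}
```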
Thanks for the technical explanation! I'm glad a lot of thought went into this.
Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!
Thanks for the technical explanation! I'm glad a lot of thought went into this.
Diving a bit deeper, I'm seeing that a lot of the work is actually coming from debugOnly calls, so yeah, seems like a false alarm. Notably, setting _backingCheck. Now I'm wondering why _UInt24 is used, but that's a rabbit hole I should probably avoid for today!
The _backingCheck is just doing some (as you noticed) debug-build-only checking that may help to catch bugs early. It uses _UInt24 because we happened to have 24 spare bits in the data structure, so I thought: why not (in debug builds) make use of those bits and store some extra information that we can then verify all the time? If I had used a normal UInt32 then the data structure would've grown (in debug & release builds), and if I had used UInt16 then we'd still have wasted 8 bits.
Concretely, those 24 bits store the length of the buffer so that we can detect if some code illegally stores and reuses buffer offsets across resizes. That's illegal because (like many other data structures) the actual indexes get invalidated if you add/remove elements.
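Roughly, the idea looks like this (a simplified sketch, not the actual CircularBuffer code; it uses a full-width count field and assert instead of the packed _UInt24):

```swift
// Simplified sketch of the idea; the real CircularBuffer packs the count into
// 24 spare bits (_UInt24) so the index doesn't grow, and only checks in debug builds.
struct CheckedIndex {
    let offset: Int
    let countWhenCreated: Int  // stand-in for the packed _backingCheck value
}

struct CheckedBuffer<Element> {
    private var elements: [Element] = []

    func index(at offset: Int) -> CheckedIndex {
        CheckedIndex(offset: offset, countWhenCreated: elements.count)
    }

    subscript(index: CheckedIndex) -> Element {
        // `assert` is compiled out in release builds, so this costs nothing there.
        // If the buffer was resized after the index was created, trap early rather
        // than silently reading the wrong element.
        assert(index.countWhenCreated == elements.count,
               "index was created before a resize and is no longer valid")
        return elements[index.offset]
    }

    mutating func append(_ element: Element) {
        elements.append(element)
    }
}
```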