High contention when using arena allocators
gootorov opened this issue
Hi!
I've been testing the recently added feature from #155.
Overall, the performance improvement from that feature is great. However, there seems to be an issue with scaling. We're using the default CPU EP, and we have more than 20 models (sessions) that are shared (Arc'ed) between all threads and on which we call run concurrently from all worker threads (note: each thread does not run an inference request on every model, but chooses a specific one depending on certain conditions).
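For reference, the sharing pattern looks roughly like this. This is a minimal std-only sketch: the Session type here is a hypothetical stand-in for ort's session, not the real API, and run is a placeholder for inference.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for an ort session (the real type is
// internally synchronized, which is what allows sharing via Arc).
struct Session {
    name: &'static str,
}

impl Session {
    // Placeholder for actual inference.
    fn run(&self, input: u64) -> u64 {
        input.wrapping_mul(31).wrapping_add(self.name.len() as u64)
    }
}

fn main() {
    // More than 20 models in the real setup; 3 here for brevity.
    let models: Arc<Vec<Session>> = Arc::new(vec![
        Session { name: "model_a" },
        Session { name: "model_b" },
        Session { name: "model_c" },
    ]);

    let handles: Vec<_> = (0..8usize)
        .map(|t| {
            let models = Arc::clone(&models);
            thread::spawn(move || {
                // Each worker picks a specific model based on some condition.
                let session = &models[t % models.len()];
                session.run(t as u64)
            })
        })
        .collect();

    for h in handles {
        let _ = h.join().unwrap();
    }
    println!("done");
}
```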
As the number of threads increases, I see an increase in system (kernel) CPU load. At 88 threads, our system CPU load increased from <5% to 12-15%. strace showed that ~90% of kernel time is spent in futex syscalls. Take a look at what perf shows:
I'm assuming that if we had a single shared model, then the contention would be even higher.
There are essentially no other futex syscalls in the whole flamegraph (unfortunately, I cannot share the raw .svg, sorry about that).
Then I stumbled upon the following documentation (the "Share allocator(s) between sessions" part): https://onnxruntime.ai/docs/get-started/with-c.html#features
I hypothesized that if there's a global session object and many threads are calling run on it, then run could be getting stuck on some kind of arena mutex. I then tried changing the application to have a session (or sessions) per worker thread instead of shared ones. If sessions have their own local arena, I expected to see increased memory usage but reduced contention.
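The per-thread variant I tried is essentially the following sketch. Again, Session is a hypothetical stand-in, not the real ort type; the point is only that each worker thread lazily builds its own session, so no session object is shared across threads.

```rust
use std::cell::RefCell;
use std::thread;

// Hypothetical stand-in for an ort session (not the real API).
struct Session;

impl Session {
    fn load() -> Self {
        // Placeholder for building a session from a model file.
        Session
    }

    fn run(&self, input: u64) -> u64 {
        // Placeholder for actual inference.
        input + 1
    }
}

thread_local! {
    // Each worker thread gets its own session; if each session owned
    // its own arena, there would be nothing for threads to contend on.
    static SESSION: RefCell<Session> = RefCell::new(Session::load());
}

fn main() {
    let handles: Vec<_> = (0..4u64)
        .map(|t| thread::spawn(move || SESSION.with(|s| s.borrow().run(t))))
        .collect();

    for h in handles {
        let _ = h.join().unwrap();
    }
    println!("done");
}
```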
Unfortunately, pretty much nothing changed, and the before/after flamegraphs look more or less identical.
So, I'm not familiar with ONNX Runtime's internals, but could it be that the arena allocator is shared between all sessions by default? Do you think it makes sense to make that configurable? Is it an arena mutex at all, or is my assumption simply wrong? I'm assuming it is an arena mutex because these syscalls show up in Value::from_array, drop calls, etc.
Also, somewhat related, take a look at the zoomed-in Session::run:
There are two Drop::drop calls; zooming in on them:
Again, I'm not familiar with ONNX Runtime's internals, but arenas have to reset their chunk pointer at some point, and when new values are written, the old memory simply gets overwritten. As such, it makes sense (at least in other cases where I've used arenas) to avoid calling Drop at all. With that in mind, does it make sense to skip calling ReleaseMemoryInfo/ReleaseValue entirely if the allocator is an arena? That could be a nice optimization.
Line 146 in d1ae982
Line 703 in d1ae982
Though, it would be easy to create UB if the MemoryInfo/Value isn't tied to the arena's lifetime and the memory in the arena gets overwritten.
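The lifetime hazard can be shown with a toy bump arena (this is an illustrative sketch, not ONNX Runtime's allocator): if a handle into the arena outlives a reset, it silently observes whatever the next allocation wrote there.

```rust
// Toy bump arena: "freeing" is just resetting the bump pointer,
// so old data survives until a later allocation overwrites it.
struct Arena {
    buf: Vec<u8>,
    offset: usize,
}

impl Arena {
    fn new(cap: usize) -> Self {
        Arena { buf: vec![0; cap], offset: 0 }
    }

    // Allocate n bytes, returning the range into the arena's buffer.
    fn alloc(&mut self, n: usize) -> std::ops::Range<usize> {
        let start = self.offset;
        self.offset += n;
        start..self.offset
    }

    // Reset the bump pointer without touching the bytes.
    fn reset(&mut self) {
        self.offset = 0;
    }
}

fn main() {
    let mut arena = Arena::new(16);
    let slot = arena.alloc(4);
    arena.buf[slot.clone()].copy_from_slice(&[1, 2, 3, 4]);

    // Reset instead of properly releasing the value...
    arena.reset();

    // ...and a new allocation reuses the same bytes.
    let slot2 = arena.alloc(4);
    arena.buf[slot2].copy_from_slice(&[9, 9, 9, 9]);

    // The old range now sees the new data. In safe Rust this is
    // merely a stale read; through a raw pointer (as with a C API
    // handle that skipped ReleaseValue) it would be UB.
    let stale = &arena.buf[slot];
    assert_eq!(stale, &[9u8, 9, 9, 9][..]);
    println!("stale read: {:?}", stale);
}
```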
You're right. I'm not seeing any futex calls in ort stacks anymore. Kernel load is now at ~1-1.5%.
Thank you!