pykeio / ort

Fast ML inference & training for Rust with ONNX Runtime

Home Page: https://ort.pyke.io/

High contention when using arena allocators

gootorov opened this issue

Hi!

I've been testing the recently added feature from #155.

Overall, the performance improvement from that feature is great. However, it seems there's an issue with scaling. We're using the default CPU EP and we have more than 20 models (sessions) that are shared (Arc'ed) between all threads and on which we're calling run concurrently from all worker threads (note: each thread does not run an inference request on every model, but chooses a specific one depending on certain conditions).
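For context, here is a minimal sketch of the sharing pattern described above, assuming an ort 2.x-style API (the model path and thread count are placeholders; in reality there are 20+ models and 88 worker threads):

    use std::sync::Arc;
    use std::thread;

    use ort::session::Session;

    fn main() -> ort::Result<()> {
        // One of the ~20 sessions; each worker actually picks a model based on its input.
        let session = Arc::new(Session::builder()?.commit_from_file("model.onnx")?);

        let workers: Vec<_> = (0..88)
            .map(|_| {
                let session = Arc::clone(&session);
                thread::spawn(move || {
                    // Each worker calls `run` concurrently on the shared session, e.g.:
                    // let outputs = session.run(ort::inputs![...]?)?;
                    let _ = &session;
                })
            })
            .collect();

        for worker in workers {
            worker.join().unwrap();
        }
        Ok(())
    }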
As the number of threads increases, I see an increase in system (kernel) CPU load. At 88 threads, our system CPU load increased from <5% to 12-15%. strace showed that ~90% of kernel time is spent in futex syscalls. Take a look at what perf shows:
[perf flamegraph: ort_github_total_perf]
I'm assuming that if we had a single shared model, then the contention would be even higher.
There are essentially no other futex syscalls in the whole flamegraph (unfortunately, I cannot share the raw .svg, sorry about that).

Then I stumbled upon the following documentation (the "Share allocator(s) between sessions" part):
https://onnxruntime.ai/docs/get-started/with-c.html#features
I hypothesized that if there's a single shared session object and many threads are calling run on it, then run could be getting stuck on some kind of arena mutex. I then tried changing the application to have a session per worker thread instead of shared ones, roughly as sketched below. If each session has its own local arena, I expected to see increased memory usage but reduced contention.
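The per-thread variant I tried looks roughly like this (again an ort 2.x-style sketch; the thread_local setup is only an illustration of "one session per worker thread"):

    use std::cell::OnceCell;

    use ort::session::Session;

    thread_local! {
        // Each worker thread lazily builds its own session, so each session
        // should own its own arena instead of contending on a shared one.
        static LOCAL_SESSION: OnceCell<Session> = OnceCell::new();
    }

    fn with_local_session<R>(f: impl FnOnce(&Session) -> R) -> R {
        LOCAL_SESSION.with(|cell| {
            let session = cell.get_or_init(|| {
                Session::builder()
                    .expect("create session builder")
                    .commit_from_file("model.onnx")
                    .expect("load model")
            });
            f(session)
        })
    }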
Unfortunately, pretty much nothing changed, and the before/after flamegraphs look more or less identical.

So, I'm not familiar with ONNX Runtime's internals, but could it be that the arena allocator is shared between all sessions by default? Do you think it makes sense to make that configurable? Is it an arena mutex at all, or is my assumption simply wrong? I'm assuming it is an arena mutex because the futex syscalls show up in Value::from_array, drop calls, etc.

Also, somewhat related, take a look at the zoomed-in Session::run:
[flamegraph: ort_github_session_run_zoomin]
There are two Drop::drop calls; zooming in on them:
[flamegraph: ort_github_drops_zoomin]

Again, I'm not familiar with ONNX Runtime's internals, but arenas have to reset their chunk pointer at some point, and when new values are written the old memory simply gets overwritten. As such (at least in other cases where I've used arenas), it makes sense to avoid calling Drop at all. With that in mind, does it make sense to skip calling ReleaseMemoryInfo/ReleaseValue entirely when the allocator is an arena? That could be a nice optimization.

ortsys![unsafe ReleaseMemoryInfo(self.ptr)];

ortsys![unsafe ReleaseValue(ptr)];

Though it would be easy to create UB if the MemoryInfo/Value isn't tied to the arena's lifetime and the memory in the arena gets overwritten.
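To illustrate what I mean by tying the value to the arena's lifetime (none of these types are ort's; this is just a sketch of the idea): a wrapper that borrows the arena cannot outlive it, so the compiler rejects uses after the arena is reset or dropped, and no per-value release call is needed.

    use std::marker::PhantomData;

    // Hypothetical arena type standing in for the runtime's arena allocator.
    struct Arena { /* chunk pointer, capacity, ... */ }

    // A value whose backing memory lives inside the arena. It has no Drop impl:
    // the memory is reclaimed wholesale when the arena resets.
    struct ArenaValue<'arena> {
        ptr: *mut std::ffi::c_void,
        _arena: PhantomData<&'arena Arena>,
    }

    impl Arena {
        fn alloc_value(&self) -> ArenaValue<'_> {
            ArenaValue { ptr: std::ptr::null_mut(), _arena: PhantomData }
        }

        // Resetting takes `&mut self`, so it cannot happen while any
        // `ArenaValue` borrowing this arena is still alive.
        fn reset(&mut self) {}
    }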

Those futex calls are probably from ort itself, as each call to an ONNX Runtime API would (needlessly) lock a Mutex. I removed the mutex in #160; does ort @ 04df44d help with the contention at all?
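Roughly, the shape of that change (simplified; the type names here are placeholders rather than ort's actual internals): a globally locked API pointer is replaced by a once-initialized, lock-free read.

    use std::sync::{Mutex, OnceLock};

    // Placeholder for the ONNX Runtime API function table.
    struct ApiTable;

    // Before: every FFI call fetches the table through a global Mutex, so
    // concurrent Session::run calls serialize on a futex even though the
    // table never changes after initialization. (Shown only for contrast.)
    static API_LOCKED: Mutex<Option<&'static ApiTable>> = Mutex::new(None);

    // After (#160-style): initialize the table once, then read it without locking.
    static API: OnceLock<&'static ApiTable> = OnceLock::new();

    fn api() -> &'static ApiTable {
        // One-time initialization (details elided); every later call is lock-free.
        *API.get_or_init(|| &*Box::leak(Box::new(ApiTable)))
    }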

You're right. I'm not seeing any futex calls in ort stacks anymore. Kernel load is now at ~1-1.5%.

Thank you!