nim-works / cps

Continuation-Passing Style for Nim


Performance benchmarks of tight loops etc.

dumblob opened this issue · comments

I wonder how this Nim CPS implementation performs in tight loops and similar "critical" contexts. We can discuss theoretical performance, but I'm really interested in real measurements compared to non-CPS compiled binaries.

Has anyone already run such tight-loop benchmarks? What were the results?

And if anyone had benchmarks of multicore apps shuffling data back & forth between cores, that'd be even better!


Note, I'm not asking for rigorous benchmarking. I just want a glimpse of how well current compilers (LLVM, GCC) can optimize the code generated by the CPS macros, what the practical impact is, and whether it varies wildly by scenario or is a fairly stable, near-constant improvement (or regression) regardless of scenario.

I can only speak to shuffling data in a multithreaded context: we are working on using loony together with a memory-safety implementation that assures temporal validity of the memory.

However, that implementation will need some time to build, and then to benchmark against alternatives, before we incorporate it.

https://github.com/nim-works/cps/blob/master/stash/performance.nim

Newer compilers are starting to demonstrate significant gains over closure iterators; I think you start to see notable numbers at around LLVM/GCC 9 or so and the more recent stuff is faster.

```
cps iterator: 7.300357578 s
-3888678242538471851
closure iterator: 5.350151789 s
7679760801092768143
```

We lost our speed advantage, looks like. 😆

Or maybe I'm running it wrong somehow. But this is probably due to the recent workaround for the compiler fix for type conversions. I guess we really need to start monitoring performance continuously.