CoreRT slower than regular .NET

Question

kant2002 opened this issue 4 years ago

I was thinking about checking how CoreRT works for wavelets and decided to use https://github.com/codeprof/TurboWavelets.Net as a starting point.

I migrated the project to the new SDK format and added BenchmarkDotNet using the samples provided.

To my disappointment, regular .NET seems to be faster than CoreRT.

// * Summary *

BenchmarkDotNet=v0.12.1.1420-nightly, OS=Windows 10.0.18363.1082 (1909/November2019Update/19H2)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET 5.0.0 (5.0.20.45114), X64 RyuJIT
  .NET 5.0   : .NET 5.0.0 (5.0.20.45114), X64 RyuJIT
  CoreRt 5.0 : .NET 5.0.29330.02 @BuiltBy: dlab14-DDVSOWINAGE075 @Branch: master @Commit: 145402e00724acbc9e7636739140fb84f7d64845, X64 AOT


|                Method |        Job |    Runtime |      Mean |    Error |   StdDev | Ratio | RatioSD |
|---------------------- |----------- |----------- |----------:|---------:|---------:|------:|--------:|
| Waveletimageupscaling |   .NET 5.0 |   .NET 5.0 | 155.72 ms | 3.267 ms | 9.426 ms |  1.00 |    0.00 |
| Waveletimageupscaling | CoreRt 5.0 | CoreRt 5.0 | 167.68 ms | 3.303 ms | 9.478 ms |  1.08 |    0.09 |
|                       |            |            |           |          |          |       |         |
|      AdaptiveDeadzone |   .NET 5.0 |   .NET 5.0 |  30.40 ms | 0.588 ms | 0.764 ms |  1.00 |    0.00 |
|      AdaptiveDeadzone | CoreRt 5.0 | CoreRt 5.0 |  33.79 ms | 0.683 ms | 1.763 ms |  1.14 |    0.08 |

So I have two general questions.

Are these results expected for CPU-bound workloads?
What can I do to look more closely at this particular case?

Michal Strehovský · Answer 1 · Mon Oct 05 2020 01:38:20 GMT+0800 (China Standard Time)

For compute-heavy workloads that don't use things like HW intrinsics, I would expect both to be pretty much on par, since the codegen is the same.

I would run both under PerfView and check:

- GC Stats: does the GC do more work in one of them?
- CPU samples: are the same methods hot? Is there something that stands out? If so, I would check the disassembly on both and compare whether we got worse codegen somewhere.

Jan Kotas · Answer 2 · Mon Oct 05 2020 09:34:33 GMT+0800 (China Standard Time)

It is not unusual for the performance of CPU-bound microbenchmarks to be sensitive to memory alignment, code alignment, or other factors, which results in trends like this: dotnet/runtime#39031 (comment). This can be one of these bi-modal cases, and you may just be hitting the lucky/unlucky spots on the spectrum.
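As a rough arithmetic check on the numbers in the question, the sketch below recomputes the CoreRT/.NET ratio of means and tests whether the two `Mean ± Error` intervals overlap. It assumes only what BenchmarkDotNet documents in its legend, namely that `Error` is half of the 99.9% confidence interval. The function names (`interval`, `compare`, `summarize`) are hypothetical helpers made up for this illustration and are not part of any library.

```python
# Values (in ms) copied from the BenchmarkDotNet summary table in the question.
# Each entry is (Mean, Error); BenchmarkDotNet defines Error as half of the
# 99.9% confidence interval of the mean.
RESULTS = {
    "Waveletimageupscaling": {".NET 5.0": (155.72, 3.267), "CoreRt 5.0": (167.68, 3.303)},
    "AdaptiveDeadzone": {".NET 5.0": (30.40, 0.588), "CoreRt 5.0": (33.79, 0.683)},
}


def interval(mean, error):
    """Return the (low, high) bounds of Mean +/- Error."""
    return (mean - error, mean + error)


def compare(baseline, candidate):
    """Compare two (Mean, Error) pairs from the same benchmark run."""
    b_lo, b_hi = interval(*baseline)
    c_lo, c_hi = interval(*candidate)
    return {
        "ratio": candidate[0] / baseline[0],   # ratio of means, not mean of ratios
        "gap_ms": candidate[0] - baseline[0],  # absolute difference of means
        "intervals_overlap": c_lo <= b_hi and b_lo <= c_hi,
    }


def summarize(results=RESULTS):
    """Compare CoreRT against .NET 5.0 for every benchmark method."""
    return {name: compare(r[".NET 5.0"], r["CoreRt 5.0"]) for name, r in results.items()}


if __name__ == "__main__":
    for name, s in summarize().items():
        print(f"{name}: ratio={s['ratio']:.3f} gap={s['gap_ms']:.2f} ms "
              f"overlap={s['intervals_overlap']}")
```

A non-overlapping pair of intervals only says that the gap was stable within that one run. It cannot tell whether a rebuild or rerun that shifts code or memory alignment would land on a different mode, so repeating the whole benchmark several times is still the way to check for the bi-modal behaviour described above.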

Another potential source of the difference is that the RyuJIT in dotnet/corert is several months old at this point. It is possible that the RyuJIT shipping in .NET 5 has bug fixes that make a difference for this micro-benchmark. This will get fixed once we migrate the project to dotnet/runtimelab and pick up an up-to-date RyuJIT.

> What can I do to look more closely at this particular case?

Michal's advice in #8354 (comment) is spot on.

Andrii Kurdiumov · Answer 3 · Tue Oct 06 2020 22:14:09 GMT+0800 (China Standard Time)

@jkotas Thanks for the explanation of the potential root causes. I thought this might be related to the fact that this is a micro-benchmark, but I did not think it might be due to changes in the runtime.

@MichalStrehovsky I will try to take a look. Since my priority was to find an interesting use case where CoreRT does better than regular .NET, I will have to scratch my head a bit to find one.

@RUSshy you can see my benchmarks here: https://github.com/kant2002/TurboWavelets.Net/tree/kant/benchmarks. These are pretty trivial microbenchmarks, not the actual project where I might see some gains.