mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

Parallel strided iteration does not scale linearly

mratsim opened this issue · comments

Running the following benchmark on my dual-core machine with hyper-threading enabled:

https://github.com/numforge/laser/blob/d725651c0f8ca1da3f761e27e9846bc81b9341f3/benchmarks/loop_iteration/iter_bench_prod.nim (branch foreach-dsl #4)
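For context, here is a minimal example of how an OpenMP-parallel loop can be written in Nim, using the standard library's || iterator and compiling with --passC:-fopenmp --passL:-fopenmp. This is purely illustrative; the benchmark itself uses laser's own foreach DSL (linked above):

# Minimal OpenMP-parallel loop in Nim via the stdlib || iterator.
# Illustration only: NOT the benchmark's kernel.

proc parallelDouble(x: var seq[float64]) =
  for i in 0 || (x.len - 1):           # OpenMP "parallel for" over 0..len-1
    x[i] = x[i] * 2.0

when isMainModule:
  var v = newSeq[float64](1_000_000)
  for i in 0 ..< v.len: v[i] = float64(i)
  parallelDouble(v)
  echo v[3]                            # expect 6.0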

I get the following results without OpenMP:

Warmup: 1.1911 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 8.931 seconds
Average time: 8.930ms
Stddev  time: 0.345ms
Min     time: 8.723ms
Max     time: 14.785ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 29.597 seconds
Average time: 29.597ms
Stddev  time: 1.125ms
Min     time: 27.002ms
Max     time: 38.606ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

and with OpenMP:

Warmup: 1.1874 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 4.094 seconds
Average time: 4.092ms
Stddev  time: 0.206ms
Min     time: 3.897ms
Max     time: 5.459ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 24.025 seconds
Average time: 24.022ms
Stddev  time: 1.127ms
Min     time: 22.379ms
Max     time: 33.763ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

With OpenMP the first benchmark speeds up by about 2.2x (8.93 ms → 4.09 ms), roughly what two cores should give, while the second improves by only about 1.23x (29.60 ms → 24.02 ms). Potential explanations:

  • hitting the maximum memory bandwidth:
    • Parallel strided iteration also means parallel cache misses.
    • One way to alleviate that would be to use GCC/Clang __builtin_prefetch (see the prefetch sketch after this list).
    • Note: this does not match the experiments in https://github.com/zy97140/omp-benchmark-for-pytorch. Furthermore, computing the next element's location shouldn't require any memory access.
      Unfortunately, even though it requires no memory access, the CPU cannot use instruction-level parallelism to overlap the main computation with computing the next element's location, because that computation branches (see the index-update sketch after this list).
  • accesses to shape and strides for strided iteration are slow:
    • This can happen if the compiler cannot prove that shape/strides are never mutated, and therefore doesn't keep a local copy of them in registers (see the hoisting sketch after this list).
  • strided iteration is running out of registers and/or instruction cache space:
    • Unlikely: those are per-core resources, not per-socket, so contention there wouldn't prevent a linear speedup.
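For the prefetch idea, here is a minimal sketch of wrapping __builtin_prefetch in Nim and prefetching a few iterations ahead in a strided loop. The wrapper, the scaleStrided kernel and the prefetch distance dist are all illustrative assumptions, not laser's actual code:

# Sketch: expose GCC/Clang __builtin_prefetch and use it in a strided loop.

when defined(gcc) or defined(clang):
  proc builtinPrefetch(p: pointer) {.importc: "__builtin_prefetch", nodecl.}
else:
  template builtinPrefetch(p: pointer) = discard      # no-op elsewhere

proc scaleStrided(dst, src: ptr UncheckedArray[float64], n, stride: int) =
  const dist = 8                                      # prefetch distance: a tuning knob
  for i in 0 ..< n:
    if i + dist < n:
      builtinPrefetch(src[(i + dist) * stride].addr)  # warm the cache ahead of use
    dst[i * stride] = src[i * stride] * 2.0

when isMainModule:
  var a = newSeq[float64](100)
  var b = newSeq[float64](100)
  for i in 0 ..< a.len: a[i] = float64(i)
  scaleStrided(cast[ptr UncheckedArray[float64]](b[0].addr),
               cast[ptr UncheckedArray[float64]](a[0].addr),
               n = 50, stride = 2)
  echo b[4]   # expect 8.0 (a[4] * 2)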
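On the branching point: the next-element computation in strided iteration is typically a carry/rollover chain over the dimensions, as in the hypothetical sketch below (not laser's actual iterator). The taken/not-taken branch on each dimension is what keeps the CPU from overlapping this bookkeeping with the main computation:

# Sketch: per-element index update for strided iteration.
# Increment the innermost dimension; on overflow, roll it back to 0
# and carry into the next outer dimension.

proc advance(coords: var openArray[int],
             shape, strides: openArray[int],
             offset: var int) =
  for k in countdown(shape.len - 1, 0):
    inc coords[k]
    offset += strides[k]
    if coords[k] < shape[k]:
      break                            # common case: no carry
    coords[k] = 0                      # overflow: roll this dimension over
    offset -= shape[k] * strides[k]    # undo its contribution, carry outward

when isMainModule:
  var coords = [0, 0]
  var offset = 0
  let shape = [2, 3]
  let strides = [3, 1]                 # contiguous row-major 2x3
  for _ in 1 ..< shape[0] * shape[1]:
    advance(coords, shape, strides, offset)
  echo coords, " offset ", offset      # [1, 2] offset 5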
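And for the shape/strides hypothesis, the usual workaround is to copy them into immutable locals before the hot loop, so the compiler can prove they never change and keep them in registers. Tensor below is a stand-in type for illustration, not laser's:

# Sketch: hoist shape/strides into locals before the hot loop.

type Tensor = object
  shape, strides: array[2, int]
  data: seq[float64]

proc sumStrided(t: Tensor): float64 =
  let
    rows = t.shape[0]      # immutable local copies: the loop below never
    cols = t.shape[1]      # touches the object fields, so the compiler is
    rs = t.strides[0]      # free to keep these in registers
    cs = t.strides[1]
  for i in 0 ..< rows:
    for j in 0 ..< cols:
      result += t.data[i*rs + j*cs]

when isMainModule:
  let t = Tensor(shape: [2, 3], strides: [3, 1],
                 data: @[1.0, 2, 3, 4, 5, 6])
  echo sumStrided(t)   # 21.0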