Version 3: Investigate poor throughput on Skylake.

Question

Mysticial opened this issue 8 years ago

The add/sub benchmark fails to achieve max throughput on Skylake when running single-threaded. Figure out why and fix it.

Robert Schade · Answer 1 · Wed Jul 26 2017 00:34:28 GMT+0800 (China Standard Time)

What is the maximum throughput that you expect for Add/Sub on Skylake?

Alexander Yee · Answer 2 · Wed Jul 26 2017 09:03:07 GMT+0800 (China Standard Time)

On Skylake Desktop (not server), the Haswell binary (FMA3) only seems to get about 80-90% of the theoretical flops for add/sub when running single-threaded. Multi-threaded is fine, since the second hyperthread seems to fill those pipeline bubbles.

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 41.856
    Result = 5.37046e+06

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 21.664
    Result = 2.77755e+06

Single-Precision - 256-bit AVX - Multiply
    GFlops = 50.592
    Result = 6.41972e+06

Double-Precision - 256-bit AVX - Multiply
    GFlops = 26.016
    Result = 3.31828e+06

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 45.12
    Result = 4.8147e+06

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 22.224
    Result = 2.33547e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 107.328
    Result = 6.82334e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 55.392
    Result = 3.54084e+06

Add/Sub, Multiply, and Multiply+Add should all reach the same GFlops for the same-sized datatype. Instead, both Add/Sub and Multiply+Add fall short of Multiply: in the numbers above, Add/Sub is roughly 17% lower and Multiply+Add is roughly 11-15% lower.
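To make the gap concrete, here is a minimal sketch that computes how far each benchmark falls below Multiply, using the GFlops figures from the output above. Treating Multiply as the reference is a convenience for illustration, not a statement about the true hardware peak, and the names `results`, `shortfall_percent`, and `shortfalls` are made up for this example.

```python
# GFlops figures copied from the single-threaded benchmark output above.
results = {
    "single": {"add_sub": 41.856, "mul": 50.592, "mul_add": 45.12},
    "double": {"add_sub": 21.664, "mul": 26.016, "mul_add": 22.224},
}


def shortfall_percent(measured: float, reference: float) -> float:
    """Return how far `measured` falls below `reference`, in percent."""
    return (1.0 - measured / reference) * 100.0


def shortfalls(table):
    """For each precision, compare every benchmark against Multiply.

    Assumption: Multiply is used as the baseline only because it is the
    fastest of the three non-FMA results in the posted output.
    """
    out = {}
    for precision, row in table.items():
        out[precision] = {
            name: round(shortfall_percent(value, row["mul"]), 1)
            for name, value in row.items()
            if name != "mul"
        }
    return out


if __name__ == "__main__":
    for precision, row in shortfalls(results).items():
        for name, pct in row.items():
            print(f"{precision:6s} {name:8s} {pct:5.1f}% below Multiply")
```

Running the script prints one line per precision and benchmark, which makes it easy to re-check the gap after any change to the kernel.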