Mysticial / Flops

How many FLOPS can you achieve?

Version 3: Investigate poor throughput on Skylake.

Mysticial opened this issue

The add/sub benchmark fails to achieve max throughput on Skylake when running single-threaded. Figure out why and fix it.

What is the maximum throughput that you expect for Add/Sub on Skylake?

On Skylake Desktop (not server), the Haswell binary (FMA3) only seems to get about 80-90% of the theoretical FLOPS for add/sub when running single-threaded. Multi-threaded runs are fine, since the second hyperthread seems to fill those pipeline bubbles.

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 41.856
    Result = 5.37046e+06

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 21.664
    Result = 2.77755e+06

Single-Precision - 256-bit AVX - Multiply
    GFlops = 50.592
    Result = 6.41972e+06

Double-Precision - 256-bit AVX - Multiply
    GFlops = 26.016
    Result = 3.31828e+06

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 45.12
    Result = 4.8147e+06

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 22.224
    Result = 2.33547e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 107.328
    Result = 6.82334e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 55.392
    Result = 3.54084e+06

Add/Sub, Multiply, and Multiply + Add should all reach the same throughput for the same-sized datatype, but Add/Sub is about 20% lower and Multiply + Add is about 15% lower.

This affects both Add/Sub and Multiply + Add.
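
To make the gap concrete, here is a minimal sketch that recomputes the shortfall from the GFlops figures in the output above. It treats the Multiply result as the baseline, which is an assumption on my part (Multiply is simply the fastest of the three benchmarks that are expected to match), and the names `reported` and `shortfall_vs_multiply` are made up for this illustration.

```python
# GFlops figures copied from the single-threaded run reported above.
reported = {
    "single": {"Add/Sub": 41.856, "Multiply": 50.592, "Multiply + Add": 45.12},
    "double": {"Add/Sub": 21.664, "Multiply": 26.016, "Multiply + Add": 22.224},
}


def shortfall_vs_multiply(gflops):
    """Return each benchmark's shortfall relative to Multiply, in percent.

    Assumption: Multiply is used as the reference because it is the
    highest of the three results that should be identical.
    """
    baseline = gflops["Multiply"]
    return {
        name: 100.0 * (1.0 - value / baseline)
        for name, value in gflops.items()
        if name != "Multiply"
    }


# Shortfall tables keyed by precision, then by benchmark name.
shortfalls = {
    precision: shortfall_vs_multiply(gflops)
    for precision, gflops in reported.items()
}

if __name__ == "__main__":
    for precision, table in shortfalls.items():
        for name, pct in table.items():
            print(f"{precision:>6} {name:<15} {pct:5.1f}% below Multiply")
```

Measured this way, Add/Sub comes out roughly 17% below Multiply in both precisions, and Multiply + Add lands somewhere around 11% to 15% below, which is in the same range as the rough figures quoted above.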