Version 3: Add support for multiple processor groups in Windows.

Question

Mysticial opened this issue 8 years ago

Alexander Yee commented 8 years ago

Break the 64 thread limit.
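For context, the limit comes from Windows processor groups: a single group holds at most 64 logical processors, and a program that is not group-aware runs its threads inside one group. The sketch below is only back-of-the-envelope arithmetic, not code from this project. The function names are hypothetical, and 144 is the logical processor count of the machine reported later in this thread. How Windows actually assigns processors to groups depends on the machine's topology, so a real group can be smaller than 64.

```python
import math

# Windows places at most 64 logical processors in one processor group.
MAX_GROUP_SIZE = 64


def min_processor_groups(logical_processors: int) -> int:
    """Smallest number of groups that could hold this many logical processors."""
    return math.ceil(logical_processors / MAX_GROUP_SIZE)


def single_group_upper_bound(logical_processors: int) -> int:
    """Most logical processors a program confined to one group could use."""
    return min(logical_processors, MAX_GROUP_SIZE)


if __name__ == "__main__":
    total = 144  # logical processors on the machine reported in this thread
    print("groups needed (at least):", min_processor_groups(total))
    print("usable from one group (at most):", single_group_upper_bound(total))
```

The second number is only an upper bound: a group that maps to a single socket can contain fewer than 64 logical processors.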

Jawad Al Shaikh · Answer 1 · Sat Jun 17 2017 13:14:17 GMT+0800 (China Standard Time)

Any estimate of when this feature will be ready?

I have 4 Xeon E7-8890 v3 processors (sockets) at 2.5 GHz (total cores: 72, total logical processors: 144).

I ran ./master/version3/binaries-windows/2013-Haswell.exe

Only one socket (36 logical processors) was fully utilized in the test, as shown below:

Running Haswell tuned binary with 1 thread...

Single-Precision - 128-bit AVX - Add/Sub
    GFlops = 10.24
    Result = 1.31089e+06

Double-Precision - 128-bit AVX - Add/Sub
    GFlops = 5.104
    Result = 653652

Single-Precision - 128-bit AVX - Multiply
    GFlops = 18.096
    Result = 2.31105e+06

Double-Precision - 128-bit AVX - Multiply
    GFlops = 8.904
    Result = 1.13709e+06

Single-Precision - 128-bit AVX - Multiply + Add
    GFlops = 20.496
    Result = 2.17797e+06

Double-Precision - 128-bit AVX - Multiply + Add
    GFlops = 10.248
    Result = 1.08383e+06

Single-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 40.128
    Result = 2.5594e+06

Double-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 20.112
    Result = 1.29547e+06

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 20.16
    Result = 2.58065e+06

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 10.016
    Result = 1.26631e+06

Single-Precision - 256-bit AVX - Multiply
    GFlops = 33.504
    Result = 4.32432e+06

Double-Precision - 256-bit AVX - Multiply
    GFlops = 16.272
    Result = 2.09449e+06

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 40.128
    Result = 4.30386e+06

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 19.824
    Result = 2.0936e+06

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 79.488
    Result = 5.09505e+06

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 39.648
    Result = 2.53214e+06


Running Haswell tuned binary with 36 thread(s)...

Single-Precision - 128-bit AVX - Add/Sub
    GFlops = 189.344
    Result = 2.42138e+07

Double-Precision - 128-bit AVX - Add/Sub
    GFlops = 102.528
    Result = 1.30681e+07

Single-Precision - 128-bit AVX - Multiply
    GFlops = 402.48
    Result = 5.13804e+07

Double-Precision - 128-bit AVX - Multiply
    GFlops = 201.6
    Result = 2.57055e+07

Single-Precision - 128-bit AVX - Multiply + Add
    GFlops = 409.488
    Result = 4.35371e+07

Double-Precision - 128-bit AVX - Multiply + Add
    GFlops = 203.016
    Result = 2.16019e+07

Single-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 737.184
    Result = 4.70881e+07

Double-Precision - 128-bit FMA3 - Fused Multiply Add
    GFlops = 366.816
    Result = 2.33993e+07

Single-Precision - 256-bit AVX - Add/Sub
    GFlops = 367.424
    Result = 4.69662e+07

Double-Precision - 256-bit AVX - Add/Sub
    GFlops = 183.424
    Result = 2.33839e+07

Single-Precision - 256-bit AVX - Multiply
    GFlops = 709.824
    Result = 9.05564e+07

Double-Precision - 256-bit AVX - Multiply
    GFlops = 355.152
    Result = 4.52615e+07

Single-Precision - 256-bit AVX - Multiply + Add
    GFlops = 698.88
    Result = 7.43055e+07

Double-Precision - 256-bit AVX - Multiply + Add
    GFlops = 361.92
    Result = 3.85335e+07

Single-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 1469.95
    Result = 9.36513e+07

Double-Precision - 256-bit FMA3 - Fused Multiply Add
    GFlops = 730.176
    Result = 4.65468e+07


Press any key to continue . . .

Questions:

1. In the multi-threaded output, does the GFlops value represent the average per thread or the total?
2. Using IPDT64 (Revision 4.0.0.29) on the same machine, I ran the following twice:

math_fp.exe -s 10 -resultName math_fp_Test.txt

--- Floating Point Test ---
...
Version: 1.0.11.64b.W
...

AVX is supported in your OS
Max AVX supported = AVX2
FMA3 supported
MFLOPS            CycleRun       Error       Time(sec)
118.04              16098          0            10

Million Floating Points per Second, MFLOPS --> 118.04
Error --> 0

Floating Point Test Passed!!!

--- Floating Point Test ---
...
Version: 1.0.11.64b.W
...

AVX is supported in your OS
Max AVX supported = AVX2
FMA3 supported
MFLOPS            CycleRun       Error       Time(sec)
36.335              14093          0            10

Million Floating Points per Second, MFLOPS --> 36.335
Error --> 0

Floating Point Test Passed!!!

Going from 118 to 36 MFLOPS is a big drop! With your long experience with FLOPS, is that normal, or does it indicate a bug in Intel's math_fp.exe? (Yes, I should ask Intel about that, but there is no harm in hearing your opinion.)

Alexander Yee · Answer 2 · Sat Jun 17 2017 15:23:47 GMT+0800 (China Standard Time)

Any estimate of when this feature will be ready?

No idea yet. I'm not actively working on this atm. This is not a trivial feature and I'm not sure when I'll have the time to loop back to this.

In the multi-threaded output, does the GFlops value represent the average per thread or the total?

Total of all threads.
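As a rough illustration of reading the numbers this way, the sketch below divides two of the 36-thread totals from the output above by the thread count and compares them with the 1-thread results. The figures are copied from the posted run. The helper names are made up for this example and are not part of the benchmark.

```python
THREADS = 36

# (benchmark, 1-thread GFlops, 36-thread total GFlops), copied from the run above
RESULTS = [
    ("Single-Precision - 256-bit FMA3", 79.488, 1469.95),
    ("Double-Precision - 256-bit FMA3", 39.648, 730.176),
]


def per_thread_average(total_gflops: float, threads: int) -> float:
    """Average GFlops per thread when the reported figure is the total."""
    return total_gflops / threads


def scaling_factor(total_gflops: float, single_thread_gflops: float) -> float:
    """How many times faster the multi-threaded total is than one thread."""
    return total_gflops / single_thread_gflops


if __name__ == "__main__":
    for name, single, total in RESULTS:
        avg = per_thread_average(total, THREADS)
        scale = scaling_factor(total, single)
        print(f"{name}: {avg:.2f} GFlops per thread, {scale:.1f}x one thread")
```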

With your long experience with FLOPS, is that normal, or does it indicate a bug in Intel's math_fp.exe?

I'm not at all familiar with that benchmark so I wouldn't know. It's probably a bug in the benchmark.