flame / blis

BLAS-like Library Instantiation Software Framework

Performance issue of dgemm on Gold 6230R CPU

FuncJ opened this issue · comments

Hi, I have some questions about the performance of dgemm on an Intel(R) Xeon(R) Gold 6230R CPU. I followed the documentation at https://github.com/flame/blis/blob/master/docs/Performance.md. I just want to compare the dgemm performance on my machine with the skylakex results shown in that documentation. More detailed information is below.

My Machine
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2100.000
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103

Core topology: two sockets, 26 cores per socket, 52 cores total
SMT status: enabled, but not utilized
Max clock rate: 2.1GHz (single-core and multicore)
Peak performance:
--single-core: 67.2 GFLOPS (double precision)
--multicore: 67.2 GFLOPS/core (double precision)
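(Presumably derived as 2.1 GHz × 2 AVX-512 FMA units × 8 doubles per vector × 2 flops per FMA = 67.2 GFLOPS per core.)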
I have fixed the CPU frequency at 2.1GHz with the commands sudo cpupower -c all frequency-set -u 2.1GHz and sudo cpupower -c all frequency-set -d 2.1GHz.

The BLIS version number
0.9.0 (commit 0ab20c0)

How I configured BLIS
./configure --enable-cblas -t openmp -p /my_path_to/blis skx

(Screenshots attached.)

OS
Linux version 3.10.0-1127.10.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) )

Compiler
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC). I upgraded the compiler from the system default to 8.3.1.

Code sample
I based my test driver on your example code and modified it slightly to test performance on my machine. Thanks a lot. Here is the command I use to compile it:
gcc -O2 -o test_bli_dgemm.x test_bli_dgemm.c -I /my_path_to/blis /my_path_to/blis/lib/libblis.a -lm -fopenmp

#include <stdio.h>
#include <stdlib.h>   // for atoi(), malloc(), free()
#include <sys/time.h>

#include "blis.h"

int main( int argc, char** argv )
{
      dim_t m, n, k;
      inc_t rsa, csa;
      inc_t rsb, csb;
      inc_t rsc, csc;
      
      double *a;
      double *b;
      double *c;
      double alpha, beta;
      
      // Initialize some basic constants.
      double zero = 0.0;
      double one  = 1.0;
      
      // Create some matrix and vector operands to work with.
      m = atoi(argv[1]);
      n = atoi(argv[2]);
      k = atoi(argv[3]);
      
      rsc = 1; csc = m;
      rsa = 1; csa = m;
      rsb = 1; csb = k;
      c = malloc( m * n * sizeof( double ) );
      a = malloc( m * k * sizeof( double ) );
      b = malloc( k * n * sizeof( double ) );
      
      // Set the scalars to use.
      alpha = 1.2;
      beta  = 0.02;
      
      // Initialize the matrix operands.
      bli_drandm( 0, BLIS_DENSE, m, k, a, rsa, csa );
      bli_dsetm( BLIS_NO_CONJUGATE, 0, BLIS_NONUNIT_DIAG, BLIS_DENSE,
           k, n, &one, b, rsb, csb );
      bli_dsetm( BLIS_NO_CONJUGATE, 0, BLIS_NONUNIT_DIAG, BLIS_DENSE,
           m, n, &zero, c, rsc, csc );
      
      printf("m=%d,n=%d,k=%d,alpha=%lf,beta=%lf\n",m,n,k,alpha,beta);
      
      
      int loop = 10;
      struct timeval start,finish;
      double duration, shortest_time = 10e9;
      
      // c := beta * c + alpha * a * b, where 'a', 'b', and 'c' are general.
      for(int i = 0; i < loop; ++ i)
      {
            gettimeofday(&start, NULL);
            bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                       m, n, k, &alpha, a, rsa, csa, b, rsb, csb,
                                &beta, c, rsc, csc );
            gettimeofday(&finish, NULL);
            
            duration = ((double)(finish.tv_sec-start.tv_sec)*1000000 + (double)(finish.tv_usec-start.tv_usec)) / 1000000;
            
            if(duration < shortest_time)
                shortest_time = duration;
      }
      
      double gflops = 2.0 * m * n * k;
      gflops = gflops/shortest_time *1.0e-9;
      
      FILE *fp;
      fp = fopen("timeDGEMM.txt", "a");
      fprintf(fp, "%dx%dx%d\t%lf s\t%lf GFLOPS\n", m, n, k, shortest_time , gflops);
      fclose(fp);
      
      // Free the memory obtained via malloc().
      free( a );
      free( b );
      free( c );
      
      return 0;
}

The dgemm performance on my machine

  • Single-threaded (1 core) execution

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.273947s, GFLOPS=58.405458, Efficiency=58.4/67.2=0.8690
I think my result is roughly the same as the result shown in the documentation.

  • Multithreaded (26 core) execution requested via export BLIS_JC_NT=2 BLIS_IC_NT=13 and export GOMP_CPU_AFFINITY="0-25:1"

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.016715s, GFLOPS/core=957.224050/26=36.81615, Efficiency=36.81615/67.2=0.5478
I think my experimental result is a little bit different from what the documentation shows.

./test_bli_dgemm.x 4000 4000 4000
m=n=k=4000, shortest time=0.317475s, GFLOPS/core=423.002058/26=16.2693, Efficiency=16.2693/67.2=0.242
I think my experimental result is completely different from what the documentation shows.

  • Multithreaded (52 core) execution requested via export BLIS_JC_NT=4 BLIS_IC_NT=13 and export GOMP_CPU_AFFINITY="0-51:1"

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.013306s, GFLOPS/core=1202.465053/52=23.124327, Efficiency=23.124327/67.2=0.344
I think my experimental result is completely different from what the documentation shows.

./test_bli_dgemm.x 4000 4000 4000
m=n=k=4000, shortest time=0.362070s, GFLOPS/core=421.2220903/52=8.1004, Efficiency=8.1004/67.2=0.12054
I think my experimental result is completely different from what the documentation shows.
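
(For context: the total number of BLIS threads is the product of the per-loop variables, e.g. BLIS_JC_NT=4 × BLIS_IC_NT=13 = 52 threads above. An alternative documented in docs/Multithreading.md is to set a single variable, e.g. export BLIS_NUM_THREADS=52, and let BLIS choose the factorization.)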

My Questions

  1. Why does the performance drop by roughly half when I increase the matrix size?
  2. Why does doubling the number of threads not give an obvious performance improvement?

Thanks a lot.

Yes this is quite odd. Some things to try:

  1. Use beta = 0.0
  2. Zero out A and B
  3. Try BLIS_JR_NT=XXX instead of BLIS_JC_NT.
  4. Make a more detailed plot of performance vs. m=n=k (e.g. 200 to 4000 in steps of 200); a sample sweep command is sketched after this comment. This might reveal patterns which point to the problem.

Also, it looks like your thread pinning is correct but you might try OMP_PLACES=CORES OMP_PROC_BIND=CLOSE.
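
(A simple way to collect such a sweep with the driver above, which already appends each result to timeDGEMM.txt, is a shell loop along these lines:)

for s in $(seq 200 200 4000); do ./test_bli_dgemm.x $s $s $s; done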

I agree that your results are puzzling, @FuncJ. I can't see anything that is super concerning to me in the test driver code. That leaves many other variables, though.

A few suggestions:

  1. I notice you use gettimeofday(). Perhaps you can try doing what we do, which is using bli_clock() and bli_clock_min_diff(), both of which use clock_gettime(). You can look at test/test_gemm.c to see these functions in action (a minimal sketch also appears after this list). I wouldn't expect this to make much of a difference, but it would be nice to rule it out.
  2. I couldn't help but notice that while you fixed the CPU clock frequency, it didn't appear that you changed the CPU frequency governor. When I run my experiments, I always do both (changing the governor to performance). Again, probably not an issue, but let's try to rule it out.
  3. As @devinamatthews said, beta == 0 would be interesting because it would lower the bandwidth requirements for accessing C (in case that is somehow a bottleneck on your system), since C is only written rather than read+written.
  4. I 100% agree with @devinamatthews that performance plots tell a richer story than single data points. The shape of the curve will be interesting to see.
  5. I also think it would be interesting to plot performance curves for various numbers of threads/cores. What happens when n_threads is 2, 4, 6, 8, or 12? You can use m,n,k=2000, or as high as you can go without it taking forever to finish.
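
(A minimal sketch of suggestion 1, reusing the operands and loop count of the driver posted above; bli_clock() returns a timestamp in seconds, and bli_clock_min_diff() keeps the smaller of its first argument and the time elapsed since the given start timestamp:)

      double dtime;
      double dtime_min = 1.0e9;   // any sufficiently large initial value

      for ( int i = 0; i < loop; ++i )
      {
            dtime = bli_clock();

            bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                       m, n, k, &alpha, a, rsa, csa, b, rsb, csb,
                                &beta, c, rsc, csc );

            // Keep the smaller of dtime_min and (bli_clock() - dtime).
            dtime_min = bli_clock_min_diff( dtime_min, dtime );
      }

      double gflops = ( 2.0 * m * n * k ) / ( dtime_min * 1.0e9 );

(For suggestion 2, the governor itself can be switched with sudo cpupower -c all frequency-set -g performance.)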

Thanks a lot for your suggestions. I will follow your suggestions and redesign the experiment.

Can you post a graph please?

(Performance graphs for 26-core and 52-core runs attached.)

This time I followed the documentation at https://github.com/flame/blis/blob/master/docs/Performance.md and used the code you provide in the test/3 directory. The peak performance of a core is 64 GFLOPS (double precision). I really don't know which crucial experimental step I got wrong. Thanks.

I'm not sure you got anything wrong. This is quite odd though. @field? @jdiamondGitHub?

I will note that the performance collapse seems to occur just when the size of one matrix reaches 64MiB. Accesses to DRAM vs. L3 may be the bottleneck at larger sizes.
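
(For m = n = k, one double-precision matrix occupies 8n² bytes, so 64 MiB corresponds to n ≈ 2900; for comparison, the combined L3 of the two sockets is roughly 2 × 35.75 MiB ≈ 71.5 MiB.)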

I am also very suspicious that the achieved performance is very close to 50% of peak. I'm wondering if a) BLIS is not actually using the skx kernel somehow, or b) the machine is not running with 2 AVX-512 VPUs for some reason. In the latter case, there is a piece of assembly code which I cannot seem to find (@jeffhammond?) which can tell you for sure how many VPUs are active.

6230 should have 2. Refresh (6230R) shouldn't change it.

The empirical.x program in the vpu-count repo will give a definitive answer. I can't think of any other reasons why performance is so poor.

(Performance graph for the 8-core run attached.)

Thanks a lot for your suggestions. I have run the empirical.x program and its output is 'vpu=2'. I also inserted a printf statement into bli_dgemm_skx_asm_16x14.c and recompiled the BLIS source code to confirm that the skx kernel is being called. This graph shows the performance when the number of cores is 8. I guess that memory bandwidth and cache may be the performance bottlenecks.

OK. What is your memory configuration (# and arrangement of DIMMs)?

I referred to the information on https://en.wikichip.org/wiki/intel/xeon_gold/6230r, which lists a max memory bandwidth of 131.13 GiB/s. Could you please tell me which command I should use?

If you have administrator access, then sudo dmidecode -t 17, otherwise you'll have to ask your system administrator. I ask because that chip has 6-channel memory, so you need exactly 12 (or 24) identical memory DIMMs, balanced 6 (or 12) on each socket, to fully saturate the memory bandwidth. You can also measure the memory bandwidth empirically using the STREAM benchmark. Hopefully your system is set up properly.
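
(For example, the populated slots can be summarized from that output with something like sudo dmidecode -t 17 | grep -E 'Size|Locator|Speed'.)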

That's plenty of bandwidth. Have you tried MKL? I'm assuming at this point that it stomps all over BLIS...

Those STREAM numbers are bogus. You cannot get that from DRAM. You set the size too small and it fits in cache.
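
(STREAM's array size is a compile-time constant, and each array needs to be several times larger than the ~71.5 MiB of combined L3 here for the result to measure DRAM. Assuming the stock stream.c, a suitable build might look like gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 stream.c -o stream.)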

@FuncJ I guess you deleted your post with the MKL numbers? I have no problem with posting those here. From those results, though, since they more or less track the BLIS curve while sitting ~15-20% higher, I think this is unlikely to be a BLIS-specific problem. My money is on a hardware misconfiguration (e.g. the distribution of DIMMs that I mentioned above) or maybe a thermal issue? You might get some help on the Intel or MKL forums. Good luck!

I'm going to close this unless someone can reproduce on another system.