flame / blis

BLAS-like Library Instantiation Software Framework

Performance issue of dgemm on Gold 6230R CPU

FuncJ opened this issue · comments

Hi, I have some questions about the performance of dgemm on an Intel(R) Xeon(R) Gold 6230R CPU. I followed the documentation at https://github.com/flame/blis/blob/master/docs/Performance.md. I just want to compare the dgemm performance on my machine with the skylakex results shown in that documentation. More detailed information is below.

My Machine
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2100.000
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103

Core topology: two sockets, 26 cores per socket, 52 cores total
SMT status: enabled, but not utilized
Max clock rate: 2.1GHz (single-core and multicore)
Peak performance:
--single-core: 67.2 GFLOPS (double precision)
--multicore: 67.2 GFLOPS/core (double precision)
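(Presumably derived as 2.1 GHz × 2 AVX-512 FMA units × 8 doubles per vector × 2 flops per FMA = 67.2 GFLOPS per core.)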
I have fixed the CPU frequency at 2.1GHz with the commands sudo cpupower -c all frequency-set -u 2.1GHz and sudo cpupower -c all frequency-set -d 2.1GHz.

The BLIS version number
0.9.0 (commit 0ab20c0)

How I configured BLIS
./configure --enable-cblas -t openmp -p /my_path_to/blis skx

(Screenshots attached.)

OS
Linux version 3.10.0-1127.10.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) )

Compiler
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC). I upgraded the compiler from the system default to 8.3.1.

Code sample
I based my test driver on your example code and modified it slightly to test performance on my machine. Thanks a lot. Here is the command I use to compile it:
gcc -O2 -o test_bli_dgemm.x test_bli_dgemm.c -I /my_path_to/blis /my_path_to/blis/lib/libblis.a -lm -fopenmp

#include <stdio.h>
#include <stdlib.h>   // for atoi(), malloc(), free()
#include <sys/time.h>

#include "blis.h"

int main( int argc, char** argv )
{
      dim_t m, n, k;
      inc_t rsa, csa;
      inc_t rsb, csb;
      inc_t rsc, csc;
      
      double *a;
      double *b;
      double *c;
      double alpha, beta;
      
      // Initialize some basic constants.
      double zero = 0.0;
      double one  = 1.0;
      
      // Create some matrix and vector operands to work with.
      m = atoi(argv[1]);
      n = atoi(argv[2]);
      k = atoi(argv[3]);
      
      rsc = 1; csc = m;
      rsa = 1; csa = m;
      rsb = 1; csb = k;
      c = malloc( m * n * sizeof( double ) );
      a = malloc( m * k * sizeof( double ) );
      b = malloc( k * n * sizeof( double ) );
      
      // Set the scalars to use.
      alpha = 1.2;
      beta  = 0.02;
      
      // Initialize the matrix operands.
      bli_drandm( 0, BLIS_DENSE, m, k, a, rsa, csa );
      bli_dsetm( BLIS_NO_CONJUGATE, 0, BLIS_NONUNIT_DIAG, BLIS_DENSE,
           k, n, &one, b, rsb, csb );
      bli_dsetm( BLIS_NO_CONJUGATE, 0, BLIS_NONUNIT_DIAG, BLIS_DENSE,
           m, n, &zero, c, rsc, csc );
      
      printf("m=%d,n=%d,k=%d,alpha=%lf,beta=%lf\n",m,n,k,alpha,beta);
      
      
      int loop = 10;
      struct timeval start,finish;
      double duration, shortest_time = 10e9;
      
      // c := beta * c + alpha * a * b, where 'a', 'b', and 'c' are general.
      for(int i = 0; i < loop; ++ i)
      {
            gettimeofday(&start, NULL);
            bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                       m, n, k, &alpha, a, rsa, csa, b, rsb, csb,
                                &beta, c, rsc, csc );
            gettimeofday(&finish, NULL);
            
            duration = ((double)(finish.tv_sec-start.tv_sec)*1000000 + (double)(finish.tv_usec-start.tv_usec)) / 1000000;
            
            if(duration < shortest_time)
                shortest_time = duration;
      }
      
      double gflops = 2.0 * m * n * k;
      gflops = gflops/shortest_time *1.0e-9;
      
      FILE *fp;
      fp = fopen("timeDGEMM.txt", "a");
      fprintf(fp, "%dx%dx%d\t%lf s\t%lf GFLOPS\n", m, n, k, shortest_time , gflops);
      fclose(fp);
      
      // Free the memory obtained via malloc().
      free( a );
      free( b );
      free( c );
      
      return 0;
}

The dgemm performance on my machine

  • Single-threaded (1 core) execution

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.273947s, GFLOPS=58.405458, Efficiency=58.4/67.2=0.8690
I think my result is roughly the same as the result shown in the documentation.

  • Multithreaded (26 core) execution requested via export BLIS_JC_NT=2 BLIS_IC_NT=13 and export GOMP_CPU_AFFINITY="0-25:1"

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.016715s, GFLOPS/core=957.224050/26=36.81615, Efficiency=36.81615/67.2=0.5478
I think my experimental result is a little bit different from what the documentation shows.

./test_bli_dgemm.x 4000 4000 4000
m=n=k=4000, shortest time=0.317475s, GFLOPS/core=423.002058/26=16.2693, Efficiency=16.2693/67.2=0.242
I think my experimental result is completely different from what the documentation shows.

  • Multithreaded (52 core) execution requested via export BLIS_JC_NT=4 BLIS_IC_NT=13 and export GOMP_CPU_AFFINITY="0-51:1"

./test_bli_dgemm.x 2000 2000 2000
m=n=k=2000, shortest time=0.013306s, GFLOPS/core=1202.465053/52=23.124327, Efficiency=23.124327/67.2=0.344
I think my experimental result is completely different from what the documentation shows.

./test_bli_dgemm.x 4000 4000 4000
m=n=k=4000, shortest time=0.362070s, GFLOPS/core=421.2220903/52=8.1004, Efficiency=8.1004/67.2=0.12054
I think my experimental result is completely different from what the documentation shows.
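
(For context: the total number of BLIS threads is the product of the per-loop variables, e.g. BLIS_JC_NT=4 × BLIS_IC_NT=13 = 52 threads above. An alternative documented in docs/Multithreading.md is to set a single variable, e.g. export BLIS_NUM_THREADS=52, and let BLIS choose the factorization.)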

My Questions

  1. Why does the performance drop by roughly half when I increase the matrix size?
  2. Why does doubling the number of threads not give an obvious performance improvement?

Thanks a lot.

Yes this is quite odd. Some things to try:

  1. Use beta = 0.0
  2. Zero out A and B
  3. Try BLIS_JR_NT=XXX instead of BLIS_JC_NT.
  4. Make a more detailed plot of performance vs. m=n=k (e.g. 200 to 4000 in steps of 200); a sample sweep command is sketched after this comment. This might reveal patterns which point to the problem.

Also, it looks like your thread pinning is correct but you might try OMP_PLACES=CORES OMP_PROC_BIND=CLOSE.
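
(A simple way to collect such a sweep with the driver above, which already appends each result to timeDGEMM.txt, is a shell loop along these lines:)

for s in $(seq 200 200 4000); do ./test_bli_dgemm.x $s $s $s; done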

I agree that your results are puzzling, @FuncJ. I can't see anything that is super concerning to me in the test driver code. That leaves many other variables, though.

A few suggestions:

  1. I notice you use gettimeofday(). Perhaps you can try doing what we do, which is using bli_clock() and bli_clock_min_diff(), both of which use clock_gettime(). You can look at test/test_gemm.c to see these functions in action (a minimal sketch also appears after this list). I wouldn't expect this to make much of a difference, but it would be nice to rule it out.
  2. I couldn't help but notice that while you fixed the CPU clock frequency, it didn't appear that you changed the CPU frequency governor. When I run my experiments, I always do both (changing the governor to performance). Again, probably not an issue, but let's try to rule it out.
  3. As @devinamatthews said, beta == 0 would be interesting because it would lower the bandwidth requirements for accessing C (in case that is somehow a bottleneck on your system), since C is only written rather than read+written.
  4. I 100% agree with @devinamatthews that performance plots tell a richer story than single data points. The shape of the curve will be interesting to see.
  5. I also think it would be interesting to plot performance curves for various numbers of threads/cores. What happens when n_threads is 2, 4, 6, 8, or 12? You can use m,n,k=2000, or as high as you can go without it taking forever to finish.
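
(A minimal sketch of suggestion 1, reusing the operands and loop count of the driver posted above; bli_clock() returns a timestamp in seconds, and bli_clock_min_diff() keeps the smaller of its first argument and the time elapsed since the given start timestamp:)

      double dtime;
      double dtime_min = 1.0e9;   // any sufficiently large initial value

      for ( int i = 0; i < loop; ++i )
      {
            dtime = bli_clock();

            bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                       m, n, k, &alpha, a, rsa, csa, b, rsb, csb,
                                &beta, c, rsc, csc );

            // Keep the smaller of dtime_min and (bli_clock() - dtime).
            dtime_min = bli_clock_min_diff( dtime_min, dtime );
      }

      double gflops = ( 2.0 * m * n * k ) / ( dtime_min * 1.0e9 );

(For suggestion 2, the governor itself can be switched with sudo cpupower -c all frequency-set -g performance.)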

Thanks a lot for your suggestions. I will follow your suggestions and redesign the experiment.

Can you post a graph please?

(Performance graphs for 26-core and 52-core runs attached.)

This time I followed the documentation at https://github.com/flame/blis/blob/master/docs/Performance.md and used the code you provide in the test/3 directory. The peak performance of a core is 64 GFLOPS (double precision). I really don't know which crucial experimental step I got wrong. Thanks.

I'm not sure you got anything wrong. This is quite odd though. @field? @jdiamondGitHub?

I will note that the performance collapse seems to occur just when the size of one matrix reaches 64MiB. Accesses to DRAM vs. L3 may be the bottleneck at larger sizes.
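
(For m = n = k, one double-precision matrix occupies 8n² bytes, so 64 MiB corresponds to n ≈ 2900; for comparison, the combined L3 of the two sockets is roughly 2 × 35.75 MiB ≈ 71.5 MiB.)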

I am also very suspicious that the achieved performance is very close to 50% of peak. I'm wondering if a) BLIS is not actually using the skx kernel somehow, or b) the machine is not running with 2 AVX-512 VPUs for some reason. In the latter case, there is a piece of assembly code which I cannot seem to find (@jeffhammond?) which can tell you for sure how many VPUs are active.

6230 should have 2. Refresh (6230R) shouldn't change it.

The empirical.x program in the vpu-count repo will give a definitive answer. I can't think of any other reasons why performance is so poor.

(Performance graph for the 8-core run attached.)

Thanks a lot for your suggestions. I have run the empirical.x program and its output is 'vpu=2'. I also inserted a printf statement into bli_dgemm_skx_asm_16x14.c and recompiled the BLIS source code to confirm that the skx kernel is being called. This graph shows the performance when the number of cores is 8. I guess that memory bandwidth and cache may be the performance bottlenecks.

OK. What is your memory configuration (# and arrangement of DIMMs)?

I referred to the information on https://en.wikichip.org/wiki/intel/xeon_gold/6230r, which lists a max memory bandwidth of 131.13 GiB/s. Could you please tell me which command I should use?

If you have administrator access, then sudo dmidecode -t 17, otherwise you'll have to ask your system administrator. I ask because that chip has 6-channel memory, so you need exactly 12 (or 24) identical memory DIMMs, balanced 6 (or 12) on each socket, to fully saturate the memory bandwidth. You can also measure the memory bandwidth empirically using the STREAM benchmark. Hopefully your system is set up properly.
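
(For example, the populated slots can be summarized from that output with something like sudo dmidecode -t 17 | grep -E 'Size|Locator|Speed'.)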

That's plenty of bandwidth. Have you tried MKL? I'm assuming at this point that it stomps all over BLIS...

Those STREAM numbers are bogus. You cannot get that from DRAM. You set the size too small and it fits in cache.
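
(STREAM's array size is a compile-time constant, and each array needs to be several times larger than the ~71.5 MiB of combined L3 here for the result to measure DRAM. Assuming the stock stream.c, a suitable build might look like gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 stream.c -o stream.)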

@FuncJ I guess you deleted your post with the MKL numbers? I have no problem with posting those here. From those results, though, since they more or less track the BLIS curve while sitting ~15-20% higher, I think this is unlikely to be a BLIS-specific problem. My money is on a hardware misconfiguration (e.g. the distribution of DIMMs that I mentioned above) or maybe a thermal issue? You might get some help on the Intel or MKL forums. Good luck!

I'm going to close this unless someone can reproduce on another system.