BLIS DGEMM performance decreases with increasing threads

Question

BLIS DGEMM performance decreases with increasing threads

babreu-ncsa opened this issue 3 years ago · comments

Hello,

I am trying to optimize a Fortran code that relies heavily on BLAS DGEMM calls by using threaded-BLIS. I am following somehow this example from the AMD Developer webpage (page 7 of the pdf). Here's my code:

program amd_dgemm
      use, intrinsic :: iso_fortran_env
      implicit none
      integer, parameter :: dp = REAL64 ! double precision float
      integer, parameter :: i32 = INT32 ! 32-bit integer
      integer(i32), parameter :: ord1=4000_i32  ! leading dim of matrix
      integer(i32), parameter :: ord2=2000_i32   ! lower dim of matrix
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: i

      ! allocate
      allocate(m(ord1, ord2))
      allocate(v(ord2,1))
      allocate(p(ord1,1))

      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp

      ! query
      call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

      ! now time it
      call cpu_time(startT)
      do i = 1, 100
              call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
      enddo
      call cpu_time(endT)

      PRINT *, "== Matrix multiplication using AMD BLIS DGEMM =="
      PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
 50   FORMAT(A,F12.5,A)
      PRINT *, ""

end program amd_dgemm

Here's how I'm compiling this:

BLIS_PREFIX	= /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen2/gcc-10.2.0/amdblis-2.2-jcmqdhq7cozl3yw3xkvcec4xsbt7o7kl
BLIS_INC	= $(BLIS_PREFIX)/include/blis
BLIS_LIB	= $(BLIS_PREFIX)/lib/libblis-mt.a

OTHER_LIBS	= -lm -lpthread -fopenmp

FC		= gfortran
CFLAGS		= -I$(BLIS_INC)
LINKER		= $(FC)

OBJS		= amd_dgemm.o

%.o: %.f90
	$(FC) $(CFLAGS) -c $< -o $@

all: $(OBJS)
	$(LINKER) $(OBJS) $(BLIS_LIB) $(OTHER_LIBS) -o amd_dgemm.x

Then, I am running this on an AMD EPYC 7742 node, claiming 4 cores. Here's what I get as output when playing with BLIS_NUM_THREADS:

[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=1
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at    369.05500 milliseconds ==
 
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=2
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at  15869.85200 milliseconds ==
 
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=3
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at  21974.86700 milliseconds ==
 
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=4
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at  25585.99200 milliseconds ==

Do you have any suggestions on what might be happening? Any comments would be helpful. I've also tried different OMP_PROC_BIND and OMP_PLACES, as suggested in the documents here on github, but it didn't change anything.
Thank you!

Jeff Hammond · Answer 1 · Wed Sep 15 2021 17:28:57 GMT+0800 (China Standard Time)

i'll try later but you might use taskset (see https://www.glennklockwood.com/hpc-howtos/process-affinity.html for useful examples) to make sure the affinity is proper.

RuQing Xu · Answer 2 · Thu Sep 16 2021 03:18:29 GMT+0800 (China Standard Time)

As far as what I see from this Fortran code, it is in fact a GEMV call? (N=1)

If that is the case the procedure is memory-bound and is hardly likely to benefit from multithreading.

Bruno Abreu · Answer 3 · Thu Sep 16 2021 03:48:54 GMT+0800 (China Standard Time)

As far as what I see from this Fortran code, it is in fact a GEMV call? (N=1)

If that is the case the procedure is memory-bound and is hardly likely to benefit from multithreading.

@xrq-phys Thanks for the comment, but I don't think that that is the case. The exact same code benefits from other threaded-BLAS implementations in different architectures (see this Intel community post, for instance)

Devin Matthews · Answer 4 · Thu Sep 16 2021 03:54:41 GMT+0800 (China Standard Time)

MKL is probably calling GEMV internally. BLIS *could* do this but currently does not. Also note that BLIS does not currently have multi-threaded level-2 BLAS.

Bruno Abreu · Answer 5 · Thu Sep 16 2021 04:00:43 GMT+0800 (China Standard Time)

@devinamatthews Thank you! I'm still a bit confused on how the clock time can increase from 369 ms to 15870 ms when BLIS_NUM_THREADS goes from 1 to 2, do you have any insights?

Devin Matthews · Answer 6 · Thu Sep 16 2021 04:09:54 GMT+0800 (China Standard Time)

Yes: you have a GEMM with m=4000, n=1, k=2000 and column-major storage. On AMD architectures this is internally recast as m=1, n=4000, k=2000 with row-major storage. m=1 is critically fatal to performance in BLIS because we will be repeatedly and in parallel "packing" a 1x256 matrix (k=2000 is split into chunks of size <= 256) into a 6x256 buffer with zero-padding. This adds a tremendous amount of overhead and the miniscule amount of actual work available leads to high thread contention (and maybe even false sharing because of the small sizes?). A threaded GEMV can avoid these pitfalls by using a different algorithm and avoiding packing.

Bruno Abreu · Answer 7 · Thu Sep 16 2021 04:26:19 GMT+0800 (China Standard Time)

@devinamatthews Thanks again for your time! I'm going to experiment with GEMV, instead. Closing this issue.

Bruno Abreu · Answer 8 · Thu Sep 16 2021 05:36:44 GMT+0800 (China Standard Time)

I may just have to dive deeper in the documents to figure out how to use this properly, but here is as a true DGEMM with M=N=K=10000, to keep this thread documented.

program amd_true_dgemm
      use, intrinsic :: iso_fortran_env
      implicit none
      integer, parameter :: dp = REAL64 ! double precision float
      integer, parameter :: i32 = INT32 ! 32-bit integer
      integer(i32), parameter :: ord1=10000_i32  ! leading dim of matrix A
      integer(i32), parameter :: ord2=10000_i32   ! lower dim of matrix A
      integer(i32), parameter :: ord3=10000_i32   ! other dim of B
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: i

      ! allocate
      allocate(m(ord1,ord2))
      allocate(v(ord2,ord3))
      allocate(p(ord1,ord3))

      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp

      ! call AMD BLIS (syntax below, first call usually a query)
      ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
      call dgemm('N', 'N', ord1, ord3, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

      ! now time it
      call cpu_time(startT)
      do i = 1, 1
              call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
      enddo
      call cpu_time(endT)

      PRINT *, "== Matrix multiplication using AMD BLIS DGEMM =="
      PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
 50   FORMAT(A,F12.5,A)
      PRINT *, ""

end program amd_true_dgemm

A few outputs:

export BLIS_NUM_THREADS=1

[babreu@exp-9-56 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at     48.88309 milliseconds ==

export BLIS_NUM_THREADS=4

[babreu@exp-9-56 amd_dgemm]$ ./amd_dgemm.x 
 == Matrix multiplication using AMD BLIS DGEMM ==
 == completed at     319.42300 milliseconds ==

with threads for some reason going to the same PSR:

   PID    TID COMMAND  USER     PSR
 95180      - amd_dgem babreu     -
     -  95180 -        babreu     0
     -  95181 -        babreu     0
     -  95182 -        babreu     0
     -  95183 -        babreu     0

export BLIS_NUM_THREADS=4 with numactl

[babreu@exp-9-55 amd_dgemm]$ numactl --cpunodebind=0 ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at     26.06485 milliseconds ==

with threads spread over

   PID    TID COMMAND  USER     PSR
100716      - amd_dgem babreu     -
     - 100716 -        babreu     2
     - 100717 -        babreu     4
     - 100718 -        babreu     8
     - 100719 -        babreu    12

... so... I guess I'm going on the right direction? Thank you all!

Devin Matthews · Answer 9 · Thu Sep 16 2021 06:50:47 GMT+0800 (China Standard Time)

Looks better. Try running >= 3 times and taking the best. This helps to get rid of noise like spurious page faults, interrupts, etc.