BLIS DGEMM performance decreases with increasing threads
babreu-ncsa opened this issue · comments
Hello,
I am trying to optimize a Fortran code that relies heavily on BLAS DGEMM calls by using threaded BLIS. I am loosely following this example from the AMD Developer webpage (page 7 of the PDF). Here's my code:
program amd_dgemm
   use, intrinsic :: iso_fortran_env
   implicit none
   integer, parameter :: dp = REAL64    ! double-precision float
   integer, parameter :: i32 = INT32    ! 32-bit integer
   integer(i32), parameter :: ord1 = 4000_i32   ! leading dim of matrix
   integer(i32), parameter :: ord2 = 2000_i32   ! lower dim of matrix
   real(dp) :: startT, endT
   real(dp), dimension(:,:), allocatable :: m, v, p
   integer(i32) :: i

   ! allocate
   allocate(m(ord1,ord2))
   allocate(v(ord2,1))
   allocate(p(ord1,1))

   ! fill in with random stuff
   call random_seed()
   call random_number(m)
   call random_number(v)
   p = 0.0_dp

   ! warm-up query call
   call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

   ! now time it
   call cpu_time(startT)
   do i = 1, 100
      call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
   enddo
   call cpu_time(endT)

   PRINT *, "== Matrix multiplication using AMD BLIS DGEMM =="
   PRINT 50, " == completed at ", (endT - startT)*1000, " milliseconds =="
50 FORMAT(A,F12.5,A)
   PRINT *, ""
end program amd_dgemm
Here's how I'm compiling this:
BLIS_PREFIX = /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen2/gcc-10.2.0/amdblis-2.2-jcmqdhq7cozl3yw3xkvcec4xsbt7o7kl
BLIS_INC = $(BLIS_PREFIX)/include/blis
BLIS_LIB = $(BLIS_PREFIX)/lib/libblis-mt.a
OTHER_LIBS = -lm -lpthread -fopenmp
FC = gfortran
CFLAGS = -I$(BLIS_INC)
LINKER = $(FC)
OBJS = amd_dgemm.o
%.o: %.f90
$(FC) $(CFLAGS) -c $< -o $@
all: $(OBJS)
$(LINKER) $(OBJS) $(BLIS_LIB) $(OTHER_LIBS) -o amd_dgemm.x
Then I am running this on an AMD EPYC 7742 node, claiming 4 cores. Here's the output I get when playing with BLIS_NUM_THREADS:
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=1
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 369.05500 milliseconds ==
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=2
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 15869.85200 milliseconds ==
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=3
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 21974.86700 milliseconds ==
[babreu@exp-9-55 amd_dgemm]$ export BLIS_NUM_THREADS=4
[babreu@exp-9-55 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 25585.99200 milliseconds ==
Do you have any suggestions on what might be happening? Any comments would be helpful. I've also tried different OMP_PROC_BIND and OMP_PLACES settings, as suggested in the documentation here on GitHub, but it didn't change anything.
Thank you!
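For reference, the binding settings in question are standard OpenMP environment variables; a typical combination (the values here are illustrative, not the exact ones I tried) looks like:

```shell
# Pin the BLIS/OpenMP worker threads to distinct physical cores
export BLIS_NUM_THREADS=4
export OMP_PROC_BIND=close    # keep threads near the master thread
export OMP_PLACES=cores       # one place per physical core
./amd_dgemm.x
```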
I'll try later, but you might use taskset (see https://www.glennklockwood.com/hpc-howtos/process-affinity.html for useful examples) to make sure the affinity is proper.
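For example, a sketch of pinning the run to four cores (the core list is illustrative; on an EPYC 7742 you would pick cores on a single NUMA node):

```shell
# Restrict the process (and all its threads) to cores 0-3
taskset -c 0-3 ./amd_dgemm.x

# While it runs, verify which processor each thread landed on
ps -mo pid,tid,comm,user,psr -p "$(pgrep -f amd_dgemm.x)"
```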
As far as I can see from this Fortran code, it is in fact a GEMV call (N=1)? If that is the case, the operation is memory-bound and is unlikely to benefit from multithreading.
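To put rough numbers on the memory-bound argument, here is a back-of-the-envelope arithmetic-intensity estimate (pure Python, no BLAS involved; the first shape is taken from the original post, the square comparison size is my own choice):

```python
# Arithmetic intensity (flops per byte moved) for the two problem shapes.
# GEMV-shaped call from the original post: m=4000, k=2000, n=1.
m, k, n = 4000, 2000, 1
flops = 2.0 * m * n * k                      # one multiply-add per element pair
bytes_moved = 8.0 * (m * k + k * n + m * n)  # read A and x, write y (float64)
gemv_intensity = flops / bytes_moved

# A square GEMM of comparable scale reuses each element of A and B many times.
M = N = K = 2000
gemm_flops = 2.0 * M * N * K
gemm_bytes = 8.0 * (M * K + K * N + M * N)
gemm_intensity = gemm_flops / gemm_bytes

print(f"GEMV-shaped: {gemv_intensity:.2f} flops/byte")   # well below 1
print(f"Square GEMM: {gemm_intensity:.2f} flops/byte")   # hundreds
```

With under one flop per byte, the n=1 case is limited by memory bandwidth, so extra threads mostly contend for the same bandwidth.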
@xrq-phys Thanks for the comment, but I don't think that is the case. The exact same code benefits from other threaded BLAS implementations on different architectures (see this Intel community post, for instance).
@devinamatthews Thank you! I'm still a bit confused on how the clock time can increase from 369 ms to 15870 ms when BLIS_NUM_THREADS goes from 1 to 2, do you have any insights?
Yes: you have a GEMM with m=4000, n=1, k=2000 and column-major storage. On AMD architectures this is internally recast as m=1, n=4000, k=2000 with row-major storage. m=1 is fatal to performance in BLIS because we will be repeatedly, and in parallel, "packing" a 1x256 matrix (k=2000 is split into chunks of size <= 256) into a 6x256 buffer with zero-padding. This adds a tremendous amount of overhead, and the minuscule amount of actual work available leads to high thread contention (and maybe even false sharing because of the small sizes?). A threaded GEMV can avoid these pitfalls by using a different algorithm and avoiding packing.
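A small sketch of the packing waste described above (MR=6 and the k-chunk size of 256 are taken from the comment, not queried from BLIS):

```python
# BLIS packs micro-panels of MR rows; with m=1 only one row carries data
# and the remaining MR-1 rows of each panel are zero padding.
MR = 6     # micro-kernel row dimension (from the comment above)
KC = 256   # k-dimension block size (from the comment above)
m, k = 1, 2000

panels = -(-k // KC)        # ceil(k / KC): packing passes per GEMM call
useful = m * k              # elements that carry actual data
packed = MR * KC * panels   # elements written into the pack buffers
print(f"{panels} packing passes, useful fraction = {useful / packed:.2%}")
```

So well over 80% of the packing traffic is zero padding, before any of the threading overhead is counted.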
@devinamatthews Thanks again for your time! I'm going to experiment with GEMV, instead. Closing this issue.
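For the record, the n=1 DGEMM computes exactly a matrix-vector product, so DGEMV is a mathematical drop-in replacement; a tiny pure-Python sanity check of the equivalence:

```python
# Check that a GEMM with an n=1 right-hand side equals a matrix-vector product.
m_rows, k_cols = 3, 4
A = [[float(i * k_cols + j) for j in range(k_cols)] for i in range(m_rows)]
x = [1.0, 2.0, 3.0, 4.0]

# GEMM view: B is a k x 1 matrix (the shape used in the original program)
B = [[xi] for xi in x]
C = [[sum(A[i][l] * B[l][0] for l in range(k_cols))] for i in range(m_rows)]

# GEMV view: y = A @ x
y = [sum(A[i][l] * x[l] for l in range(k_cols)) for i in range(m_rows)]

assert all(C[i][0] == y[i] for i in range(m_rows))
print("n=1 GEMM and GEMV agree")
```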
I may just have to dive deeper into the documentation to figure out how to use this properly, but here is a true DGEMM with M=N=K=10000, to keep this thread documented.
program amd_true_dgemm
   use, intrinsic :: iso_fortran_env
   implicit none
   integer, parameter :: dp = REAL64    ! double-precision float
   integer, parameter :: i32 = INT32    ! 32-bit integer
   integer(i32), parameter :: ord1 = 10000_i32   ! leading dim of matrix A
   integer(i32), parameter :: ord2 = 10000_i32   ! lower dim of matrix A
   integer(i32), parameter :: ord3 = 10000_i32   ! other dim of B
   real(dp) :: startT, endT
   real(dp), dimension(:,:), allocatable :: m, v, p
   integer(i32) :: i

   ! allocate
   allocate(m(ord1,ord2))
   allocate(v(ord2,ord3))
   allocate(p(ord1,ord3))

   ! fill in with random stuff
   call random_seed()
   call random_number(m)
   call random_number(v)
   p = 0.0_dp

   ! call AMD BLIS (syntax below, first call usually a query)
   ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
   call dgemm('N', 'N', ord1, ord3, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

   ! now time it
   call cpu_time(startT)
   do i = 1, 1
      call dgemm('N', 'N', ord1, ord3, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
   enddo
   call cpu_time(endT)

   PRINT *, "== Matrix multiplication using AMD BLIS DGEMM =="
   PRINT 50, " == completed at ", (endT - startT)*1000, " milliseconds =="
50 FORMAT(A,F12.5,A)
   PRINT *, ""
end program amd_true_dgemm
A few outputs:
export BLIS_NUM_THREADS=1
[babreu@exp-9-56 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 48.88309 milliseconds ==
export BLIS_NUM_THREADS=4
[babreu@exp-9-56 amd_dgemm]$ ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 319.42300 milliseconds ==
with threads for some reason going to the same PSR:
PID TID COMMAND USER PSR
95180 - amd_dgem babreu -
- 95180 - babreu 0
- 95181 - babreu 0
- 95182 - babreu 0
- 95183 - babreu 0
export BLIS_NUM_THREADS=4
with numactl:
[babreu@exp-9-55 amd_dgemm]$ numactl --cpunodebind=0 ./amd_dgemm.x
== Matrix multiplication using AMD BLIS DGEMM ==
== completed at 26.06485 milliseconds ==
with threads spread over different cores:
PID TID COMMAND USER PSR
100716 - amd_dgem babreu -
- 100716 - babreu 2
- 100717 - babreu 4
- 100718 - babreu 8
- 100719 - babreu 12
... so... I guess I'm going in the right direction? Thank you all!
Looks better. Try running >= 3 times and taking the best. This helps to get rid of noise like spurious page faults, interrupts, etc.
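A minimal sketch of that best-of-N timing pattern (plain Python; `work()` is a stand-in for the DGEMM call being benchmarked):

```python
import time

def work():
    # placeholder for the kernel being timed (e.g. the DGEMM call)
    return sum(i * i for i in range(100_000))

def best_of(n_repeats, fn):
    """Run fn n_repeats times and return the fastest wall-clock time."""
    best = float("inf")
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        fn()
        t1 = time.perf_counter()
        best = min(best, t1 - t0)
    return best

print(f"best of 3: {best_of(3, work) * 1e3:.3f} ms")
```

Note that `perf_counter` measures wall-clock time; a CPU-time clock (like Fortran's `cpu_time`) sums time across threads and will overstate multithreaded runs.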