flame / blis

BLAS-like Library Instantiation Software Framework


Poor dgemm scaling with number of threads

bandokihiro opened this issue · comments

I have an application that performs relatively large dgemm calls with particular shapes, using multi-threaded BLAS implementations on CPUs. I am interested in getting good performance on AMD Epyc 7742 64-core processors. Based on my research, the blis library provides the highest performance, and I have reproduced the results for square dgemm in the wiki (see plot below). The command line I used for this is

OMP_NUM_THREADS=64 OMP_PROC_BIND=true numactl --interleave 0-3 --physcpubind 0-63 -- ./a.out

Figure 1

However, my application doesn't scale well with the number of threads. To investigate, I used the same kind of benchmark to measure how dgemm scales for the particular shapes my application uses. The scaling plot is below:
[Figure: AMDRome_benchmark_solver_gemm]
The same command line was used and the value of the env variable OMP_NUM_THREADS was varied. I wanted to know if this looks normal and, if not, how I can improve this. The shapes of the matrices are a small operator matrix on the left and a fat matrix on the right. Below are the exact shapes ((M,K,N) for MxK x KxN):

0: 125,125,320000
1: 375,125,320000
2: 125,375,320000 (left is transposed)
3: 150,125,320000
4: 450,125,320000
5: 125,150,320000 (left is transposed)
6: 125,450,320000 (left is transposed)

Thank you very much for your help.
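The thread sweep described above (same command line, only OMP_NUM_THREADS varied) can be sketched as a small helper. This is a sketch under assumptions: `sweep_commands`, `speedup`, and `parallel_efficiency` are hypothetical names, not part of BLIS or of the benchmark used in this thread, and the timings must come from your own runs.

```python
import shlex


def sweep_commands(thread_counts, binary="./a.out"):
    """Build one shell command per thread count.

    Mirrors the command quoted above: only OMP_NUM_THREADS changes,
    the numactl binding stays the same for every run.
    """
    commands = []
    for threads in thread_counts:
        commands.append(
            f"OMP_NUM_THREADS={threads} OMP_PROC_BIND=true "
            f"numactl --interleave 0-3 --physcpubind 0-63 -- {shlex.quote(binary)}"
        )
    return commands


def speedup(time_one_thread, time_n_threads):
    """Speedup relative to the single-threaded run."""
    return time_one_thread / time_n_threads


def parallel_efficiency(time_one_thread, time_n_threads, threads):
    """Speedup divided by thread count; 1.0 means perfect scaling."""
    return speedup(time_one_thread, time_n_threads) / threads


if __name__ == "__main__":
    for command in sweep_commands([1, 2, 4, 8, 16, 32, 64]):
        print(command)
```

Plotting parallel efficiency rather than raw GFLOP/s makes it easier to compare how the different shapes scale.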

What are the storage formats of the A, B, and C matrices (row- or column-major)? Certain combinations will lead to sub-optimal performance due to the storage preference of the microkernel and the way the algorithm is structured.

A, B, and C are all row major.

Yup. m << n with row-major C is the problem case. If you can switch C to column-major that might be a quick fix. Longer-term we need to add some additional flexibility to the algorithm for non-square GEMMs.

If you would like to participate in an experiment I can make you a branch to try with some modifications.

ok, thank you very much. trying it with the benchmark above is quick so I'll try and report what I get today.

If you would like to participate in an experiment I can make you a branch to try with some modifications.

sure, I am down

[Figure: AMDRome_benchmark_solver_gemm_blis_LayoutLeft]
These are the numbers with the column-major layout.

Definitely looks better. Your problems are probably somewhat memory-bound, so you may not be able to do better. I'll try to set up a branch for you to test out early next week.
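A rough way to check the "somewhat memory-bound" remark is to compare each shape's arithmetic intensity with the machine balance. The sketch below is illustrative only. It assumes the best case for memory traffic (A and B read once, C written once, 8-byte doubles) and nominal single-socket EPYC 7742 figures that are assumptions here, not measurements from this thread: 2.25 GHz base clock, 16 double-precision flops per cycle per core, and 8 channels of DDR4-3200.

```python
# (M, K, N) shapes from the original post.
SHAPES = [
    (125, 125, 320000),
    (375, 125, 320000),
    (125, 375, 320000),
    (150, 125, 320000),
    (450, 125, 320000),
    (125, 150, 320000),
    (125, 450, 320000),
]

# Assumed nominal hardware figures (not measured in this thread).
PEAK_GFLOPS = 64 * 2.25 * 16      # cores * GHz * DP flops per cycle per core
PEAK_GBYTES_PER_S = 8 * 25.6      # 8 DDR4-3200 channels at 25.6 GB/s each


def gemm_flops(m, k, n):
    """Floating-point operations for C = A * B (one multiply and one add per term)."""
    return 2 * m * k * n


def min_bytes(m, k, n, word=8):
    """Lower bound on memory traffic: read A and B once, write C once."""
    return word * (m * k + k * n + m * n)


def intensity(m, k, n):
    """Best-case arithmetic intensity in flops per byte."""
    return gemm_flops(m, k, n) / min_bytes(m, k, n)


def machine_balance():
    """Flops per byte needed to keep the cores busy at peak bandwidth."""
    return PEAK_GFLOPS / PEAK_GBYTES_PER_S


if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.2f} flop/byte")
    for m, k, n in SHAPES:
        print(f"{m:4d} x {k:4d} x {n}: {intensity(m, k, n):.2f} flop/byte")
```

Because real traffic is higher than this lower bound, shapes whose best-case intensity is only a small multiple of the machine balance can plausibly be limited by memory bandwidth rather than by compute.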

Thanks. Is your experiment something that you expect you will be able to stably merge into master relatively easily? If your changes allow me to get the same kind of performance with a row-major ordering, I would hold off on making a bunch of changes in my code to switch the layout.

@bandokihiro I will make branch(es) after meeting with @fgvanzee tomorrow. I want to make sure I test things which can potentially be incorporated into the master branch. Do note that it might take some time to get merged since we will have to do additional benchmarking etc. Thanks for the bug report and your willingness to be a guinea pig.

@bandokihiro what version of BLIS are you using now?

@bandokihiro please try the branch gemmsup-bp-flip. If you could run both row- and column-major experiments that should provide some valuable information.

I am on the latest tag so 0.9.0. I'll try to run this today, thanks.

[Figure: column-major / row-major]

Row-major improved, but column-major performance degraded.

@bandokihiro I'm trying to get access to the 1 (!) AMD machine on campus. If not I may ask you to run even more experiments.

@bandokihiro to that purpose, would you mind sharing your microbenchmark code?

I am using kokkos-kernels, which redirects the dgemm calls to third-party libraries. I don't interface with blis directly. The layout is controlled by a simple template argument of Kokkos views. I think the way it works under the hood is that the redirection succeeds when the layouts of A, B, and C match. For one of the layouts (column- or row-major), all matrices are transposed in the call to the library. Is that still of interest to you?
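The transposition mentioned above rests on a standard identity: a row-major M x N buffer is bit-for-bit a column-major N x M buffer, so a row-major C = A * B can be computed by a column-major routine as C^T = B^T * A^T with the operands swapped. The sketch below demonstrates that identity only. It is not the kokkos-kernels or BLIS implementation, and `gemm_colmajor` is a hypothetical naive reference standing in for a column-major dgemm.

```python
def gemm_colmajor(m, n, k, a, lda, b, ldb, c, ldc):
    """Naive reference: C (m x n) = A (m x k) * B (k x n), all column-major.

    Element (i, j) of a column-major matrix with leading dimension ld
    lives at index i + j * ld of the flat buffer.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = acc


def gemm_rowmajor_via_colmajor(m, n, k, a, b, c):
    """Row-major C (m x n) = A (m x k) * B (k x n) using the column-major routine.

    The row-major buffers are reinterpreted as column-major transposes,
    so the call computes C^T (n x m) = B^T (n x k) * A^T (k x m).
    No data is copied or rearranged.
    """
    gemm_colmajor(n, m, k, b, n, a, k, c, n)


if __name__ == "__main__":
    # Row-major A = [[1, 2], [3, 4]] and B = [[5, 6, 7], [8, 9, 10]].
    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
    c = [0.0] * 6
    gemm_rowmajor_via_colmajor(2, 3, 2, a, b, c)
    print(c)
```

One consequence is that a row-major problem with m << n looks to a column-major library like a problem with m >> n, which is why the storage format changes which case the library sees.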

OK, I can write my own pretty easily.

@bandokihiro I was able to reproduce your performance graphs and do some additional experiments. I think we've found a simple solution which increases the performance of the row-major case so that it is similar to the column-major case. However we need to check that it doesn't have any unintended consequences.

Great, thank you very much! Let me know when I can try.

@bandokihiro please try the gemmsup-kc0 branch. If this gives good performance for both storage formats then we can go ahead and merge.

It looks pretty good. This figure summarizes all results. Left is column-major, right is row-major:
figure

Glad we were able to accommodate you on this, @bandokihiro. Thank you for helping @devinamatthews zero in on a solution.

No problem. Thank you very much for your assistance.