Poor dgemm scaling with number of threads

bandokihiro opened this issue 2 years ago · comments

I have an application that performs relatively large dgemm operations of particular shapes using multi-threaded BLAS implementations on CPUs. I am interested in getting good performance on AMD Epyc 7742 64-core processors. Based on my research, the BLIS library provides the highest performance, and I have reproduced the wiki's results for square dgemm (see plot below). The command line I used for this is:

OMP_NUM_THREADS=64 OMP_PROC_BIND=true numactl --interleave 0-3 --physcpubind 0-63 -- ./a.out

However, my application doesn't scale well with the number of threads. To investigate, I used the same kind of benchmark to measure how dgemm scales for the particular shapes my application uses. The scaling plot is found below:

The same command line was used and the value of the env variable OMP_NUM_THREADS was varied. I wanted to know if this looks normal and, if not, how I can improve this. The shapes of the matrices are a small operator matrix on the left and a fat matrix on the right. Below are the exact shapes ((M,K,N) for MxK x KxN):

0: 125,125,320000
1: 375,125,320000
2: 125,375,320000 (left is transposed)
3: 150,125,320000
4: 450,125,320000
5: 125,150,320000 (left is transposed)
6: 125,450,320000 (left is transposed)
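The thread-count sweep described above could be scripted along the following lines. This is only a sketch: the numactl arguments and ./a.out are copied from the command line in the post, while build_run, sweep, and the particular set of thread counts are hypothetical names and values chosen for illustration.

```python
import os
import subprocess

# Fixed part of the command line quoted in the post; only OMP_NUM_THREADS varies.
NUMACTL = ["numactl", "--interleave", "0-3", "--physcpubind", "0-63", "--"]
BENCHMARK = "./a.out"  # the poster's benchmark binary


def build_run(num_threads):
    """Return (argv, env) for one benchmark run with the given thread count."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OMP_PROC_BIND"] = "true"
    return NUMACTL + [BENCHMARK], env


def sweep(thread_counts=(1, 2, 4, 8, 16, 32, 64)):
    """Run the benchmark once per thread count (the counts are an assumption)."""
    for n in thread_counts:
        argv, env = build_run(n)
        subprocess.run(argv, env=env, check=True)
```

Calling sweep() on the target machine would launch one run per thread count; build_run can be inspected on its own without numactl installed.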

Thank you very much for your help.

Devin Matthews · Answer 1 · Fri Jul 15 2022 01:00:58 GMT+0800 (China Standard Time)

What are the storage formats of the A, B, and C matrices (row- or column-major)? Certain combinations will lead to sub-optimal performance due to the storage preference of the microkernel and the way the algorithm is structured.

Kihiro Bando · Answer 2 · Fri Jul 15 2022 01:02:34 GMT+0800 (China Standard Time)

A, B, and C are all row major.

Devin Matthews · Answer 3 · Fri Jul 15 2022 01:04:22 GMT+0800 (China Standard Time)

Yup. m << n with row-major C is the problem case. If you can switch C to column-major that might be a quick fix. Longer-term we need to add some additional flexibility to the algorithm for non-square GEMMs.
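To make the layout difference concrete for case 0 above (C is 125 x 320000), the sketch below prints the element strides of C under each storage format. The helper strides is a hypothetical name for illustration. It only shows how the buffer is laid out and does not by itself explain the slowdown, which the comments above attribute to the microkernel's storage preference and the structure of the algorithm.

```python
def strides(rows, cols, layout):
    """Return (row_stride, col_stride) in elements for a dense rows x cols matrix."""
    if layout == "row":
        return cols, 1   # moving down one row skips a full row of cols elements
    if layout == "col":
        return 1, rows   # moving right one column skips a full column of rows elements
    raise ValueError("layout must be 'row' or 'col'")


M, N = 125, 320000  # shape of C for case 0 in the original post
for layout in ("row", "col"):
    rs, cs = strides(M, N, layout)
    print(f"{layout}-major C: row stride {rs}, column stride {cs} "
          f"({rs * 8} bytes between vertically adjacent doubles)")
```

A row-major 125 x 320000 buffer is the same memory as a column-major 320000 x 125 matrix, so a library working in column-major terms sees the short and long dimensions swapped.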

Devin Matthews · Answer 4 · Fri Jul 15 2022 01:05:57 GMT+0800 (China Standard Time)

If you would like to participate in an experiment I can make you a branch to try with some modifications.

Kihiro Bando · Answer 5 · Fri Jul 15 2022 01:07:20 GMT+0800 (China Standard Time)

ok, thank you very much. trying it with the benchmark above is quick so I'll try and report what I get today.

> If you would like to participate in an experiment I can make you a branch to try with some modifications.

sure, I am down

Kihiro Bando · Answer 6 · Fri Jul 15 2022 07:48:39 GMT+0800 (China Standard Time)

Here are the numbers in the column-major setting.

Devin Matthews · Answer 7 · Sun Jul 17 2022 00:35:55 GMT+0800 (China Standard Time)

Definitely looks better. Your problems are probably somewhat memory-bound so you may not be able to do better. I'll try and set up a branch for you to test out early next week.
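A rough way to gauge the "somewhat memory-bound" remark is to estimate the arithmetic intensity of each shape, that is, flops per byte of compulsory memory traffic. The sketch below assumes each of A, B, and C is moved to or from memory exactly once, which is a lower bound on traffic. The peak flop rate and bandwidth are assumed nominal figures for one EPYC 7742 socket, not measurements from this thread.

```python
# Shapes (M, K, N) copied from the list in the original post.
SHAPES = [
    (125, 125, 320000),
    (375, 125, 320000),
    (125, 375, 320000),
    (150, 125, 320000),
    (450, 125, 320000),
    (125, 150, 320000),
    (125, 450, 320000),
]

# ASSUMED nominal figures for one EPYC 7742 socket, for illustration only.
PEAK_GFLOPS = 64 * 2.25 * 16  # cores x base GHz x double-precision flops/cycle/core
PEAK_GBPS = 204.8             # 8 channels of DDR4-3200


def intensity(m, k, n, bytes_per_elem=8):
    """Flops per byte, assuming A, B, and C each cross the memory bus once."""
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


# Machine balance: flops the socket can do per byte it can stream from memory.
balance = PEAK_GFLOPS / PEAK_GBPS
for i, (m, k, n) in enumerate(SHAPES):
    print(f"case {i}: {intensity(m, k, n):6.2f} flop/byte "
          f"(assumed machine balance {balance:.2f})")
```

Under these assumptions the intensities land within a small factor of the machine balance, which is at least consistent with the problems being partly limited by memory bandwidth. A large square dgemm of size n has an intensity of roughly n/12 by the same formula, which is far higher.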

Kihiro Bando · Answer 8 · Sun Jul 17 2022 03:41:46 GMT+0800 (China Standard Time)

Thanks. Is your experiment something that you expect you will be able to stably merge into master relatively easily? If your changes allow me to get the same kind of performance with a row-major ordering, I would hold off making a bunch of changes in my code to make the layout change.

Devin Matthews · Answer 9 · Tue Jul 19 2022 02:52:19 GMT+0800 (China Standard Time)

@bandokihiro I will make branch(es) after meeting with @fgvanzee tomorrow. I want to make sure I test things which can potentially be incorporated into the master branch. Do note that it might take some time to get merged since we will have to do additional benchmarking etc. Thanks for the bug report and your willingness to be a guinea pig.

Devin Matthews · Answer 10 · Wed Jul 20 2022 03:27:13 GMT+0800 (China Standard Time)

@bandokihiro what version of BLIS are you using now?

Devin Matthews · Answer 11 · Wed Jul 20 2022 04:59:24 GMT+0800 (China Standard Time)

@bandokihiro please try the branch gemmsup-bp-flip. If you could run both row- and column-major experiments that should provide some valuable information.

Kihiro Bando · Answer 12 · Wed Jul 20 2022 05:02:30 GMT+0800 (China Standard Time)

I am on the latest tag so 0.9.0. I'll try to run this today, thanks.

Kihiro Bando · Answer 13 · Wed Jul 20 2022 08:18:17 GMT+0800 (China Standard Time)

column-major / row-major

row-major improved but column-major's performance degraded

Devin Matthews · Answer 14 · Thu Jul 21 2022 23:59:52 GMT+0800 (China Standard Time)

@bandokihiro I'm trying to get access to the 1 (!) AMD machine on campus. If not I may ask you to run even more experiments.

Devin Matthews · Answer 15 · Fri Jul 22 2022 00:01:17 GMT+0800 (China Standard Time)

@bandokihiro to that purpose, would you mind sharing your microbenchmark code?

Kihiro Bando · Answer 16 · Fri Jul 22 2022 00:07:01 GMT+0800 (China Standard Time)

I am using kokkos-kernels, which redirects the dgemm calls to third-party libraries. I don't interface with blis directly. The layout is controlled by a simple template argument of Kokkos views. I think the way it works under the hood is that the redirection is successful when the layouts of A, B, and C match. For one of them (column or row-major), all matrices are transposed in the call to the library. Is that still of interest to you?
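The "all matrices are transposed in the call" idea can be illustrated with the identity C = A*B if and only if C^T = B^T * A^T. The sketch below uses a naive pure-Python column-major gemm as a stand-in for a BLAS library. Both function names are hypothetical, and whether kokkos-kernels does exactly this internally is the poster's guess, not something confirmed here.

```python
def gemm_colmajor(m, n, k, a, lda, b, ldb, c, ldc):
    """Naive column-major C = A*B on flat lists (reference only, not BLIS)."""
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = acc


def gemm_rowmajor_via_colmajor(m, n, k, a, b, c):
    """Row-major C (m x n) = A (m x k) * B (k x n) using the column-major routine.

    A row-major X occupies the same memory as a column-major X^T, so we
    compute C^T = B^T * A^T by swapping the operands and the m/n dimensions.
    """
    gemm_colmajor(n, m, k, b, n, a, k, c, n)


# Small row-major example: A is 2 x 3, B is 3 x 4, C is 2 x 4.
a = [1, 2, 3,
     4, 5, 6]
b = [1, 2, 3, 4,
     5, 6, 7, 8,
     9, 10, 11, 12]
c = [0.0] * 8
gemm_rowmajor_via_colmajor(2, 4, 3, a, b, c)
print(c)
```

No data is copied or physically transposed; only the argument order and dimensions change, which is why a wrapper can support both layouts on top of a single-layout library.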

Devin Matthews · Answer 17 · Fri Jul 22 2022 00:10:18 GMT+0800 (China Standard Time)

OK, I can write my own pretty easily.

Devin Matthews · Answer 18 · Sat Jul 23 2022 03:49:55 GMT+0800 (China Standard Time)

@bandokihiro I was able to reproduce your performance graphs and do some additional experiments. I think we've found a simple solution which increases the performance of the row-major case so that it is similar to the column-major case. However we need to check that it doesn't have any unintended consequences.

Kihiro Bando · Answer 19 · Sat Jul 23 2022 03:59:21 GMT+0800 (China Standard Time)

Great, thank you very much! Let me know when I can try.

Devin Matthews · Answer 20 · Wed Jul 27 2022 05:48:03 GMT+0800 (China Standard Time)

@bandokihiro please try the gemmsup-kc0 branch. If this gives good performance for both storage formats then we can go ahead and merge.

Kihiro Bando · Answer 21 · Wed Jul 27 2022 11:21:24 GMT+0800 (China Standard Time)

It looks pretty good. This figure summarizes all results. Left is column-major, right is row-major:

Field G. Van Zee · Answer 22 · Thu Jul 28 2022 23:13:37 GMT+0800 (China Standard Time)

Glad we were able to accommodate you on this, @bandokihiro. Thank you for helping @devinamatthews zero in on a solution.

Kihiro Bando · Answer 23 · Thu Jul 28 2022 23:23:48 GMT+0800 (China Standard Time)

No problem. Thank you very much for your assistance.