flame / blis

BLAS-like Library Instantiation Software Framework


Parallelization of the PC loop

devinamatthews opened this issue · comments

Many applications (e.g. in machine learning) end up with matrix products that have small(ish) m and n dimensions but a large k dimension. For these cases, one would need to parallelize the pc loop to take advantage of threading efficiently. An example application is #536. The downside of parallelizing the pc loop is that extra workspace is required to store the partial C matrices, and then these matrices have to be reduced (probably in parallel).
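For reference, the decomposition behind this can be written as follows (notation added here for clarity; $nt_{pc}$ is the number of thread groups splitting the k dimension, matching the plan below):

$$C \leftarrow \beta C + \alpha \sum_{p=0}^{nt_{pc}-1} A_p B_p$$

where $A_p$ and $B_p$ are the p-th k-slabs of A and B. Every thread group except the one that owns the real C needs its own m-by-n accumulator, which is where the extra $(nt_{pc}-1) \cdot m \cdot n$ workspace and the final reduction come from.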

To support parallelizing this loop, I propose the following plan (rough sketches in code follow the list):

  • Determine the maximum size max_size_c of the temporary C matrices that we are willing to store. This could be a constant (e.g. 8MiB) or it could depend on the cache blocking parameters in some way (e.g. not larger than the sum of the A block and B panel sizes).
  • Modify the automatic partitioning code to set the parallelism of the PC loop to the maximum value nt_pc (which is a factor of nt?) for which (nt_pc-1)*m*n <= max_size_c, and then assign the remaining ways of parallelism as if nt were nt/nt_pc. Alternatively, the current algorithm could be extended to handle nt_ic, nt_jc, and nt_pc simultaneously and also obey the (nt_pc-1)*m*n <= max_size_c restriction. If the user sets BLIS_PC_NT then that can be used as-is (without obeying the size restriction?).
  • In bli_gemm_blk_var3.c:
    • In each thread group (after splitting for parallelism in the pc direction) except the "main" thread group, the main thread should check out a block from the existing pool for C, which is of size m*n.
    • The main thread group gets the "real" C and keeps beta as-is; all other threads get a chunk of the C buffer and set beta = 0.
    • Parallelize the pc loop as in var1 and var2.
    • After the loop, reduce (sum) the additional blocks back into the main block. This can be parallelized over m and n, and use e.g. axpym.
    • If the blocks of C should be at most m_c along the m dimension, then the reduction step must happen in bli_gemm_blk_var1.c.
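A minimal sketch of the factor-selection rule in the second bullet (nt_pc and max_size_c are names from this thread, not existing BLIS identifiers, and this helper is an assumption about how the rule could look rather than the actual partitioning code):

```c
#include "blis.h"

// Hypothetical helper: choose the largest factor nt_pc of nt such that the
// extra workspace, (nt_pc-1)*m*n elements, fits within max_size_c bytes.
// The remaining nt/nt_pc ways would then be assigned to the ic/jc loops.
static dim_t choose_nt_pc( dim_t nt, dim_t m, dim_t n,
                           siz_t max_size_c, siz_t dt_size )
{
    dim_t nt_pc = 1;

    for ( dim_t f = 1; f <= nt; f++ )
    {
        if ( nt % f != 0 ) continue;                              // only factors of nt
        if ( ( siz_t )( f - 1 ) * m * n * dt_size > max_size_c ) break;
        nt_pc = f;                                                // largest feasible so far
    }

    return nt_pc;
}
```

If the user sets BLIS_PC_NT explicitly, this check would presumably be skipped, per the open question above.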
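And a self-contained sketch of the workspace-and-reduce structure from the last set of bullets, written against the typed API plus OpenMP rather than the actual bli_gemm_blk_var3.c/thrinfo machinery (gemm_pc_sketch, the slab indexing, and the plain accumulation loop are illustrative stand-ins, not proposed BLIS code):

```c
#include <stdlib.h>
#include "blis.h"

// Illustration of pc-loop parallelism with per-group temporary C buffers:
// group 0 writes to the real C with the original beta; every other group
// writes its k-slab product into its own m x n buffer with beta = 0 (so the
// workspace never needs zeroing), and the partial products are then reduced
// back into C. A real implementation would check blocks out of the memory
// pool and could use axpym for the (parallelizable) reduction.
void gemm_pc_sketch( dim_t m, dim_t n, dim_t k, dim_t nt_pc,
                     double alpha, double* a, inc_t rsa, inc_t csa,
                                   double* b, inc_t rsb, inc_t csb,
                     double beta,  double* c, inc_t rsc, inc_t csc )
{
    // Extra workspace: (nt_pc-1)*m*n elements, stored column-major.
    double* w = malloc( ( size_t )( nt_pc - 1 ) * m * n * sizeof( double ) );

    #pragma omp parallel for num_threads( nt_pc )
    for ( dim_t g = 0; g < nt_pc; g++ )
    {
        // This group's slab of the k dimension.
        dim_t  k_lo = ( k *   g       ) / nt_pc;
        dim_t  k_g  = ( k * ( g + 1 ) ) / nt_pc - k_lo;
        double zero = 0.0;

        double* c_g    = ( g == 0 ) ? c     : w + ( g - 1 ) * m * n;
        double* beta_g = ( g == 0 ) ? &beta : &zero;
        inc_t   rsc_g  = ( g == 0 ) ? rsc   : 1;
        inc_t   csc_g  = ( g == 0 ) ? csc   : m;

        bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
                   m, n, k_g, &alpha,
                   a + k_lo * csa, rsa, csa,   // columns k_lo..k_lo+k_g-1 of A
                   b + k_lo * rsb, rsb, csb,   // rows    k_lo..k_lo+k_g-1 of B
                   beta_g, c_g, rsc_g, csc_g );
    }

    // Reduction: sum the partial products back into C. In BLIS this step
    // could be parallelized over m and n and expressed with axpym.
    for ( dim_t g = 1; g < nt_pc; g++ )
        for ( dim_t j = 0; j < n; j++ )
            for ( dim_t i = 0; i < m; i++ )
                c[ i * rsc + j * csc ] += w[ ( g - 1 ) * m * n + j * m + i ];

    free( w );
}
```

Here each bli_dgemm call stands in for one thread group's pass over the ic/jc/ir/jr loops; in the real var3 change the block(s) would come from the memory pool and the pointer would be broadcast among the threads, as discussed below.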

The main thread should check out a block from the existing pool for C, which is of size (nt_pc-1)*m*n, and broadcast it to the other threads.

Since the main thread takes the "real" C and the original beta, couldn't the broadcast be replaced by simply resetting each of the other threads' parts to zero?
We should also be mindful of floating-point accuracy (especially for float), since the rounding order changes.
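A standard single-precision illustration of the rounding-order point (not from this thread; 2^24 is just where the spacing between adjacent floats reaches 2):

```c
#include <stdio.h>

// Above 2^24, adjacent floats are 2 apart, so adding 1.0f to 2^24 is lost to
// rounding. Summing the same terms in a different order therefore gives a
// different result, which is why a k-parallel reduction can change the last
// bits of C relative to a serial accumulation.
int main( void )
{
    float big = 16777216.0f;             // 2^24
    float x1  = ( big + 1.0f ) - big;    // 0.0f: the 1.0f is rounded away
    float x2  = 1.0f + ( big - big );    // 1.0f
    printf( "%g %g\n", x1, x2 );         // prints "0 1"
    return 0;
}
```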

@hominhquan by broadcast I just mean pointer broadcast. Only the main thread includes the beta*C part.

@fgvanzee I welcome your comments.

Sure, though I'm going to need some time to think about this and how it compares and contrasts with my own ideas.

ii. 1 if n > n_c.

@devinamatthews Can you explain the thinking behind this?

ii. 1 if n > n_c

I'm reconsidering this now, but my thinking was that a) you only want to parallelize over k if n is smallish anyway and b) you only want to check out a block for C and do a reduction once. But, on second thought, I don't think there's really a downside to b), and maybe you do still want to parallelize over k for some bizarre situation like m=20, n=5000, k=10000.

and maybe you do still want to parallelize over k for some bizarre situation like m=20, n=5000, k=10000.

I think if @rvdg were here, he would likely say that this isn't bizarre at all. We want to have maximum flexibility in the dimensions along which we parallelize, and oftentimes that means parallelizing two dimensions simultaneously, especially when you have large numbers of cores. And if only one of m and n is large (and k is also large), that means (potentially) parallelizing k.

Agreed. Drop that requirement.

I think I know how to solve this for generalized problems. It builds upon the idea of carouseling, which @tlrmchlsmth developed and prototyped years ago. The key feature of that solution is that it avoids using large workspace, but it will also require significant non-trivial changes to the overall matrix multiplication algorithm.

(Historical aside: I found Tyler's original implementation to be a bit rough around the edges. I also did not fully understand it at the time, and those two things contributed to it going unmerged. But the idea, which was delightfully simple, stuck in my head.)

So, I think it's time to reimplement Tyler's carouseling. I'll likely start off in the gemmlike sandbox so that I can get a bird's eye view of what parts of the Goto algorithm need changing, which will then inform if and how I can cleanly port the idea to the conventional code path.

@fgvanzee let's talk about this; IIRC carouseling doesn't take full advantage of parallelism in both dimensions.

Tuesday then.

@fgvanzee let's talk about this; IIRC carouseling doesn't take full advantage of parallelism in both dimensions.

The thing about carouseling is that it doesn't actually increase the number of FMA instructions that you can execute in parallel. Instead, the advantage comes from not messing up your block sizes. So instead of just parallelizing along the m dimension and cutting the matrix up into partitions that are too small, we also parallelize along the k dimension. And then the carousel mechanism just manages the concurrency.

Anyway, I think there is probably enough interesting stuff here to explore the tradeoffs between the two approaches and situations where each does better than the other!