flame / blis

BLAS-like Library Instantiation Software Framework

Inline ASM clobber list means no spare register for LLVM

jlinford opened this issue · comments

LLVM and Arm Compiler for Linux (ACfL) fail to compile BLIS for the armsve or a64fx configurations. The code also fails on upstream LLVM: https://godbolt.org/z/PPzhebnan

error: inline assembly requires more registers than available

Commenting out Line 27 makes LLVM/ACfL work.

In short, the problem seems to be that the inline assembly has a long clobber list and the function has many arguments. The more function arguments you pass into the asm as operands, the fewer registers you can include in the clobber list, and vice versa.
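To make the shape of the problem concrete, here is a stripped-down hypothetical (not the actual BLIS kernel; the operand names are only borrowed from it): several "m" operands, each of which needs a base register at its point of use inside the asm, combined with a clobber list that reserves nearly every general-purpose register.

void kernel_shape(void *a, void *b, void *c, long k_iter, long k_left,
                  double *alpha, double *beta, long rs_c, long cs_c)
{
    __asm__ volatile(
        // Each "m" operand referenced below must be addressable here, which
        // costs the compiler a spare GP register unless an sp+imm form is used.
        " ldr x0, %[aaddr] \n\t"
        " ldr x1, %[baddr] \n\t"
        " ldr x2, %[caddr] \n\t"
        /* ... microkernel body using x0-x28 and the vector registers ... */
        :
        : [aaddr] "m" (a), [baddr] "m" (b), [caddr] "m" (c),
          [k_iter] "m" (k_iter), [k_left] "m" (k_left),
          [alpha] "m" (alpha), [beta] "m" (beta),
          [rs_c] "m" (rs_c), [cs_c] "m" (cs_c)
        // With x0-x28 all clobbered, almost nothing is left for the compiler
        // to hold the operand addresses - this is the situation in which LLVM
        // reports "inline assembly requires more registers than available".
        : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9",
          "x10", "x11", "x12", "x13", "x14", "x15", "x16", "x17", "x18", "x19",
          "x20", "x21", "x22", "x23", "x24", "x25", "x26", "x27", "x28",
          "memory");
}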

Affected files:

Arm's LLVM and GNU compiler teams have looked into it and concluded that these files push the limits of reasonable use of inline asm. GCC is only able to compile them because it makes an unsafe assumption that saves the one additional register (over LLVM) it needs: it uses the sp+imm addressing mode for the memory operands, which leaves a register free for other expressions. But inline asm does not guarantee that sp+imm addressing is always used in such cases, so this code is not portable to other compilers. It's arguable that GCC only succeeds through behavior that could be viewed as a bug.

Recommended solutions:

  1. Convert the affected files to straight assembly. This seems to be the easiest path forward.
  2. If the files must be implemented in C, re-implement them with ACLE intrinsics (see the sketch after this list).
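For reference, a Solution 2 rewrite would express the kernel body with SVE ACLE intrinsics and leave register allocation entirely to the compiler. A minimal, hypothetical sketch of a single predicated multiply-accumulate step (not the actual BLIS microkernel) could look like this:

#include <arm_sve.h>
#include <stdint.h>

// Hypothetical Solution 2 sketch: one vector-length-agnostic update step
// written with ACLE intrinsics; the compiler picks all registers.
void fma_step(const double *a, const double *b, double *c, int64_t n)
{
    svbool_t pg = svwhilelt_b64_s64(0, n);   // predicate covering the active lanes
    svfloat64_t va = svld1_f64(pg, a);       // load a vector of A
    svfloat64_t vb = svld1_f64(pg, b);       // load a vector of B
    svfloat64_t vc = svld1_f64(pg, c);       // load the C accumulator
    vc = svmla_f64_m(pg, vc, va, vb);        // vc += va * vb
    svst1_f64(pg, c, vc);                    // store the result back
}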

@xrq-phys, @stepannassyr, thoughts?

Well great - I've actually been working on something that potentially uses inline asm even more heavily than any kernels I've manually written so far :D

With Solution 2, tight control over which architectural registers get used is lost, and it's also incompatible with my current work.

I think Solution 1 is the way to go, but it does mean some rework is required. I mostly use inline asm to avoid the assembler boilerplate around things like function definitions, but I'm also not a big fan of relying on bugs/undefined behaviour. In the context of the upstreamed code only a couple of files need to be converted to asm, but for my development code and future work quite a bit more would need reworking, so yeah... I agree this is the best solution, but I currently cannot invest the time.

@jlinford why can't the compiler push the function arguments onto the stack to save registers? (Actually, all of these already end up on the stack in order to be passed into the asm region.)

Alternatively, if we change some of the ASM variables to "r" constraints, would that allow the compiler to re-use those registers?

@jlinford FYI intrinsics are probably out because we have observed very poor instruction reordering in previous attempts (although those were on x86_64).

@devinamatthews I assume that was an ICC-specific issue. Was it?

I think it affected all of the compilers (gcc, clang, icc). Basically the problem is that when you unroll the inner loop, all of the loads get moved to the beginning (presumably to cover the longest possible latency), whereas we need them sprinkled in the right places across the iterations. There's no way to tell the compiler that we know we will be loading from L1.
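To illustrate with a hypothetical SVE-intrinsics fragment (not one of the kernels in question): the source interleaves each B load with the FMA that consumes it, but the compiler is free to hoist all of the loads to the top of the unrolled body, and there is no portable way to pin them where the programmer wants them.

#include <arm_sve.h>
#include <stdint.h>

// Hypothetical unrolled fragment: the intent is load/FMA/load/FMA/..., but the
// compiler may schedule all four svld1 calls first, since it assumes the loads
// are long-latency rather than known L1 hits.
void rank1_x4(const double *b, svfloat64_t a0,
              svfloat64_t c0, svfloat64_t c1, svfloat64_t c2, svfloat64_t c3,
              double *out)
{
    svbool_t pg = svptrue_b64();
    uint64_t vl = svcntd();                        // doubles per vector
    svfloat64_t b0 = svld1_f64(pg, b + 0 * vl);    // intended: load b0 ...
    c0 = svmla_f64_m(pg, c0, a0, b0);              // ... then immediately use it
    svfloat64_t b1 = svld1_f64(pg, b + 1 * vl);
    c1 = svmla_f64_m(pg, c1, a0, b1);
    svfloat64_t b2 = svld1_f64(pg, b + 2 * vl);
    c2 = svmla_f64_m(pg, c2, a0, b2);
    svfloat64_t b3 = svld1_f64(pg, b + 3 * vl);
    c3 = svmla_f64_m(pg, c3, a0, b3);
    svst1_f64(pg, out + 0 * vl, c0);               // write back the accumulators
    svst1_f64(pg, out + 1 * vl, c1);
    svst1_f64(pg, out + 2 * vl, c2);
    svst1_f64(pg, out + 3 * vl, c3);
}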

In fact bli_gemm_armsve256_asm_d8x8.c is from a previous commit by Linaro Ltd.
It's currently of no use at all.

The 3/sup kernels are experimental kernels for skinny matrices, but they are not yet used by any config either.
Removing the 3 affected files would allow compilation to pass.

@devinamatthews :

Alternatively, if we change some of the ASM variables to "r" constraints, would that allow the compiler to re-use those registers?

Does not seem to work with clang 12.0.5.
But we can set all input variables to "+r" and use them as ordinary GP registers in the asm, i.e. instead of something like:

ldr x0, %[a]
add x0, x0, x1
...
:: [a] "m" (a) : "x0", "x1"

we do:

add %[a], %[a], x1
...
: [a] "+r" (a) :: "x1"

For example, for the assembly in the problematic bli_gemm_armsve256_asm_d8x8.c, if we replace:

From    To
x0      %[aaddr]
x1      %[baddr]
x2      %[caddr]
x3      %[a_next]
x4      %[b_next]
x5      %[k_iter]
x6      %[k_left]
x7      %[alpha]
x8      %[beta]
x9      %[cs_c]
x13     %[rs_c]
and remove the clobbering of x[0-9] and x13, compilation passes with clang 12. Here's the file after replacement: bli_gemm_armsve256_asm_d8x8.c.zip

I guess this should be the most straightforward "Solution 3".
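As a self-contained illustration of this "Solution 3" pattern (a hypothetical toy loop, not the BLIS microkernel): the pointer and loop-count arguments are bound with "+r" so the compiler chooses their registers, and the clobber list only names what the asm body actually touches.

#include <stdint.h>

// Hypothetical toy kernel using the "+r" pattern: the compiler chooses the GP
// registers holding a, b and n, so those registers never appear in the clobber
// list; only the explicitly used scratch registers (v0, v1) are clobbered.
void scale_add(double *a, const double *b, int64_t n, double alpha)
{
    __asm__ volatile(
        " cbz    %[n], 2f               \n\t"
        "1:                             \n\t"
        " ldr    d0, [%[a]]             \n\t"  // load a[i]
        " ldr    d1, [%[b]], #8         \n\t"  // load b[i], post-increment b
        " fmadd  d0, d1, %d[alpha], d0  \n\t"  // a[i] += alpha * b[i]
        " str    d0, [%[a]], #8         \n\t"  // store a[i], post-increment a
        " subs   %[n], %[n], #1         \n\t"
        " b.ne   1b                     \n\t"
        "2:                             \n\t"
        : [a] "+r" (a), [b] "+r" (b), [n] "+r" (n)  // in/out GP regs, compiler-chosen
        : [alpha] "w" (alpha)                       // alpha lives in an FP/SIMD register
        : "v0", "v1", "memory", "cc");
}

The asm still hard-codes d0/d1 for its scratch values and clobbers them explicitly; only the operands bound with "+r"/"w" are left for the compiler to allocate, which is exactly the register-count slack the original "m"-operand version was missing.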

Yes, I like this solution. It still requires "baking in" the calling convention but AFAIK passing the first 8 args in registers is required on AArch64(?)

Is passing args in x8, x9, and x13 standard?

It's x0-x7 if I remember correctly.

I guess that regardless of the calling convention, we can just ensure that the number of clobbered x registers plus in/out parameters is <= 31, and the compiler should do its job. 🤔

Oh I see, you don't actually require the above mapping from args to registers, you just put them in "some register" and remove the clobbers for the now-unused explicit registers. I guess the compiler may still insist on reserving a register for e.g. k even though it is only used as k_iter/k_left?

Seems true for LLVM though it might be possible to infer that k is unused.

@jlinford what do you think of the solution of using "+r" register constraints for the variables? @xrq-phys I think you have tested this and it works, right? We probably should have been passing args in registers all along except that sometimes we want to wait to load a variable until after the AB product accumulation.

Yes. It's tested and proposed in #540 .

As a quick aside, the single largest performance hit we've seen from pushing the boundaries of intrinsics vs. assembly is that compilers will not do register blocking unless you constrain the entire function to use fewer variables than there are physical registers, which itself hurts performance. Register blocking is critical for fast matrix multiplication, and no compiler will do it because compilers don't know what's coming in the future. We've studied this in depth and found that register allocation is the biggest issue forcing the use of assembly language, with instruction ordering second, due to the load-placement issue Devin described. We've managed to hit 90% of peak with intrinsics - but only over narrow size ranges and cases - it's pretty brittle. Overall, we've found you can't count on more than 75% of peak unless you move to assembly language.

TBF, we didn't study the use of intrinsics on ARM or with armclang, but these issues seem pretty fundamental to how all compilers work. If C wanted to be a truly fast language, it would need to add a keyword like "force_register" and provide something like a block-level volatile where code wouldn't be reordered - it'd be taken literally.

One more observation that might be very important to John: even when we had intrinsics kernels getting performance similar to assembly language, they used an order of magnitude more power, pushing the cooling system to its limits. The reason is that the compiler, even with intrinsics, tends to issue more instructions and to use memory I/O less efficiently (for many reasons), while a well-running blis asm kernel barely strains the memory hierarchy. So for an energy-efficient platform, you'd definitely want to go with assembly over intrinsics.

I'm closing this because the specific problem is solved by #554. Going forward, I prefer @xrq-phys's "Solution 3".