flame / blis

BLAS-like Library Instantiation Software Framework

Inline ASM clobber list means no spare register for LLVM

jlinford opened this issue · comments

LLVM and Arm Compiler for Linux (ACfL) fail to compile BLIS for the armsve or a64fx configurations. The code also fails on upstream LLVM: https://godbolt.org/z/PPzhebnan

error: inline assembly requires more registers than available

Commenting out Line 27 makes LLVM/ACfL work.

In short, the problem seems to be that the inline assembly has a long clobber list and the function has many arguments. The more function arguments you pass into the asm as operands, the fewer registers you can include in the clobber list, and vice versa.
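To make the shape of the problem concrete, here is a stripped-down hypothetical (not the actual BLIS kernel; the operand names are only borrowed from it): several "m" operands, each of which needs a base register at its point of use inside the asm, combined with a clobber list that reserves nearly every general-purpose register.

void kernel_shape(void *a, void *b, void *c, long k_iter, long k_left,
                  double *alpha, double *beta, long rs_c, long cs_c)
{
    __asm__ volatile(
        // Each "m" operand referenced below must be addressable here, which
        // costs the compiler a spare GP register unless an sp+imm form is used.
        " ldr x0, %[aaddr] \n\t"
        " ldr x1, %[baddr] \n\t"
        " ldr x2, %[caddr] \n\t"
        /* ... microkernel body using x0-x28 and the vector registers ... */
        :
        : [aaddr] "m" (a), [baddr] "m" (b), [caddr] "m" (c),
          [k_iter] "m" (k_iter), [k_left] "m" (k_left),
          [alpha] "m" (alpha), [beta] "m" (beta),
          [rs_c] "m" (rs_c), [cs_c] "m" (cs_c)
        // With x0-x28 all clobbered, almost nothing is left for the compiler
        // to hold the operand addresses - this is the situation in which LLVM
        // reports "inline assembly requires more registers than available".
        : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9",
          "x10", "x11", "x12", "x13", "x14", "x15", "x16", "x17", "x18", "x19",
          "x20", "x21", "x22", "x23", "x24", "x25", "x26", "x27", "x28",
          "memory");
}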

Affected files:

Arm's LLVM and GNU compiler teams have looked into it and concluded that these files push the limits of reasonable use of inline asm. GCC is only able to compile them because it makes an unsafe assumption that saves the one additional register (over LLVM) it needs: it uses the sp+imm addressing mode for the memory operands, which leaves a register free for other expressions. But inline asm does not guarantee that sp+imm addressing is always used in such cases, so this code is not portable to other compilers. It's arguable that GCC only succeeds through behavior that could be viewed as a bug.

Recommended solutions:

  1. Convert the affected files to straight assembly. This seems to be the easiest path forward.
  2. If the files must be implemented in C, re-implement them with ACLE intrinsics (see the sketch after this list).
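For reference, a Solution 2 rewrite would express the kernel body with SVE ACLE intrinsics and leave register allocation entirely to the compiler. A minimal, hypothetical sketch of a single predicated multiply-accumulate step (not the actual BLIS microkernel) could look like this:

#include <arm_sve.h>
#include <stdint.h>

// Hypothetical Solution 2 sketch: one vector-length-agnostic update step
// written with ACLE intrinsics; the compiler picks all registers.
void fma_step(const double *a, const double *b, double *c, int64_t n)
{
    svbool_t pg = svwhilelt_b64_s64(0, n);   // predicate covering the active lanes
    svfloat64_t va = svld1_f64(pg, a);       // load a vector of A
    svfloat64_t vb = svld1_f64(pg, b);       // load a vector of B
    svfloat64_t vc = svld1_f64(pg, c);       // load the C accumulator
    vc = svmla_f64_m(pg, vc, va, vb);        // vc += va * vb
    svst1_f64(pg, c, vc);                    // store the result back
}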

@xrq-phys, @stepannassyr, thoughts?

Well great - I've actually been working on something that potentially uses inline asm even more heavily than any kernels I've manually written so far :D

With Solution 2, tight control over which architectural registers get used is lost, and it's also incompatible with my current work.

I think Solution 1 is the way to go, but it does mean some rework is required. I mostly use inline asm to avoid the assembler boilerplate around things like function definitions, but I'm also not a big fan of relying on bugs/undefined behaviour. In the context of the upstreamed code only a couple of files need to be converted to asm, but for my development code and future work quite a bit more would need reworking, so yeah... I agree this is the best solution, but I currently cannot invest the time.

@jlinford why can't the compiler push the function arguments onto the stack to save registers? (Actually, all of these already end up on the stack in order to be passed into the asm region.)

Alternatively, if we change some of the ASM variables to "r" constraints, would that allow the compiler to re-use those registers?

@jlinford FYI intrinsics are probably out because we have observed very poor instruction reordering in previous attempts (although those were on x86_64).

@devinamatthews I assume that was an ICC-specific issue. Was it?

I think it affected all of the compilers (gcc, clang, icc). Basically the problem is that when you unroll the inner loop, all of the loads get moved to the beginning (presumably to cover the longest possible latency), whereas we need them sprinkled in the right places across the iterations. There's no way to tell the compiler that we know we will be loading from L1.
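To illustrate with a hypothetical SVE-intrinsics fragment (not one of the kernels in question): the source interleaves each B load with the FMA that consumes it, but the compiler is free to hoist all of the loads to the top of the unrolled body, and there is no portable way to pin them where the programmer wants them.

#include <arm_sve.h>
#include <stdint.h>

// Hypothetical unrolled fragment: the intent is load/FMA/load/FMA/..., but the
// compiler may schedule all four svld1 calls first, since it assumes the loads
// are long-latency rather than known L1 hits.
void rank1_x4(const double *b, svfloat64_t a0,
              svfloat64_t c0, svfloat64_t c1, svfloat64_t c2, svfloat64_t c3,
              double *out)
{
    svbool_t pg = svptrue_b64();
    uint64_t vl = svcntd();                        // doubles per vector
    svfloat64_t b0 = svld1_f64(pg, b + 0 * vl);    // intended: load b0 ...
    c0 = svmla_f64_m(pg, c0, a0, b0);              // ... then immediately use it
    svfloat64_t b1 = svld1_f64(pg, b + 1 * vl);
    c1 = svmla_f64_m(pg, c1, a0, b1);
    svfloat64_t b2 = svld1_f64(pg, b + 2 * vl);
    c2 = svmla_f64_m(pg, c2, a0, b2);
    svfloat64_t b3 = svld1_f64(pg, b + 3 * vl);
    c3 = svmla_f64_m(pg, c3, a0, b3);
    svst1_f64(pg, out + 0 * vl, c0);               // write back the accumulators
    svst1_f64(pg, out + 1 * vl, c1);
    svst1_f64(pg, out + 2 * vl, c2);
    svst1_f64(pg, out + 3 * vl, c3);
}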

In fact bli_gemm_armsve256_asm_d8x8.c is from a previous commit by Linaro Ltd.
It's currently of no use at all.

The 3/sup kernels are experimental kernels for skinny matrices, but they are not yet used by any config either.
Removing the 3 affected files would allow compilation to pass.

@devinamatthews :

Alternatively, if we change some of the ASM variables to "r" constraints, would that allow the compiler to re-use those registers?

Does not seem to work with clang 12.0.5.
But we can set all input variables to "+r" and use them as ordinary GP registers in the asm, i.e. instead of something like:

ldr x0, %[a]
add x0, x0, x1
...
:: [a] "m" (a) : "x0", "x1"

we do:

add %[a], %[a], x1
...
: [a] "+r" (a) :: "x1"

For example, for the assembly in the problematic bli_gemm_armsve256_asm_d8x8.c, if we replace:

From    To
x0      %[aaddr]
x1      %[baddr]
x2      %[caddr]
x3      %[a_next]
x4      %[b_next]
x5      %[k_iter]
x6      %[k_left]
x7      %[alpha]
x8      %[beta]
x9      %[cs_c]
x13     %[rs_c]
and remove the clobbering of x[0-9] and x13, compilation passes with clang 12. Here's the file after replacement: bli_gemm_armsve256_asm_d8x8.c.zip

I guess this should be the most straightforward "Solution 3".
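As a self-contained illustration of this "Solution 3" pattern (a hypothetical toy loop, not the BLIS microkernel): the pointer and loop-count arguments are bound with "+r" so the compiler chooses their registers, and the clobber list only names what the asm body actually touches.

#include <stdint.h>

// Hypothetical toy kernel using the "+r" pattern: the compiler chooses the GP
// registers holding a, b and n, so those registers never appear in the clobber
// list; only the explicitly used scratch registers (v0, v1) are clobbered.
void scale_add(double *a, const double *b, int64_t n, double alpha)
{
    __asm__ volatile(
        " cbz    %[n], 2f               \n\t"
        "1:                             \n\t"
        " ldr    d0, [%[a]]             \n\t"  // load a[i]
        " ldr    d1, [%[b]], #8         \n\t"  // load b[i], post-increment b
        " fmadd  d0, d1, %d[alpha], d0  \n\t"  // a[i] += alpha * b[i]
        " str    d0, [%[a]], #8         \n\t"  // store a[i], post-increment a
        " subs   %[n], %[n], #1         \n\t"
        " b.ne   1b                     \n\t"
        "2:                             \n\t"
        : [a] "+r" (a), [b] "+r" (b), [n] "+r" (n)  // in/out GP regs, compiler-chosen
        : [alpha] "w" (alpha)                       // alpha lives in an FP/SIMD register
        : "v0", "v1", "memory", "cc");
}

The asm still hard-codes d0/d1 for its scratch values and clobbers them explicitly; only the operands bound with "+r"/"w" are left for the compiler to allocate, which is exactly the register-count slack the original "m"-operand version was missing.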

Yes, I like this solution. It still requires "baking in" the calling convention but AFAIK passing the first 8 args in registers is required on AArch64(?)

Is passing args in x8, x9, and x13 standard?

It's x0-x7 if I remember correctly.

I guess that regardless of the calling convention, we can just ensure that the number of clobbered x registers plus in/out parameters is <= 31, and the compiler should do its job. 🤔

Oh I see, you don't actually require the above mapping from args to registers, you just put them in "some register" and remove the clobbers for the now-unused explicit registers. I guess the compiler may still insist on reserving a register for e.g. k even though it is only used as k_iter/k_left?

Seems true for LLVM though it might be possible to infer that k is unused.

@jlinford what do you think of the solution of using "+r" register constraints for the variables? @xrq-phys I think you have tested this and it works, right? We probably should have been passing args in registers all along except that sometimes we want to wait to load a variable until after the AB product accumulation.

Yes. It's tested and proposed in #540 .

As a quick aside, the single largest performance hit we've seen from pushing the boundaries of intrinsics vs. assembly is that compilers will not do register blocking unless you constrain the entire function to use fewer variables than there are physical registers, which itself hurts performance. Register blocking is critical for fast matrix multiplication, and no compiler will do it because compilers don't know what's coming in the future. We've studied this in depth and found that register allocation is the biggest issue forcing the use of assembly language, with instruction ordering second, due to the load-placement issue Devin described. We've managed to hit 90% of peak with intrinsics - but only over narrow size ranges and cases - it's pretty brittle. Overall, we've found you can't count on more than 75% of peak unless you move to assembly language.

TBF, we didn't study the use of intrinsics on ARM or with armclang, but these issues seem pretty fundamental to how all compilers work. If C wanted to be a truly fast language, it would need to add a keyword like "force_register" and provide something like a block-level volatile where code wouldn't be reordered - it'd be taken literally.

One more observation that might be very important to John: even when we had intrinsics kernels getting performance similar to assembly language, they used an order of magnitude more power, pushing the cooling system to its limits. The reason is that the compiler, even with intrinsics, tends to issue more instructions and to use memory I/O less efficiently (for many reasons), while a well-running blis asm kernel barely strains the memory hierarchy. So for an energy-efficient platform, you'd definitely want to go with assembly over intrinsics.

I'm closing this because the specific problem is solved by #554. Going forward, I prefer @xrq-phys's "Solution 3".