iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/


Make tensors flat at boundaries / make row-major indexing explicit during mid-codegen

krzysz00 opened this issue

A change I recently made in rocMLIR (one that gave us statistically significant performance gains on various workloads, mainly convolutions) was, in IREE's terms, to make all tensors 1-D and then tensor.expand_shape them out to their logical shape.
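To illustrate, here's a minimal sketch of what that looks like in upstream MLIR syntax (the output_shape form of tensor.expand_shape is from recent MLIR, so the exact spelling may differ by version):

```mlir
func.func @flat_at_boundary(%flat: tensor<73728xf32>) -> tensor<64x128x3x3xf32> {
  // The boundary value is 1-D (64 * 128 * 3 * 3 = 73728 elements); the logical
  // 4-D shape is reintroduced explicitly, keeping the row-major layout visible.
  %image = tensor.expand_shape %flat [[0, 1, 2, 3]] output_shape [64, 128, 3, 3]
      : tensor<73728xf32> into tensor<64x128x3x3xf32>
  return %image : tensor<64x128x3x3xf32>
}
```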

The advantage of this is that it makes the final row-major indexing math visible to codegen and optimization passes.

For example, if you need to compute %n, %c, %h, %w = apply <(d0, d1) -> (d0 / 9, d1, (d0 % 9) / 3, d0 % 3)>(...) for im2col reasons, indexing %image[%n, %c, %h, %w] : tensor<64x128x3x3xf32> hides the fact that you're about to apply the row-major layout map (d0, d1, d2, d3) -> (d3 + 3 * (d2 + 3 * (d1 + 128 * d0))). Those two maps should be composed to give the simpler indexing logic (d0, d1) -> (d0 % 9 + 9 * (d1 + 128 * (d0 / 9))), since d0 % 3 + 3 * ((d0 % 9) / 3) folds to d0 % 9. That's a simplification that's quite difficult to do late in the pipeline / in LLVM.
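A rough before/after sketch of that composition (the function and the %i / %c operands are hypothetical stand-ins for the im2col loop structure, with / and % written as MLIR's floordiv and mod):

```mlir
func.func @im2col_indexing(%image: tensor<64x128x3x3xf32>,
                           %flat: tensor<73728xf32>,
                           %i: index, %c: index) -> (f32, f32) {
  // 4-D form: the row-major layout map is implicit in the extract, so the
  // div/mod structure below stays opaque to later simplification.
  %n = affine.apply affine_map<(d0) -> (d0 floordiv 9)>(%i)
  %h = affine.apply affine_map<(d0) -> ((d0 mod 9) floordiv 3)>(%i)
  %w = affine.apply affine_map<(d0) -> (d0 mod 3)>(%i)
  %v0 = tensor.extract %image[%n, %c, %h, %w] : tensor<64x128x3x3xf32>

  // 1-D form: composing with the layout map and folding
  // d0 mod 3 + 3 * ((d0 mod 9) floordiv 3) into d0 mod 9 leaves one apply.
  %idx = affine.apply
      affine_map<(d0, d1) -> (d0 mod 9 + 9 * (d1 + 128 * (d0 floordiv 9)))>(%i, %c)
  %v1 = tensor.extract %flat[%idx] : tensor<73728xf32>
  return %v0, %v1 : f32, f32
}
```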

Therefore, somewhere in the middle of the lowering flow (I figure y'all will have a better idea of exactly where), there should be a pass that makes the row-major nature of tensors explicit.

(This also very much applies to shared memory allocations: they should be 1-D buffers that are expand_shape'd to whatever shape they logically need to be. I made that change in rocMLIR years back and found meaningful performance gains from it.)
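For shared memory, that might look something like the sketch below (sizes are made up, and #gpu.address_space<workgroup> is one common way to mark LDS/shared allocations; the details will vary with the codegen pipeline):

```mlir
func.func @shared_tile() {
  // Allocate shared memory as a flat 1-D buffer (128 * 36 = 4608 elements)...
  %buf = memref.alloc() : memref<4608xf32, #gpu.address_space<workgroup>>
  // ...then expand it to the 2-D tile shape the kernel logically wants, so
  // row-major indexing into the allocation stays explicit for later passes.
  %tile = memref.expand_shape %buf [[0, 1]] output_shape [128, 36]
      : memref<4608xf32, #gpu.address_space<workgroup>>
      into memref<128x36xf32, #gpu.address_space<workgroup>>
  return
}
```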

ROCm/rocMLIR#1466 is our version of this (I can dig up the exact perf numbers if they're of interest)