[Winograd] Semantics and fusion changes for winograd ops
Max191 opened this issue
This issue is a tracker for landing several semantics and fusion changes to the linalg_ext.winograd
ops that have been experimentally tested while optimizing VAE on CPU.
## Fusion Improvements
Two new fusions with `linalg_ext.winograd.input_transform` ops were beneficial for VAE on CPU:

1. Fusion with consumer elementwise ops: This would generally not be helpful, since the consumer of a winograd input transform is usually a batch matmul op, but on CPU, `f32` contraction inputs are often demoted to `bf16` to reduce memory load and target fast microkernels. This means that elementwise fusion is important for CPU performance, and it is also the only consumer fusion case that should be needed.
2. Fusion with producer pad ops: Almost all `linalg_ext.winograd.input_transform` ops have a producer `tensor.pad` op, since they come from rewriting convolutions. This means there can be benefit in fusing the winograd op with the padding. On VAE, most `tensor.pad` ops become slow_memcpy dispatches unless fused with the winograd op. This fusion is useful, but still has some details to work out for pad codegen (CPU is probably inefficient, and it is untested on GPU).

These fusions should be enabled using the new LinalgExt fusion op interface introduced in #17428. (1) should be relatively simple, since it is just an elementwise fusion, so we can fuse as long as the indexing maps line up. (2) should also be somewhat simple, but the main difficulties are with codegen for `tensor.pad`.
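To make the `bf16` demotion in (1) concrete: a `bf16` value is an `f32` with the low 16 bits of the encoding dropped, which is why the demotion is a cheap elementwise op worth fusing with the input transform. A minimal Python sketch (illustration only, not IREE code; this truncating variant is for clarity, while real lowerings typically round to nearest even):

```python
import struct

def demote_to_bf16(x: float) -> float:
    """Truncating f32 -> bf16 demotion: keep only the top 16 bits of the
    f32 encoding (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Values representable in 7 mantissa bits survive unchanged; others lose
# low-order precision, e.g. pi becomes 3.140625.
print(demote_to_bf16(1.0), demote_to_bf16(3.14159265))
```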
## Semantics Changes
There are 3 changes to winograd op semantics that are necessary for fusions and better codegen:
1. Making the batch dimension optional: When the batch dimension is a unit dimension, the winograd op verifiers still require that dimension to be present. This leads to some difficulties for Flow-level fusions, since the unit dimensions are not properly folded in the winograd ops.
2. Allowing different input tile dimensions: The `linalg_ext.winograd.input_transform` op has an output shape like `tensor<TxTxNxHxWxC>`, where `T` is the input tile size. When the op is tiled and decomposed, the inner loop writes a result of size `TxT` to the first two outermost dimensions of the result tensor. This means the writes get unrolled and become scalarized, which was especially bad for CPU performance. Similarly, the input shape to the `linalg_ext.winograd.output_transform` op is `tensor<TxTxNxHxWxF>`, which results in scalarized loads. The semantics of the winograd ops will need to allow for a better layout of the transformed type, like `tensor<NxHxWxCxTxT>`.

   There is still a tradeoff, however, with loads on the batch_matmul following the input transform op. Since the `TxT` dimensions are collapsed to become the batch dimension of the batch_matmul, the batch dimension becomes innermost in the batch_matmul contraction. On CPU, this is okay because the inputs get packed beforehand anyway, and we can target ukernels for pack and unpack to optimize loads/stores for the bad layout. On GPU, it is still unclear whether this will be helpful overall, since loads can be parallelized much more. Typically, the input loads will be distributed to threads when promoting to shared memory in the matrix multiplication, so the cost of the bad batch_matmul layout may not be too high, but this has not been tested yet. Because of this target-specific difference, the winograd ops should support any layout of the `TxT` dimensions, and a new field called `input_tile_dimensions` will need to be added to the ops, containing the list of dimensions corresponding to the input tile in the transformed type.
3. Adding extract slice semantics to the output transform: The `linalg_ext.winograd.output_transform` op produces an output tensor that is padded along the image `H` and `W` dimensions, so when a `conv_2d` op is rewritten into winograd ops, a `tensor.extract_slice` op is needed. This extract_slice can sometimes become its own dispatch, which is unnecessary. A possible fix is to add extract slice semantics to the output transform op. However, there are cases where the extract_slice can be fused with the consumer op instead, and cases where extract slice semantics make efficient codegen more difficult for the winograd op, so it is still unclear whether extract_slice semantics should be part of the op.
All of the changes discussed above have been experimentally tested on CPU for VAE, and have prototype implementations on https://github.com/iree-org/iree/tree/shared/tresleches-cpu. I will send out PRs for the obvious changes first, but some of the ideas still need testing or codegen support on GPU backends.
## Padding Semantics
The `linalg_ext.winograd.input_transform` op has implicit padding on the input operand, which I have been rethinking after a discussion with @bjacob. Winograd input transform ops will generally have a producer `tensor.pad` op, but they also have implicit padding due to non-perfectly-sized input tensors. The following example shows this:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 1, 1] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x66x66xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded : tensor<16x66x66xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
The input transform op will get tiled along the `11x11x16` outer output dimensions, which will all get tiled to `1`. The input to the winograd op after tiling is read as an `8x8` slice extracted at offsets strided by 6 along each of the 2 inner dimensions. For example, the first read of the `tensor<16x66x66xf32>` will be at index `(0, 0, 0)` with size `1x8x8`, the second at `(0, 0, 6)` with size `1x8x8`, and so on. This means that the last read will be at index `(15, 60, 60)` with size `1x8x8`, which reads out of bounds and must be padded.
This padding is implicit in the `linalg_ext.winograd.input_transform` op, which means there is padding happening in both the producer `tensor.pad` op and the winograd op. Alternatively, the padding could be decoupled from the winograd op, and another `tensor.pad` could be added instead:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 1, 1] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x66x66xf32>
%padded2 = tensor.pad %padded low[0, 0, 0] high[0, 2, 2] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x66x66xf32> to tensor<16x68x68xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded2 : tensor<16x68x68xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
These two pads can then be composed into a single pad:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 3, 3] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x68x68xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded : tensor<16x68x68xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
Leaving the padding separate from the winograd op means that we can decide whether to fuse all padding with either the winograd op or the producer of the pad, which is better than having padding in 2 different places.
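As a sanity check, composing two nested `tensor.pad` ops (with the same yielded constant) is just componentwise addition of their low/high amounts; a minimal sketch using the amounts from the example:

```python
def compose_pads(low1, high1, low2, high2):
    """Compose pad2(pad1(x)): total padding adds componentwise per dimension."""
    return ([a + b for a, b in zip(low1, low2)],
            [a + b for a, b in zip(high1, high2)])

# The two pads from the example: low [0,1,1]/high [0,1,1] from the conv
# rewrite, then low [0,0,0]/high [0,2,2] from the decoupled implicit padding.
low, high = compose_pads([0, 1, 1], [0, 1, 1], [0, 0, 0], [0, 2, 2])
print(low, high)  # total pad taking tensor<16x64x64xf32> to tensor<16x68x68xf32>
```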