[Winograd] Semantics and fusion changes for winograd ops
Max191 opened this issue
This issue is a tracker for landing several semantics and fusion changes to the linalg_ext.winograd
ops that have been experimentally tested while optimizing VAE on CPU.
## Fusion Improvements
Two new fusions with `linalg_ext.winograd.input_transform` ops were beneficial for VAE on CPU:

1. Fusion with consumer elementwise ops: This would generally not be helpful, since the consumer of a winograd input transform is usually a batch matmul op, but on CPU, `f32` contraction inputs are often demoted to `bf16` to reduce memory load and target fast microkernels. This means that elementwise fusion is important for CPU performance, and it is also the only consumer fusion case that should be needed.
2. Fusion with producer pad ops: Almost all `linalg_ext.winograd.input_transform` ops have a producer `tensor.pad` op, since they come from rewriting convolutions. This means there can be benefit in fusing the winograd op with the padding. On VAE, most `tensor.pad` ops become slow_memcpy dispatches unless fused with the winograd op. This fusion is useful, but still has some details to work out for pad codegen (CPU is probably inefficient, and it is untested on GPU).

These fusions should be enabled using the new LinalgExt fusion op interface introduced in #17428. (1) should be relatively simple, since it is just an elementwise fusion, so we can fuse as long as the indexing maps line up. (2) should also be somewhat simple, but the main difficulties are with codegen for `tensor.pad`.
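To make the `bf16` demotion in (1) concrete: a `bf16` value is an `f32` with the low 16 bits of the encoding dropped, which is why the demotion is a cheap elementwise op worth fusing with the input transform. A minimal Python sketch (illustration only, not IREE code; this truncating variant is for clarity, while real lowerings typically round to nearest even):

```python
import struct

def demote_to_bf16(x: float) -> float:
    """Truncating f32 -> bf16 demotion: keep only the top 16 bits of the
    f32 encoding (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Values representable in 7 mantissa bits survive unchanged; others lose
# low-order precision, e.g. pi becomes 3.140625.
print(demote_to_bf16(1.0), demote_to_bf16(3.14159265))
```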
## Semantics Changes
There are 3 changes to winograd op semantics that are necessary for fusions and better codegen:
1. Making the batch dimension optional: When the batch dimension is a unit dimension, the winograd op verifiers still require that dimension to be present. This leads to some difficulties for Flow-level fusions, since the unit dimensions are not properly folded in the winograd ops.
2. Allowing different input tile dimensions: The `linalg_ext.winograd.input_transform` op has an output shape like `tensor<TxTxNxHxWxC>`, where `T` is the input tile size. When the op is tiled and decomposed, the inner loop writes a result of size `TxT` to the first two outermost dimensions of the result tensor. This means the writes get unrolled and become scalarized, which was especially bad for CPU performance. Similarly, the input shape to the `linalg_ext.winograd.output_transform` op is `tensor<TxTxNxHxWxF>`, which results in scalarized loads. The semantics of the winograd ops will need to allow for a better layout of the transformed type, like `tensor<NxHxWxCxTxT>`.

   There is still a tradeoff, however, with loads on the batch_matmul following the input transform op. Since the `TxT` dimensions are collapsed to become the batch dimension of the batch_matmul, the batch dimension becomes innermost in the batch_matmul contraction. On CPU, this is okay because the inputs get packed beforehand anyway, and we can target ukernels for pack and unpack to optimize loads/stores for the bad layout. On GPU, it is still unclear whether this will be helpful overall, since loads can be parallelized much more. Typically, the input loads will be distributed to threads when promoting to shared memory in the matrix multiplication, so the cost of the bad batch_matmul layout may not be too high, but this has not been tested yet. Because of this target-specific difference, the winograd ops should support any layout of the `TxT` dimensions, and a new field called `input_tile_dimensions` will need to be added to the ops, containing the list of dimensions corresponding to the input tile in the transformed type.
3. Adding extract slice semantics to the output transform: The `linalg_ext.winograd.output_transform` op produces an output tensor that is padded along the image `H` and `W` dimensions, so when a `conv_2d` op is rewritten into winograd ops, a `tensor.extract_slice` op is needed. This extract_slice can sometimes become its own dispatch, which is unnecessary. A possible fix is to add extract slice semantics to the output transform op. However, there are cases where the extract_slice can be fused with the consumer op instead, and cases where extract slice semantics make efficient codegen more difficult for the winograd op, so it is still unclear whether extract_slice semantics should be part of the op.
All of the changes discussed above have been experimentally tested on CPU for VAE, and have prototype implementations on https://github.com/iree-org/iree/tree/shared/tresleches-cpu. I will send out PRs for the obvious changes first, but some of the ideas still need testing or codegen support on GPU backends.
## Padding Semantics
The `linalg_ext.winograd.input_transform` op has implicit padding on the input operand, which I have been rethinking after a discussion with @bjacob. Winograd input transform ops will generally have a producer `tensor.pad` op, but they also have implicit padding due to non-perfectly-sized input tensors. The following example shows this:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 1, 1] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x66x66xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded : tensor<16x66x66xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
The input transform op will get tiled along the `11x11x16` outer output dimensions, which will all get tiled to `1`. The input to the winograd op after tiling is read as an `8x8` slice extracted at offsets strided by 6 along each of the 2 inner dimensions. For example, the first read of the `tensor<16x66x66xf32>` will be at index `(0, 0, 0)` with size `1x8x8`, the second at `(0, 0, 6)` with size `1x8x8`, and so on. This means that the last read will be at index `(15, 60, 60)` with size `1x8x8`, which reads out of bounds and must be padded.
This padding is implicit in the `linalg_ext.winograd.input_transform` op, which means there is padding happening in both the producer `tensor.pad` op and the winograd op. Alternatively, the padding could be decoupled from the winograd op, and another `tensor.pad` could be added instead:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 1, 1] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x66x66xf32>
%padded2 = tensor.pad %padded low[0, 0, 0] high[0, 2, 2] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x66x66xf32> to tensor<16x68x68xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded2 : tensor<16x68x68xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
These two pads can then be composed into a single pad:
```mlir
%4 = tensor.empty() : tensor<11x11x16x8x8xf32>
%padded = tensor.pad %2 low[0, 1, 1] high[0, 3, 3] {
^bb0(%arg0: index, %arg1: index, %arg2: index):
  tensor.yield %cst : f32
} : tensor<16x64x64xf32> to tensor<16x68x68xf32>
%5 = iree_linalg_ext.winograd.input_transform
    output_tile_size(6) kernel_size(3)
    image_dimensions([1, 2]) input_tile_dimensions([3, 4])
    ins(%padded : tensor<16x68x68xf32>)
    outs(%4 : tensor<11x11x16x8x8xf32>) -> tensor<11x11x16x8x8xf32>
```
Leaving the padding separate from the winograd op means that we can decide whether to fuse all padding with either the winograd op or the producer of the pad, which is better than having padding in 2 different places.
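As a sanity check, composing two nested `tensor.pad` ops (with the same yielded constant) is just componentwise addition of their low/high amounts; a minimal sketch using the amounts from the example:

```python
def compose_pads(low1, high1, low2, high2):
    """Compose pad2(pad1(x)): total padding adds componentwise per dimension."""
    return ([a + b for a, b in zip(low1, low2)],
            [a + b for a, b in zip(high1, high2)])

# The two pads from the example: low [0,1,1]/high [0,1,1] from the conv
# rewrite, then low [0,0,0]/high [0,2,2] from the decoupled implicit padding.
low, high = compose_pads([0, 1, 1], [0, 1, 1], [0, 0, 0], [0, 2, 2])
print(low, high)  # total pad taking tensor<16x64x64xf32> to tensor<16x68x68xf32>
```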