[DT][CPU] ConstEval folding on quantized matmul with data-tiling
Max191 opened this issue
I am not getting ConstEval to fold away the transpose and packing of the constant weights. Here is the IR that I am working with: https://gist.github.com/Max191/0b515716d65478d0e9fad83673c5a616
#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  util.global private @cst = #util.byte_pattern<1> : tensor<11008x32x128xi4>
  util.global private mutable @global_seed = #util.byte_pattern<2> : tensor<i64>
  func.func @transpose_extend_batch_matmul(%arg0: tensor<32x128xi16>) -> tensor<11008x32xi32> {
    %cst = util.global.load @cst : tensor<11008x32x128xi4>
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<11008x32xi32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<11008x32xi32>) -> tensor<11008x32xi32>
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %cst : tensor<32x128xi16>, tensor<11008x32x128xi4>) outs(%1 : tensor<11008x32xi32>) {
    ^bb0(%in: i16, %in_0: i4, %out: i32):
      %3 = arith.extsi %in : i16 to i32
      %4 = arith.extui %in_0 : i4 to i32
      %5 = arith.muli %3, %4 : i32
      %6 = arith.addi %5, %out : i32
      linalg.yield %6 : i32
    } -> tensor<11008x32xi32>
    return %2 : tensor<11008x32xi32>
  }
}
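For reference, the linalg.generic above is a contraction that sign-extends the i16 activations, zero-extends the i4 weights, and multiply-accumulates into i32 over the d2 reduction dimension. A NumPy sketch of the same semantics (shapes are shrunk from 11008x32x128 purely for illustration):

```python
import numpy as np

# Shrunken stand-ins for (11008, 32, 128) in the issue's IR.
N, B, K = 4, 3, 5

rng = np.random.default_rng(0)
arg0 = rng.integers(-32768, 32767, size=(B, K), dtype=np.int16)  # tensor<32x128xi16>
cst = rng.integers(0, 16, size=(N, B, K), dtype=np.uint8)        # i4 weights (0..15)

# linalg.fill with 0 : i32, then the ^bb0 body:
# extsi(i16 -> i32) * extui(i4 -> i32), accumulated over d2.
acc = np.zeros((N, B), dtype=np.int32)
for d0 in range(N):
    for d1 in range(B):
        for d2 in range(K):
            acc[d0, d1] += np.int32(arg0[d1, d2]) * np.int32(cst[d0, d1, d2])

# Same result via einsum over the indexing maps: (d1,d2) x (d0,d1,d2) -> (d0,d1).
ref = np.einsum("bk,nbk->nb", arg0.astype(np.int64), cst.astype(np.int64)).astype(np.int32)
assert np.array_equal(acc, ref)
```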
This is the iree-compile command:
iree-compile \
--iree-input-type=none \
--iree-vm-bytecode-module-output-format=flatbuffer-binary \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu-features=host \
--iree-llvmcpu-target-triple=x86_64-linux-gnu \
--iree-llvmcpu-enable-microkernels \
--iree-stream-resource-index-bits=64 \
--iree-vm-target-index-bits=64 \
--iree-vm-bytecode-module-strip-source-map=true \
--iree-util-zero-fill-elided-attrs \
--iree-vm-target-truncate-unsupported-floats \
--iree-codegen-check-ir-before-llvm-conversion=false \
--iree-opt-const-expr-hoisting=False \
--iree-opt-data-tiling \
--mlir-print-ir-after-all \
--iree-consteval-jit-debug \
--mlir-disable-threading \
-o transpose_extend_batch_matmul.vmfb \
transpose_extend_batch_matmul.mlir \
2> dump.mlir
Use the following branches for LLVM and IREE:
https://github.com/Max191/llvm-project/tree/quantized-matmul-data-tiling-ukernel-test-branch (can just pick the last commit)
https://github.com/Max191/iree/tree/new-quantized-ukernels-codegen
#util.byte_pattern<1> will never fold - you have to use an actual constant tensor (dense<...>)
(byte patterns are ways of saying "I want this exact pattern of bytes on disk" and the compiler just treats them as opaque blobs)
Thanks Max!
Ben, I will update the IR.
I'll also add that for sub-byte types, it would be good to prevent const-eval from generating constant tensors with sub-byte element types in MLIR. Instead, we can store an i8 or i32 tensor after const-eval and bitcast it to the appropriate tensor type. This will avoid the unpacking/repacking overhead that we've had to deal with for sub-byte constants thus far.
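A minimal NumPy sketch of that storage idea (the helper names here are hypothetical, just to illustrate the round trip): keep two i4 values packed per i8 byte after const-eval, and reinterpret on use instead of materializing an unpacked i4 tensor.

```python
import numpy as np

def pack_i4_to_i8(vals):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first.
    Hypothetical helper mirroring the 'store as i8, bitcast to i4' idea."""
    vals = np.asarray(vals, dtype=np.uint8)
    assert vals.size % 2 == 0 and int(vals.max()) < 16
    return (vals[0::2] | (vals[1::2] << 4)).astype(np.uint8)

def unpack_i8_to_i4(packed):
    """Inverse view: recover the i4 elements from the packed i8 buffer."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

weights = np.array([1, 15, 0, 7, 3, 9], dtype=np.uint8)  # logical i4 values
packed = pack_i4_to_i8(weights)          # half the bytes; what const-eval would store
assert np.array_equal(unpack_i8_to_i4(packed), weights)  # the reinterpretation is lossless
```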
good idea - may be worth adding that casting intentionally too
Yep! This was the main reason I pushed on the bitcast op. Packing -> unpacking -> repacking large weight constants once we inevitably wanted consteval for data tiling was going to be terrible for compiler performance.
I put together a draft of my idea in #15558, although it is still "contingent" on codegen support for these expression trees
(it kind of isn't, given that we don't support sub-byte stores in codegen to begin with; initializer or not, we will fail on such expression trees. Still, it makes sense not to flip the flag while this remains unsupported, I guess.)
Here is the mlir file I used for const-eval:
#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @transpose_extend_batch_matmul(%arg0: tensor<32x128xi16>) -> tensor<11008x32xi32> {
    %cst = arith.constant dense<1> : tensor<11008x32x128xi4>
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<11008x32xi32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<11008x32xi32>) -> tensor<11008x32xi32>
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %cst : tensor<32x128xi16>, tensor<11008x32x128xi4>) outs(%1 : tensor<11008x32xi32>) {
    ^bb0(%in: i16, %in_0: i4, %out: i32):
      %3 = arith.extsi %in : i16 to i32
      %4 = arith.extui %in_0 : i4 to i32
      %5 = arith.muli %3, %4 : i32
      %6 = arith.addi %5, %out : i32
      linalg.yield %6 : i32
    } -> tensor<11008x32xi32>
    return %2 : tensor<11008x32xi32>
  }
}
After skimming through the log, I think there are two issues we need to fix:
- The generic op is not vectorized: #15574
- Be able to fold memref.subview away before sub-byte emulation: #15575

The latter is blocking codegen for sub-byte store emulation.
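For context on why sub-byte stores need emulation at all: an i4 store into a byte-addressed buffer has to become a read-modify-write on the containing i8 word. A Python sketch of that emulation (the helper name is illustrative, not IREE's actual API):

```python
import numpy as np

def emulated_i4_store(buf, index, value):
    """Store a 4-bit value at logical element `index` of an i4 array backed
    by a uint8 buffer: read the containing byte, preserve the other nibble,
    and write the byte back (a read-modify-write). Illustrative helper only."""
    assert 0 <= value < 16
    byte, nibble = divmod(index, 2)          # two i4 elements per byte, low nibble first
    keep = 0xF0 if nibble == 0 else 0x0F     # bits of the byte to preserve
    shifted = value if nibble == 0 else value << 4
    buf[byte] = (int(buf[byte]) & keep) | shifted  # the read-modify-write

backing = np.zeros(2, dtype=np.uint8)  # holds 4 logical i4 elements
for i, v in enumerate([5, 12, 1, 8]):
    emulated_i4_store(backing, i, v)
# element 0 -> low nibble of byte 0, element 1 -> high nibble of byte 0, etc.
assert int(backing[0]) == (5 | (12 << 4)) and int(backing[1]) == (1 | (8 << 4))
```

The read-modify-write is why folding memref.subview first matters: emulation needs a flat byte offset into the backing buffer, which a subview obscures.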