[DT][CPU] ConstEval folding on quantized matmul with data-tiling
Max191 opened this issue
I am not getting ConstEval to fold away the transpose and packing of the constant weights. Here is the IR that I am working with: https://gist.github.com/Max191/0b515716d65478d0e9fad83673c5a616
#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  util.global private @cst = #util.byte_pattern<1> : tensor<11008x32x128xi4>
  util.global private mutable @global_seed = #util.byte_pattern<2> : tensor<i64>
  func.func @transpose_extend_batch_matmul(%arg0: tensor<32x128xi16>) -> tensor<11008x32xi32> {
    %cst = util.global.load @cst : tensor<11008x32x128xi4>
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<11008x32xi32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<11008x32xi32>) -> tensor<11008x32xi32>
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %cst : tensor<32x128xi16>, tensor<11008x32x128xi4>) outs(%1 : tensor<11008x32xi32>) {
    ^bb0(%in: i16, %in_0: i4, %out: i32):
      %3 = arith.extsi %in : i16 to i32
      %4 = arith.extui %in_0 : i4 to i32
      %5 = arith.muli %3, %4 : i32
      %6 = arith.addi %5, %out : i32
      linalg.yield %6 : i32
    } -> tensor<11008x32xi32>
    return %2 : tensor<11008x32xi32>
  }
}
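For reference, the linalg.generic above is a contraction that sign-extends the i16 activations, zero-extends the i4 weights, and multiply-accumulates into i32 over the d2 reduction dimension. A NumPy sketch of the same semantics (shapes are shrunk from 11008x32x128 purely for illustration):

```python
import numpy as np

# Shrunken stand-ins for (11008, 32, 128) in the issue's IR.
N, B, K = 4, 3, 5

rng = np.random.default_rng(0)
arg0 = rng.integers(-32768, 32767, size=(B, K), dtype=np.int16)  # tensor<32x128xi16>
cst = rng.integers(0, 16, size=(N, B, K), dtype=np.uint8)        # i4 weights (0..15)

# linalg.fill with 0 : i32, then the ^bb0 body:
# extsi(i16 -> i32) * extui(i4 -> i32), accumulated over d2.
acc = np.zeros((N, B), dtype=np.int32)
for d0 in range(N):
    for d1 in range(B):
        for d2 in range(K):
            acc[d0, d1] += np.int32(arg0[d1, d2]) * np.int32(cst[d0, d1, d2])

# Same result via einsum over the indexing maps: (d1,d2) x (d0,d1,d2) -> (d0,d1).
ref = np.einsum("bk,nbk->nb", arg0.astype(np.int64), cst.astype(np.int64)).astype(np.int32)
assert np.array_equal(acc, ref)
```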
This is the iree-compile command:
iree-compile \
--iree-input-type=none \
--iree-vm-bytecode-module-output-format=flatbuffer-binary \
--iree-hal-target-backends=llvm-cpu \
--iree-llvmcpu-target-cpu-features=host \
--iree-llvmcpu-target-triple=x86_64-linux-gnu \
--iree-llvmcpu-enable-microkernels \
--iree-stream-resource-index-bits=64 \
--iree-vm-target-index-bits=64 \
--iree-vm-bytecode-module-strip-source-map=true \
--iree-util-zero-fill-elided-attrs \
--iree-vm-target-truncate-unsupported-floats \
--iree-codegen-check-ir-before-llvm-conversion=false \
--iree-opt-const-expr-hoisting=False \
--iree-opt-data-tiling \
--mlir-print-ir-after-all \
--iree-consteval-jit-debug \
--mlir-disable-threading \
-o transpose_extend_batch_matmul.vmfb \
transpose_extend_batch_matmul.mlir \
2> dump.mlir
Use the following branches for LLVM and IREE:
https://github.com/Max191/llvm-project/tree/quantized-matmul-data-tiling-ukernel-test-branch (can just pick the last commit)
https://github.com/Max191/iree/tree/new-quantized-ukernels-codegen
#util.byte_pattern<1> will never fold - you have to use an actual constant tensor (dense<...>)
(byte patterns are ways of saying "I want this exact pattern of bytes on disk" and the compiler just treats them as opaque blobs)
Thanks Max!
Ben, I will update the IR.
I'll also add that for sub-byte types, it would be good to prevent const-eval from generating constant tensors with sub-byte element types in MLIR. Instead, we can store an i8 or i32 tensor after const-eval and bitcast it to the appropriate tensor type. This will avoid the unpacking/repacking overhead that we've had to deal with for sub-byte constants thus far.
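A minimal NumPy sketch of that storage idea (the helper names here are hypothetical, just to illustrate the round trip): keep two i4 values packed per i8 byte after const-eval, and reinterpret on use instead of materializing an unpacked i4 tensor.

```python
import numpy as np

def pack_i4_to_i8(vals):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first.
    Hypothetical helper mirroring the 'store as i8, bitcast to i4' idea."""
    vals = np.asarray(vals, dtype=np.uint8)
    assert vals.size % 2 == 0 and int(vals.max()) < 16
    return (vals[0::2] | (vals[1::2] << 4)).astype(np.uint8)

def unpack_i8_to_i4(packed):
    """Inverse view: recover the i4 elements from the packed i8 buffer."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

weights = np.array([1, 15, 0, 7, 3, 9], dtype=np.uint8)  # logical i4 values
packed = pack_i4_to_i8(weights)          # half the bytes; what const-eval would store
assert np.array_equal(unpack_i8_to_i4(packed), weights)  # the reinterpretation is lossless
```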
good idea - may be worth adding that casting intentionally too
Yep! This was the main reason I pushed on the bitcast op. Packing -> unpacking -> repacking large weight constants once we inevitably wanted consteval for data tiling was going to be terrible for compiler performance.
I put together a draft of my idea in #15558, although it is still "contingent" on codegen support for these expression trees
(it kind of isn't, given that we don't support sub-byte stores in codegen to begin with; initializer or not, we will fail on such expression trees. Still, it makes sense not to flip the flag while this remains unsupported, I guess.)
Here is the mlir file I used for const-eval:
#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @transpose_extend_batch_matmul(%arg0: tensor<32x128xi16>) -> tensor<11008x32xi32> {
    %cst = arith.constant dense<1> : tensor<11008x32x128xi4>
    %c0_i32 = arith.constant 0 : i32
    %0 = tensor.empty() : tensor<11008x32xi32>
    %1 = linalg.fill ins(%c0_i32 : i32) outs(%0 : tensor<11008x32xi32>) -> tensor<11008x32xi32>
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %cst : tensor<32x128xi16>, tensor<11008x32x128xi4>) outs(%1 : tensor<11008x32xi32>) {
    ^bb0(%in: i16, %in_0: i4, %out: i32):
      %3 = arith.extsi %in : i16 to i32
      %4 = arith.extui %in_0 : i4 to i32
      %5 = arith.muli %3, %4 : i32
      %6 = arith.addi %5, %out : i32
      linalg.yield %6 : i32
    } -> tensor<11008x32xi32>
    return %2 : tensor<11008x32xi32>
  }
}
After skimming through the log, I think there are two issues we need to fix:
- The generic op is not vectorized: #15574
- Be able to fold memref.subview away before sub-byte emulation: #15575

The latter is blocking codegen for sub-byte store emulation.
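For context on why sub-byte stores need emulation at all: an i4 store into a byte-addressed buffer has to become a read-modify-write on the containing i8 word. A Python sketch of that emulation (the helper name is illustrative, not IREE's actual API):

```python
import numpy as np

def emulated_i4_store(buf, index, value):
    """Store a 4-bit value at logical element `index` of an i4 array backed
    by a uint8 buffer: read the containing byte, preserve the other nibble,
    and write the byte back (a read-modify-write). Illustrative helper only."""
    assert 0 <= value < 16
    byte, nibble = divmod(index, 2)          # two i4 elements per byte, low nibble first
    keep = 0xF0 if nibble == 0 else 0x0F     # bits of the byte to preserve
    shifted = value if nibble == 0 else value << 4
    buf[byte] = (int(buf[byte]) & keep) | shifted  # the read-modify-write

backing = np.zeros(2, dtype=np.uint8)  # holds 4 logical i4 elements
for i, v in enumerate([5, 12, 1, 8]):
    emulated_i4_store(backing, i, v)
# element 0 -> low nibble of byte 0, element 1 -> high nibble of byte 0, etc.
assert int(backing[0]) == (5 | (12 << 4)) and int(backing[1]) == (1 | (8 << 4))
```

The read-modify-write is why folding memref.subview first matters: emulation needs a flat byte offset into the backing buffer, which a subview obscures.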