openxla / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

[CPU] Missing ukernel for f32, i8 -> f32 mmt4d?

dcaballe opened this issue

This dispatch shows up after all the recent DT fixes. When we enable ukernels (all), it doesn't lower to a ukernel but to scalar code. I guess we don't have ukernels for f32, i8 -> f32 mmt4d? Could we add them for x86 and ARM?

It would also be a good idea to fall back to the right codegen path when a ukernel is not available/found.

Thanks!

```mlir
hal.executable public @main_dispatch_1187 {
  hal.executable.variant public @system_elf_x86_64 target(<"llvm-cpu", "system-elf-x86_64", {cpu = "cascadelake", cpu_features = "+cmov,+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vnni,+adx,+clflushopt,+clwb,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+pku,+prfchw,+rdrnd,+rdseed,+sahf,+x87,+xsave,+xsavec,+xsaveopt,+xsaves,+evex512", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", link_embedded = false, native_vector_size = 64 : index, target_triple = "x86_64-unknown-linux-gnu", ukernels = "all"}>) {
    hal.executable.export public @main_dispatch_1187_mmt4d_1x600x256x1x16x1_f32xi8xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @main_dispatch_1187_mmt4d_1x600x256x1x16x1_f32xi8xf32() {
        %c128 = arith.constant 128 : index
        %c6272 = arith.constant 6272 : index
        %cst = arith.constant dense<0> : tensor<600x256x16x1xi8>
        %cst_0 = arith.constant 0.000000e+00 : f32
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x256x1x1xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c6272) : !flow.dispatch.tensor<writeonly:tensor<1x600x1x16xf32>>
        %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [1, 256, 1, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x256x1x1xf32>> -> tensor<1x256x1x1xf32>
        %3 = tensor.empty() : tensor<1x600x1x16xf32>
        %4 = linalg.fill ins(%cst_0 : f32) outs(%3 : tensor<1x600x1x16xf32>) -> tensor<1x600x1x16xf32>
        %5 = linalg.mmt4d ins(%2, %cst : tensor<1x256x1x1xf32>, tensor<600x256x16x1xi8>) outs(%4 : tensor<1x600x1x16xf32>) -> tensor<1x600x1x16xf32>
        flow.dispatch.tensor.store %5, %1, offsets = [0, 0, 0, 0], sizes = [1, 600, 1, 16], strides = [1, 1, 1, 1] : tensor<1x600x1x16xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x600x1x16xf32>>
        return
      }
    }
  }
}
```

The first problem here is that we did materialize an encoding for the [f32, i8, f32] combination of element types, which is not meant to be supported.

I can see the bug in the CPUMaterializeEncoding logic --- it was assuming that if the output element type was floating-point, then the lhs/rhs element types would of course be floating-point too... that assumption is defeated here.

Fixing that will cause this matmul to revert to non-data-tiling. We can then discuss from scratch what should happen to it:

  1. If you don't care too much about this use case, you can leave it as-is on the non-data-tiling path.
  2. If you do care very much about this use case, then it's worth asking why this is trying to multiply f32's by i8's at all. We recently had a similar situation in some Llama2 quantized models and we found that we could quantize the f32's on the LHS into integers and then benefit from better-performing standard integer matmuls. This is our motivation for the si16 x ui4 code paths in data-tiling and ukernels (we quantized the f32's to si16 on the LHS, and the i8's were further shrunk to ui4's on the RHS). So, it's best to start by asking these kinds of high-level questions.
  3. If you care about this use case but for some reason have to leave it as f32xi8, a reasonable route would be to rewrite it with a sitofp on the RHS from i8 to f32, letting it go down the pure f32 matmul path, which would benefit from optimizations such as data-tiling that are already in place (see the sketch after this list).
  4. If you really care very much about the extra performance from avoiding the separate i8->f32 traversal, then we can consider having a dedicated data-tiling case for f32xi8 and a dedicated ukernel for it. But keep in mind that ukernels are a performance/engineering-cost trade-off, so there's only a narrow window where option 4 is the best route -- if you care just a little more, go with option 2; if you care a little less, go with option 3.
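
For concreteness, here is a minimal sketch of option 3. The shapes (M=1, K=256, N=9600) and value names are assumptions chosen to roughly match the un-tiled form of the dispatch above, not taken from the actual model: the i8 RHS is promoted to f32 up front so the contraction becomes a plain f32 matmul that the existing data-tiling path already handles.

```mlir
// Sketch only; shapes and names are illustrative.
func.func @promote_rhs_then_matmul(%lhs: tensor<1x256xf32>,
                                   %rhs_i8: tensor<256x9600xi8>,
                                   %acc: tensor<1x9600xf32>) -> tensor<1x9600xf32> {
  // Elementwise i8 -> f32 promotion of the quantized RHS ("sitofp on the RHS").
  %rhs_f32 = arith.sitofp %rhs_i8 : tensor<256x9600xi8> to tensor<256x9600xf32>
  // Pure f32 x f32 matmul: eligible for the existing f32 data-tiling path.
  %0 = linalg.matmul ins(%lhs, %rhs_f32 : tensor<1x256xf32>, tensor<256x9600xf32>)
                     outs(%acc : tensor<1x9600xf32>) -> tensor<1x9600xf32>
  return %0 : tensor<1x9600xf32>
}
```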

Let me run some numbers after integrating the latest changes so that I can tell you how much we care about this matmul flavor. Regarding changes at the model level, perhaps @mariecwhite or @phoenix-meadowlark can advise on that.

There are also a couple of points that need addressing and seem more critical than what is stated above:

  1. We should decouple DT from UK. I don't see a reason not to apply DT to the f32, i8 -> f32 matmul, at least when UK is not enabled. DT should just work for those cases and we shouldn't have any problem generating code for this case or any other. If possible, we should enable DT for all cases when UK is disabled.
  2. We should probably unify the UK path with the code generation one. This should let us fall back to an efficient code generation strategy when UK is enabled but no UK is implemented for a particular case, or when we only have a slow UK and would prefer a more efficient outcome involving code generation. I think unifying the paths and playing with the existing flags that we have (and perhaps extending them a bit) should give coverage to all the scenarios we care about.

Thoughts?

1. We should decouple DT from UK. I don't see a reason not to apply DT to the f32, i8 -> f32 matmul, at least when UK is not enabled. DT should just work for those cases and we shouldn't have any problem generating code for this case or any other. If possible, we should enable DT for all cases when UK is disabled.

To be clear, DT and UK are already decoupled; it's just that this particular issue was confusing here.

Obviously, I do agree in principle that DT is always unconditionally good to apply to matmuls, regardless of UK. There is one fundamental difficulty, though, about trying to think of DT as something that's just universally used: DT only makes sense when we know a specific tile shape to use for the case at hand. And we don't know that in general. That is what is currently limiting our ability to just unconditionally DT all matmuls, even though DT is already on by default. Outside of the known cases for which we know a good DT tile shape, we just fall back on non-DT.
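
To make the "specific tile shape" point concrete, here is a rough illustration of what materializing data-tiling for an f32 matmul looks like once a tile shape is known: pack both operands and the accumulator into the tiled layout, run mmt4d, and unpack. The sizes (128x512x256 matmul, M0=N0=8, K0=1) are placeholders, not the real tile selection for any particular CPU; without a concrete M0/N0/K0 there is nothing to pack into, which is why unknown cases fall back to the non-DT path.

```mlir
// Illustration only: 128x512x256 f32 matmul materialized with placeholder
// tiles M0=8, N0=8, K0=1.
func.func @materialized_matmul(%lhs: tensor<128x256xf32>, %rhs: tensor<256x512xf32>,
                               %acc: tensor<128x512xf32>) -> tensor<128x512xf32> {
  // Pack LHS to [M/M0, K/K0, M0, K0].
  %lhs_dest = tensor.empty() : tensor<16x256x8x1xf32>
  %lhs_p = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [8, 1]
      into %lhs_dest : tensor<128x256xf32> -> tensor<16x256x8x1xf32>
  // Pack (and transpose) RHS to [N/N0, K/K0, N0, K0].
  %rhs_dest = tensor.empty() : tensor<64x256x8x1xf32>
  %rhs_p = tensor.pack %rhs outer_dims_perm = [1, 0] inner_dims_pos = [1, 0]
      inner_tiles = [8, 1] into %rhs_dest : tensor<256x512xf32> -> tensor<64x256x8x1xf32>
  // Pack accumulator to [M/M0, N/N0, M0, N0].
  %acc_dest = tensor.empty() : tensor<16x64x8x8xf32>
  %acc_p = tensor.pack %acc inner_dims_pos = [0, 1] inner_tiles = [8, 8]
      into %acc_dest : tensor<128x512xf32> -> tensor<16x64x8x8xf32>
  // Tiled matmul on the packed layout.
  %res_p = linalg.mmt4d ins(%lhs_p, %rhs_p : tensor<16x256x8x1xf32>, tensor<64x256x8x1xf32>)
                        outs(%acc_p : tensor<16x64x8x8xf32>) -> tensor<16x64x8x8xf32>
  // Unpack back to the row-major result.
  %res = tensor.unpack %res_p inner_dims_pos = [0, 1] inner_tiles = [8, 8]
      into %acc : tensor<16x64x8x8xf32> -> tensor<128x512xf32>
  return %res : tensor<128x512xf32>
}
```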

2. We should probably unify the UK path with the code generation one. This should let us fall back to an efficient code generation strategy when UK is enabled but no UK is implemented for a particular case, or when we only have a slow UK and would prefer a more efficient outcome involving code generation. I think unifying the paths and playing with the existing flags that we have (and perhaps extending them a bit) should give coverage to all the scenarios we care about.

If I understand this point 2 correctly, it is about ensuring that codegen is used whenever UK doesn't have a dedicated fast code path. In that case, that's in perfect alignment with a chat @benvanik and I had last week, and I just filed this issue to continue this line of discussion / get to implementation: #15784.

DT only makes sense when we know a specific tile shape to use for the case at hand

That's a good point and very timely as @pzread is working on a tile size selection infrastructure that should help with this and with the way we encode tile sizes for non-dt, dt and uk. This problem could also be one that we should target in the short term.

Let's keep the issue open until we can figure out whether we would benefit from a ukernel for this one or whether we should do something else.

DT only makes sense when we know a specific tile shape to use for the case at hand

That's a good point and very timely as @pzread is working on a tile size selection infrastructure that should help with this and with the way we encode tile sizes for non-dt, dt and uk. This problem could also be one that we should target in the short term.

I feel that we are mixing different contexts about tile size selection here. The data-tiling tile size selection is part of materialization, which is different from the tile size selection infrastructure. The latter is specifically for the logic in the SelectLoweringStrategy pass, whose implementation details are in KernelDispatch.cpp. The former is about how we materialize encodings into physical ops (e.g., pack/unpack/mmt4d ops); the latter is tile size selection for physical ops (i.e., ops without encodings).

I know that you landed a change which selects different tile sizes for codegen. I thought the reason was that we want to unroll more vectors. I'm treating it as a shortcut/workaround, because I expect vector unrolling to be controlled by the codegen pipeline. We should be able to invoke vector unrolling somewhere once that support is ready. Please correct me if I misunderstood something, thanks!

DT only makes sense when we know a specific tile shape to use for the case at hand

In this context, we don't want to data-tile the f32.i8.f32 cases, because we don't know a specific tile shape (which is supposed to map to some instructions?). Thus, I think the issue is fixed.

Let me run some numbers after integrating the latest changes so that I can tell you how much we care about this matmul flavor. Regarding changes at the model level, perhaps @mariecwhite or @phoenix-meadowlark can advise on that.

Is this an i8-only model or sub-byte? Do you mind filing a bug internally with repro steps so we can look into it?

I feel that we are mixing different contexts about tile size selection here.

The mix is intentional :) We are selecting tile sizes for one purpose or the other, which shouldn't really matter. We need a tile size selection infra that can serve any tile size selection purpose in a modular way that scales. The fact that non-DT and DT tile size selection is so disconnected is problematic, as this issue illustrates.

Let's keep the issue open until we can figure out whether we would benefit from a ukernel for this one or whether we should do something else.

I can confirm that we need more work here. The dispatch is still the hottest one in the model after enabling proper vectorization. The matmul is large and we are not doing any memory optimizations yet on the non-DT path, so we need DT at least.

@pzread, do you think you could help with this? I think we would need to: 1) add tile sizes for the [f32, i8, f32] case to enable DT with no UKs, and 2) introduce a "UK legalization" pass that can convert these unsupported cases into supported ones (e.g., [f32, i8, f32] -> convert i8 -> f32 + [f32, f32, f32]). This pass should only trigger when UK is enabled. Please feel free to discuss the details with Hanhan and Benoit.

Sure, I'll take a look this week

So my understanding of the TODO tasks is:

  1. Currently we don't support DT for [f32, i8, f32]. But we should experiment with DT + codegen to see if proper tile sizes can actually improve the performance compared to non-DT.
  2. Convert unsupported data types to supported data types for UK and see if the performance can be improved. I'm not exactly sure where we should do the conversion. My initial thought is to insert arith.sitofp before the tensor.pack, so there is a chance that we can fuse the generic + sitofp + pack (see the sketch after this list).
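
A rough sketch of item 2, with assumed shapes, names, and tile sizes (K=256, N=9600, N0=16, K0=1, chosen to match the RHS of the dispatch above): the i8 -> f32 conversion is inserted ahead of the tensor.pack, so the producer generic, the sitofp, and the pack have a chance to fuse, and the mmt4d consumer only ever sees f32 operands.

```mlir
// Sketch only; shapes, names, and tile sizes are assumptions for illustration.
func.func @promote_then_pack(%rhs_i8: tensor<256x9600xi8>) -> tensor<600x256x16x1xf32> {
  // Elementwise i8 -> f32 promotion, placed before the pack so it can fuse
  // with the RHS producer and with the pack itself.
  %rhs_f32 = arith.sitofp %rhs_i8 : tensor<256x9600xi8> to tensor<256x9600xf32>
  // Pack into the mmt4d RHS layout [N/N0, K/K0, N0, K0] with N0=16, K0=1.
  %dest = tensor.empty() : tensor<600x256x16x1xf32>
  %packed = tensor.pack %rhs_f32 outer_dims_perm = [1, 0] inner_dims_pos = [1, 0]
      inner_tiles = [16, 1] into %dest : tensor<256x9600xf32> -> tensor<600x256x16x1xf32>
  return %packed : tensor<600x256x16x1xf32>
}
```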

I have a prototype that promotes matmuls to element types with available ukernels: #15873

I think one issue is that currently we don't fuse arith.sitofp into the ukernel dispatch. AFAIK there are two potential ways:

  1. Fuse arith.sitofp with tensor.pack. This means we have two dispatches: (generic + sitofp + pack) and (ukernel).
  2. Fuse arith.sitofp with the ukernel. This means we have two dispatches: (generic + pack) and (sitofp + ukernel).

The problem with 1. is that we might create a large intermediate fp32 tensor, which I think defeats the original purpose of using an i8 tensor?

Option 2. is more memory friendly, as we can potentially tile the sitofp together with the ukernel, so we only need a small (or no) temporary fp32 buffer if we can vectorize it. However, IIUC we currently can't fuse sitofp with the ukernel, but it looks like #15826 (comment) will add support for fusing them.