[CPU] Too many kernel launches for some mmt4d ops
hanhanW opened this issue · comments
I observe that the distribution logic sometimes works very poorly for mmt4d ops, which leads to large runtime overheads and potentially more cache misses. E.g.,
```mlir
func.func @mmt4d(%arg0: tensor<16x1152x16x2xbf16>, %arg1: tensor<16384x1152x16x2xbf16>, %arg2: tensor<16x16384x16x16xf32>) -> tensor<16x16384x16x16xf32> {
  %0 = linalg.mmt4d ins(%arg0, %arg1 : tensor<16x1152x16x2xbf16>, tensor<16384x1152x16x2xbf16>) outs(%arg2 : tensor<16x16384x16x16xf32>) -> tensor<16x16384x16x16xf32>
  return %0 : tensor<16x16384x16x16xf32>
}
```
The default configuration (with ukernels) runs in 1400 ms single-threaded on my machine, while easy tuning (with ukernels) gets it down to 1000 ms. That is a 1.4x improvement!
To repro the perf issue:
Default:
```shell
iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  ~/mmt4d.mlir -o /tmp/z.vmfb

iree-benchmark-module \
  --device=local-task \
  --task_topology_group_count=1 \
  --module=/tmp/z.vmfb \
  --function=mmt4d \
  --input=16x1152x16x2xbf16 \
  --input=16384x1152x16x2xbf16 \
  --input=16x16384x16x16xf32
```
Bumping two factors in the distribution logic by 16x gives us the 1.4x improvement:
```shell
iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  --iree-llvmcpu-narrow-matmul-tile-bytes=1048576 \
  --iree-llvmcpu-general-matmul-tile-bytes=1048576 \
  ~/mmt4d.mlir -o /tmp/z.vmfb
```
Short-term solution: find a configuration that works reasonably well for our cases.
Long-term solution: see #16410
The tile size choices concern only the outer M and N dimensions, which here are:
- M = 16
- N = 16384
Even with tile sizes M=1, N=1, each dispatch call still does `1152*16*16*2*2` ops == 1.2 M ops. We know from mmt4d_benchmark on the target machine that it does ~200 Gflop/s on this kernel, so each dispatch call would take about 6 microseconds in that case. That is probably not enough work per dispatch indeed, so yes, we will want to increase the tile sizes to the extent allowed by cache sizes.
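The arithmetic above can be sketched as a quick sanity check (the inner tile sizes M0=16, N0=16, K0=2 come from the tensor types, and the 200 Gflop/s figure is from mmt4d_benchmark as quoted above):

```python
# Per-dispatch work for linalg.mmt4d with distribution tiles M=1, N=1.
# Inner tile sizes from the tensor types: M0=16, N0=16, K0=2; reduction K=1152.
K, M0, N0, K0 = 1152, 16, 16, 2
ops_per_dispatch = K * M0 * N0 * K0 * 2   # 2 ops (mul + add) per MAC
print(ops_per_dispatch)                   # 1179648, i.e. ~1.2 M ops

gflops = 200e9                            # measured mmt4d throughput on znver4
time_us = ops_per_dispatch / gflops * 1e6
print(round(time_us, 1))                  # ~5.9 microseconds per dispatch
```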
The target machine's caches are L1 = 48k, L2 = 1M (per core), L3 = 16M or 24M (shared).
The LHS here is overall 1152k; if tiled with M=1, each tile is 72k, so just that already exceeds the L1 cache size.
The RHS here is overall 1152M (that's > 1 G!). If tiled with N=1, each tile is again 72k, so it again exceeds the L1 cache size.
So even with M==N==1, the working set is already > 2.5x the L1 cache size. We know it's hopeless to try to fit in L1 here; we will have to aim for residency in L2 and hope that prefetching from L2 into L1 stays ahead of us.
So, the two goals for a good distribution tiling here are:
- Traverse the RHS only once. There are two ways to achieve that:
  - Either explicitly change the traversal order to iterate over the N dimension first, then the M dimension (a transposition of the traversal space).
  - Or set the M tile size == 16, so that there is no tiling at all on the M dimension.
- Have L2 reuse.
I'll think a bit more about it. If you want to try something quickly, I would try M=16, N=4. M=16 means the LHS exceeds the L2 cache size; N=4 is chosen to offer some amortization of that, but not so much that it would put much more pressure still on L2.
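A rough L2 budget for that suggested M=16, N=4 tiling can be sketched as below (these are the values under discussion, not a tuned configuration; the accumulator term is my own addition to the back-of-envelope math):

```python
# Rough L2 footprint for suggested distribution tiles M=16, N=4 on znver4.
KiB = 1024
L2 = 1024 * KiB  # 1 MiB per core

lhs = 16 * 1152 * 16 * 2 * 2   # M tile == 16 keeps the whole LHS: 1152 KiB
rhs = 4 * 1152 * 16 * 2 * 2    # an N=4 slice of the RHS: 288 KiB
acc = 16 * 4 * 16 * 16 * 4     # f32 accumulator tile (16x4 outer, 16x16 inner): 64 KiB

print(lhs // KiB, rhs // KiB, acc // KiB)  # 1152 288 64
print(lhs > L2)                            # True: the LHS alone exceeds L2
```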
This was an issue from a sprint; closing it because there are no action items.