[CPU] Too many kernel launches for some mmt4d ops
hanhanW opened this issue · comments
I observe that the distribution logic sometimes works very poorly for mmt4d ops, which leads to large runtime overheads and potentially more cache misses. E.g.,
```mlir
func.func @mmt4d(%arg0: tensor<16x1152x16x2xbf16>, %arg1: tensor<16384x1152x16x2xbf16>, %arg2: tensor<16x16384x16x16xf32>) -> tensor<16x16384x16x16xf32> {
  %0 = linalg.mmt4d ins(%arg0, %arg1 : tensor<16x1152x16x2xbf16>, tensor<16384x1152x16x2xbf16>) outs(%arg2 : tensor<16x16384x16x16xf32>) -> tensor<16x16384x16x16xf32>
  return %0 : tensor<16x16384x16x16xf32>
}
```
The default configuration (with ukernels) runs in 1400 ms single-threaded on my machine, while easy tuning (with ukernels) gets it down to 1000 ms. That is a 1.4x improvement!
To repro the perf issue:
Default:
```shell
iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  ~/mmt4d.mlir -o /tmp/z.vmfb

iree-benchmark-module \
  --device=local-task \
  --task_topology_group_count=1 \
  --module=/tmp/z.vmfb \
  --function=mmt4d \
  --input=16x1152x16x2xbf16 \
  --input=16384x1152x16x2xbf16 \
  --input=16x16384x16x16xf32
```
Bumping two factors in the distribution logic by 16x gives us the 1.4x improvement:
```shell
iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  --iree-llvmcpu-narrow-matmul-tile-bytes=1048576 \
  --iree-llvmcpu-general-matmul-tile-bytes=1048576 \
  ~/mmt4d.mlir -o /tmp/z.vmfb
```
Short-term solution: find a configuration that works reasonably well for our cases.
Long-term solution: see #16410
The tile size choices concern only the outer M and N dimensions, which here are:
- M = 16
- N = 16384
Even with tile sizes M=1, N=1, each dispatch call still does `1152*16*16*2*2` ops == 1.2 M ops. We know from mmt4d_benchmark on the target machine that it does ~200 Gflop/s on this kernel, so each dispatch call would take about 6 microseconds in that case. That is probably not enough work per dispatch indeed, so yes, we will want to increase the tile sizes to the extent allowed by cache sizes.
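The arithmetic above can be sketched as a quick sanity check (the inner tile sizes M0=16, N0=16, K0=2 come from the tensor types, and the 200 Gflop/s figure is from mmt4d_benchmark as quoted above):

```python
# Per-dispatch work for linalg.mmt4d with distribution tiles M=1, N=1.
# Inner tile sizes from the tensor types: M0=16, N0=16, K0=2; reduction K=1152.
K, M0, N0, K0 = 1152, 16, 16, 2
ops_per_dispatch = K * M0 * N0 * K0 * 2   # 2 ops (mul + add) per MAC
print(ops_per_dispatch)                   # 1179648, i.e. ~1.2 M ops

gflops = 200e9                            # measured mmt4d throughput on znver4
time_us = ops_per_dispatch / gflops * 1e6
print(round(time_us, 1))                  # ~5.9 microseconds per dispatch
```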
The target machine's caches are L1 = 48k, L2 = 1M (per core), L3 = 16M or 24M (shared).
The LHS here is overall 1152k; if tiled with M=1, each tile is 72k, so just that already exceeds the L1 cache size.
The RHS here is overall 1152M (that's > 1 G!). If tiled with N=1, each tile is again 72k, so it again exceeds the L1 cache size.
So even with M==N==1, the working set is already > 2.5x the L1 cache size. We know it's hopeless to try to fit in L1 here; we will have to aim for residency in L2 and hope that prefetching from L2 into L1 stays ahead of us.
So, the two goals for a good distribution tiling here are:
- Traverse the RHS only once. There are two ways to achieve that:
  - Either explicitly change the traversal order to iterate over the N dimension first, then the M dimension (a transposition of the traversal space).
  - Or set the M tile size == 16, so that there is no tiling at all on the M dimension.
- Have L2 reuse.
I'll think a bit more about it. If you want to try something quickly, I would try M=16, N=4. M=16 means the LHS exceeds the L2 cache size; N=4 is chosen to offer some amortization of that, but not so much that it would put much more pressure still on L2.
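A rough L2 budget for that suggested M=16, N=4 tiling can be sketched as below (these are the values under discussion, not a tuned configuration; the accumulator term is my own addition to the back-of-envelope math):

```python
# Rough L2 footprint for suggested distribution tiles M=16, N=4 on znver4.
KiB = 1024
L2 = 1024 * KiB  # 1 MiB per core

lhs = 16 * 1152 * 16 * 2 * 2   # M tile == 16 keeps the whole LHS: 1152 KiB
rhs = 4 * 1152 * 16 * 2 * 2    # an N=4 slice of the RHS: 288 KiB
acc = 16 * 4 * 16 * 16 * 4     # f32 accumulator tile (16x4 outer, 16x16 inner): 64 KiB

print(lhs // KiB, rhs // KiB, acc // KiB)  # 1152 288 64
print(lhs > L2)                            # True: the LHS alone exceeds L2
```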
This was an issue from a sprint; closing it because there are no action items.