iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/


Data race between worker threads sharing a buffer

lialan opened this issue

What happened?

Recently a patch was reverted: ab8f668, as it caused intermittent test failures on ARM64 builds.

Further investigation shows that on arm64, ThreadSanitizer reports a data race in memset; however, the TSan stack trace does not give much insight into which part exactly caused it. ASan did not complain on failed test runs.

The reproduction rate is <10%; it reproduces on both macOS and Arm64 Linux. The TSan log is attached.
LastTest.log

Steps to reproduce your issue

  1. revert the revert patch: ab8f668
  2. compile with option -DIREE_ENABLE_TSAN=ON on an ARM machine
  3. run with ctest -R narrow_n_matmul and see the error.

What component(s) does this issue relate to?

Runtime

Version information

commit ab8f668

Additional context

No response

Interpreting the TSan log:

Two worker threads, Thread T57 'iree-worker-0' and Thread T64 'iree-worker-7', are accessing a shared memory buffer without adequate ordering.

T57 is trying to map the buffer which, thanks to the IREE_HAL_MEMORY_ACCESS_DISCARD flag plus assertions being enabled, is a memory write, which helps make this reproduce consistently in TSan:

  Write of size 8 at 0xffffee603b80 by thread T57:
    #0 memset <null> (iree-check-module+0x15511c) (BuildId: d50e3fb55e5361f2)
    #1 iree_hal_heap_buffer_map_range /Users/lialan/workspace/iree/runtime/src/iree/hal/buffer_heap.c:270:5 (iree-check-module+0x24be8c) (BuildId: d50e3fb55e5361f2)

Note: the buffer_heap.c code in frame #1 above writes those 0xCD bytes:

#ifndef NDEBUG
  if (iree_any_bit_set(memory_access, IREE_HAL_MEMORY_ACCESS_DISCARD)) {
    memset(mapping->contents.data, 0xCD, local_byte_length);
  }
#endif  // !NDEBUG

T64 had previously written to the buffer inside of dispatch function code:

  Previous write of size 8 at 0xffffee603b80 by thread T64:
    #0 <null> <null> (iree_dylib_CtHR1K_mem_.so+0x12ee8)
    #1 iree_hal_system_executable_issue_call /Users/lialan/workspace/iree/runtime/src/iree/hal/local/loaders/system_library_loader.c:331:13 (iree-check-module+0x4a90a8) (BuildId: d50e3fb55e5361f2)

A question for @benvanik - when two worker threads cooperate on a buffer, where is the synchronization code that is meant to prevent a data race? Is there code meant to ensure that no two tasks sharing a buffer can be scheduled concurrently?

There's nothing that implicitly synchronizes access. If two concurrently executing operations access the same locations in memory they must use atomics to do their work.

I don't know what the above is, but that looks like a miscompile in codegen or a TSAN misidentification: a command buffer copy and a dispatch should not be accessing the same memory concurrently.

(or benoit will school me on some fun memory system detail again and we'll find something juicy to fix ;)

That's what I'm asking about: when you say "a command buffer copy and a dispatch should not be accessing the same memory concurrently", what specific piece of code implements that "should not" ?

The compiler must insert barriers; the lack of barriers implies no synchronization is required. The barriers show up as structured statements in the stream IR: any sibling operations within a stream.async.concurrent/stream.cmd.concurrent have no barriers between them, and all peer operations to the concurrent have barriers. In the HAL the barriers are hal.command_buffer.execution_barrier, and at runtime they are recorded as iree_hal_command_buffer_execution_barrier. On the task execution system the barriers become edges between the tasks in the task graph. If the compiler has an execution barrier between a dispatch and a copy, then at runtime the task issuing the last workgroup of the dispatch is what queues up the tasks for the copies, and no copy task will run until all of the dispatch workgroup tasks have completed.
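The "last workgroup queues the dependents" scheme described above can be sketched as follows. This is a toy model in Python, not IREE's task system: a counter tracks outstanding workgroups, and whichever worker decrements it to zero enqueues the dependent copy, so the copy can never overlap a workgroup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_dispatch_then_copy(workgroup_count, dispatch_tile, copy_task, pool):
    """Issue `workgroup_count` workgroup tasks; the last one to finish
    enqueues `copy_task`, modeling a barrier edge in the task graph."""
    remaining = [workgroup_count]
    lock = threading.Lock()

    def workgroup(wg_id):
        dispatch_tile(wg_id)
        with lock:
            remaining[0] -= 1
            is_last = remaining[0] == 0
        if is_last:
            # The barrier edge: only the last-finishing workgroup queues
            # the dependent copy task.
            pool.submit(copy_task)

    for wg in range(workgroup_count):
        pool.submit(workgroup, wg)

# Demo: 8 workgroups followed by a copy; record completion order.
order = []
order_lock = threading.Lock()
copy_done = threading.Event()

def tile(wg_id):
    with order_lock:
        order.append(("workgroup", wg_id))

def copy():
    with order_lock:
        order.append(("copy",))
    copy_done.set()

with ThreadPoolExecutor(max_workers=4) as pool:
    run_dispatch_then_copy(8, tile, copy, pool)
    copy_done.wait(timeout=5)
```

However the workgroups interleave, the copy entry always lands last in `order`, because every workgroup appends its record before decrementing the counter.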

Oh I see, it's in the compiler-generated code. I was mistakenly looking for barriers in the runtime. OK, so that means we should be looking at IR. Can you suggest where to look?

@lialan can you edit the source of that narrow_n_matmul test to keep only the failing testcase module.batch_matmul_narrow_n_2 and then compile it with -mlir-disable-threading -mlir-print-ir-after-all and upload the resulting log?

--compile-to=stream will show you the stream level and you'll be able to see if the operations are expected to be run concurrently (if they're in a stream.cmd.concurrent together)

(but an ir-after-all dump is best - then you can search from the bottom up to the last stream.cmd.concurrent, or last stream.async.concurrent which can be easier to read, etc)

(message crossed your edit --- OK, let's do both @lialan.) Ah OK, that's going to be much lighter than -mlir-print-ir-after-all, thanks.

@benvanik @bjacob Attaching the dumps:
This one generated by: ./tools/iree-compile --iree-hal-target-backends=llvm-cpu --iree-global-opt-enable-early-materialization=false --compile-to=stream -mlir-disable-threading:
stream_out.txt

This one is just print-after-all:
print_after_all.txt

lol your print ir after all dump is truncated because TSAN is catching a signal:

iree-compile: iree/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp:339: mlir::iree_compiler::TileMxNxK mlir::iree_compiler::chooseMatmulTile(ArrayRef<mlir::iree_compiler::TileMxNxK>, int64_t, int64_t, ArrayRef<int64_t>): Assertion `(hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() == 3) && "expected hostDefinedUpperBound is empty or has upper bound for {M, " "N, K}"' failed.

that may be the root cause of the issue here and you should check on that.

The IR as it pertains to runtime scheduling seems fine to me.
This reads as 3 dispatches doing set_encoding concurrently, reading and writing independent buffer ranges (I don't know why you're encoding the result - seems silly, but whatever :), a matmul consuming those results, and then an unset_encoding and copy concurrently operating on independent buffer ranges.

    %11 = stream.cmd.execute await(%10) => with(%2 as %arg0: !stream.resource<constant>{%7}, %4 as %arg1: !stream.resource<constant>{%8}, %6 as %arg2: !stream.resource<constant>{%9}, %__constant_tensor_2x3x2xi32 as %arg3: !stream.resource<constant>{%c256}, %result as %arg4: !stream.resource<external>{%c128}, %result_0 as %arg5: !stream.resource<transient>{%c3072}) {
      stream.cmd.concurrent {
        stream.cmd.dispatch @_batch_matmul_narrow_n_2_dispatch_0::@_batch_matmul_narrow_n_2_dispatch_0_set_encoding_LHS_2x3x4 {
          ro %arg0[%c0 for %7] : !stream.resource<constant>{%7},
          wo %arg5[%c0 for %c512] : !stream.resource<transient>{%c3072}
        }
        stream.cmd.dispatch @_batch_matmul_narrow_n_2_dispatch_1::@_batch_matmul_narrow_n_2_dispatch_1_set_encoding_RHS_2x4x2 {
          ro %arg1[%c0 for %8] : !stream.resource<constant>{%8},
          wo %arg5[%c512 for %c512] : !stream.resource<transient>{%c3072}
        }
        stream.cmd.dispatch @_batch_matmul_narrow_n_2_dispatch_2::@_batch_matmul_narrow_n_2_dispatch_2_set_encoding_RESULT_2x3x2 {
          ro %arg2[%c0 for %9] : !stream.resource<constant>{%9},
          wo %arg5[%c1024 for %c2048] : !stream.resource<transient>{%c3072}
        }
      }
      stream.cmd.dispatch @_batch_matmul_narrow_n_2_dispatch_3::@_batch_matmul_narrow_n_2_dispatch_3_batch_matmul_2x3x2x4_i8xi8xi32 {
        ro %arg5[%c0 for %c512] : !stream.resource<transient>{%c3072},
        ro %arg5[%c512 for %c512] : !stream.resource<transient>{%c3072},
        rw %arg5[%c1024 for %c2048] : !stream.resource<transient>{%c3072}
      }
      stream.cmd.concurrent {
        stream.cmd.dispatch @_batch_matmul_narrow_n_2_dispatch_4::@_batch_matmul_narrow_n_2_dispatch_4_unset_encoding_RESULT_2x3x2 {
          ro %arg5[%c1024 for %c2048] : !stream.resource<transient>{%c3072},
          wo %arg4[%c0 for %c48] : !stream.resource<external>{%c128}
        }
        stream.cmd.copy %arg3[%c0], %arg4[%c64], %c48 : !stream.resource<constant>{%c256} -> !stream.resource<external>{%c128}
      }
    } => !stream.timepoint

Since there's no reuse here, there's really not anything within the two concurrent regions that could have TSan conflicts unless the dispatches are incorrectly touching memory. For example, if the unset_encoding is touching %arg4 outside of the range of [0,48) bytes then it will stomp on the copy touching [64,112).

The address at fault in the TSAN report is 256B into the allocation but we slab allocate so that doesn't mean 256B into the buffer (as there's the buffer header in there). The HAL IR or a runtime --trace_execution dump would show what the HAL is doing with its copy buffer (I'm guessing it'll be offset 64 for length 48 as in the IR above), in which case the unset_encoding may be the culprit.
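The translation from "offset into the allocation" to "offset into the buffer contents" is a subtraction past the header, sketched below. The 64-byte header size here is a made-up illustration, not IREE's actual slab layout:

```python
# Hypothetical sketch: `header_bytes` is a made-up value, not IREE's actual
# slab layout. The point is only that a TSan fault address N bytes into the
# allocation is N minus the header size bytes into the buffer contents.
def allocation_offset_to_buffer_offset(fault_offset, header_bytes):
    assert fault_offset >= header_bytes, "fault lies inside the header"
    return fault_offset - header_bytes

# e.g. with a hypothetical 64-byte header, 256B into the slab would be
# 192B into the contents.
assert allocation_offset_to_buffer_offset(256, 64) == 192
```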

(I do suspect that assertion is guarding correct behavior, and we really should not be relying on assertions for things like that - if analysis fails then we need to error out properly) /cc @hanhanW

lol your print ir after all dump is truncated because TSAN is catching a signal:
iree-compile: iree/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp:339: mlir::iree_compiler::TileMxNxK mlir::iree_compiler::chooseMatmulTile(ArrayRef<mlir::iree_compiler::TileMxNxK>, int64_t, int64_t, ArrayRef<int64_t>): Assertion `(hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() == 3) && "expected hostDefinedUpperBound is empty or has upper bound for {M, " "N, K}"' failed.

Darn, thanks, that is indeed related to the PR that triggers this. Looking into this.

(I don't know why you're encoding the result - seems silly, but whatever :)

it's just a misnomer - we're encoding the input accumulator here; encodings don't care about input vs output though, so it's really just the accumulator encoding, which got named the result encoding, which is fair when we're talking about the output accumulator, which is the result, but is confusing when talking about the input accumulator.

I see, the assert failure reproduces only with --iree-global-opt-enable-early-materialization=false.

It's actually a red herring / unrelated issue. In particular, --iree-global-opt-enable-early-materialization=false is only passed in the e2e matmul tests but not in that narrow_n_matmul test that is also failing. The assert only fails for batch cases because the 4th dimension causes that array to have size 4, defeating the assertion that it has to be 0 or 3. The fix is trivial but doesn't change the race condition we're observing at runtime.

Fix:

diff --git a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
index c1948b9418..64e9ad3216 100644
--- a/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp
@@ -334,7 +334,7 @@ static TileMxNxK
 chooseMatmulTile(ArrayRef<TileMxNxK> enumeratedTiles, int64_t matmulNarrowM,
                  int64_t matmulNarrowN,
                  ArrayRef<int64_t> hostDefinedUpperBound = {}) {
-  assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() == 3) &&
+  assert((hostDefinedUpperBound.empty() || hostDefinedUpperBound.size() >= 3) &&
          "expected hostDefinedUpperBound is empty or has upper bound for {M, "
          "N, K}");
   // Handle narrow-N by transposing to reduce to narrow-M. Note: the
diff --git a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
index 83874316f8..cfb11c36f0 100644
--- a/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/EncodingUtils.cpp
@@ -65,10 +65,10 @@ static RankedTensorType transposeIfNarrowNResult(RankedTensorType tensorType) {
 
   // auto newRoundDimsTo = encoding.getRoundDimsToArray();
   SmallVector<int64_t> newRoundDimsTo(encoding.getRoundDimsToArray());
-  assert(newRoundDimsTo.size() == 0 || newRoundDimsTo.size() == 3);
-  if (newRoundDimsTo.size() != 0)
-    std::swap(newRoundDimsTo[0], newRoundDimsTo[1]);
-
+  assert(newRoundDimsTo.size() == 0 || newRoundDimsTo.size() >= 3);
+  if (newRoundDimsTo.size() != 0) {
+    std::swap(newRoundDimsTo[newRoundDimsTo.size() - 3], newRoundDimsTo[newRoundDimsTo.size() - 2]);
+  }
   auto context = tensorType.getContext();
   AffineMap permutation = AffineMap::getPermutationMap(permIndices, context);
   for (auto &map : maps) {
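The indexing change in the EncodingUtils.cpp hunk can be sketched as follows (a standalone model of the swap, not the actual C++): for batched matmuls the array carries one entry per dimension ({B, M, N, K}, size 4), so the narrow-N transpose must swap the M and N entries counted from the end rather than at fixed indices 0 and 1.

```python
# Model of the fixed swap: swap the M and N entries, which are always the
# third- and second-from-last, regardless of whether a batch dim is present.
def swap_round_dims_m_n(round_dims_to):
    dims = list(round_dims_to)
    if dims:
        assert len(dims) >= 3, "expect at least {M, N, K}"
        n = len(dims)
        dims[n - 3], dims[n - 2] = dims[n - 2], dims[n - 3]
    return dims

assert swap_round_dims_m_n([2, 3, 4]) == [3, 2, 4]        # {M, N, K}
assert swap_round_dims_m_n([9, 2, 3, 4]) == [9, 3, 2, 4]  # {B, M, N, K}
assert swap_round_dims_m_n([]) == []                      # unset
```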

However: there is a lot of gratuitous complexity here arising from the fact that RoundDimsTo is an array... while in practice all entries in it are initialized to 16 in SetEncoding. We should make RoundDimsTo a single scalar for all dimensions that this rounding may apply to, which would simplify a lot of code @hanhanW (also, we would never want to apply that to the batch dimension, so the 16 there was meaningless).

Here is the print-ir-after-all regenerated with that so that we get past this assertion:
https://gist.githubusercontent.com/bjacob/d88eed55cb9b57771f1ddbe1b9cbaa05/raw/7efc2fb97173603a9e8bd86cbc53f96b6be9d544/log.mlir
@benvanik PTAL?

@benvanik, for this to manifest only on arm64, this must be almost correct and only incorrect in some lower-level detail. So it's not surprising that, looking at Stream above, everything was OK. I'm curious how these things get lowered down to synchronization primitives later on (HAL?); hopefully that's something we can see now with the print-ir-after-all in the previous comment (getting past that assertion).

Can you pinpoint the ops implementing both sides of the synchronization that should be happening, and point to their C implementations, so we can start to reason about what they are doing?

There's no difference across targets on the runtime side. If this is only showing with arm64 you'll definitely need to look at your dispatches. Perhaps it's something that TSan on other targets is not catching but arm64 is, though it's doubtful that there's target-specific functionality in TSan - maybe there are target-specific rules that only apply there. AFAIK the only variances in sanitizers have been around aligned/unaligned accesses and such, but sanitizers work in mysterious ways (sometimes AFAICT not even known to their creators :).

Nearly every time there's been an issue with clobbering data/races it's been in the dispatches. The only thing the host-side compiler/runtime does is provide appropriate buffer ranges to the dispatches, and here they look correct. Assuming it's the dispatch_4 unset_encoding overlapping with the copy, dispatch_4 should only be writing bytes 0 to 48 (given %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>) and the copy is scheduled to write 64 to 112 (given hal.command_buffer.copy_buffer<%cmd : !hal.command_buffer> source(%__constant_tensor_2x3x2xi32 : !hal.buffer)[%c0] target(%transient_buffer : !hal.buffer)[%c64] length(%c48)). I doubt there's a bug in the copy_buffer (but would gladly buy someone a beer if there was!), so if the two are writing the same ranges concurrently then it's got to be in the dispatch. You can printf in iree_hal_command_buffer_copy_buffer or iree_hal_heap_buffer_map_range to verify the range.

(if you can tell me 100% that the dispatch is never possibly accessing a range the copy is and the copy is writing to the correct range we can talk about how tasks are scheduled - I don't really think it's useful to go down that route until confident, though)

There's no difference across targets on the runtime side. If this is only showing with arm64 you'll definitely need to look at your dispatches.

(if you can tell me 100% that the dispatch is never possibly accessing a range the copy is and the copy is writing to the correct range we can talk about how tasks are scheduled - I don't really think it's useful to go down that route until confident, though)

Well, it is true that this issue occurred on ARM64 CI. However, given that it is intermittent and the recurrence rate is not 100%, we cannot rule out the possibility that it can be reproduced on other platforms as well.

@bjacob given the same configuration, is it true that we will reach a same/similar execution path on x86? Worth trying out?

Sure, worth retrying. The last time I tried on x86, I didn't retry enough times to be 100% sure.

I have pushed my fixes to the above assertion failure, on top of your transpose_rebase branch, to this branch on origin:

https://github.com/iree-org/iree/tree/bjacob-transpose-rebase-repro

Also sharing the small reproducer with just the failing testcase:

https://gist.github.com/bjacob/f248a132c0a9af929047b898936fa8c0

And my command line to build and run this with retrying until failure:

ninja iree-compile && tools/iree-compile --iree-hal-target-backends=llvm-cpu /tmp/a.mlir -o /tmp/a.vmfb \
  && for i in `seq 1 100`; do tools/iree-check-module --module=/tmp/a.vmfb --device=local-task || break; done; \
  echo; echo "Failed after retrying $i times."

Here are my observations on Mac/arm64:

  • This reproduces only with --iree-hal-target-backends=llvm-cpu, not vmvx, not even with vmvx microkernels.
  • On llvm-cpu:
    • This reproduces with data-tiling (default). This does not reproduce when data-tiling is disabled.
    • This reproduces the same with or without early-materialization.
    • This reproduces with or without ukernels. However:
      • Without ukernels, only the first batch is incorrectly zero, the second batch is correct.
      • With ukernels, both batches are incorrectly zero.

Also, since we are considering the possibility of dispatches writing out of bounds, we should check ASan.

Tried >10000 runs with @bjacob's branch and the script: 0 failures on x86.

From the trace, we see that the data race is between a dispatch that is the unpack of the matmul output, and a copy_buffer that is copying the golden test results that are passed as the RHS to check.expect_eq.

Notice that in the error message, the erroneous side is the RHS --- which is the golden results:

  check.expect_eq_const(%result, dense<[
    [[0, 11], [7, -1], [-17, 14]],
    [[-11, -1], [-29, -15], [-24, -18]]
  ]> : tensor<2x3x2xi32>) : tensor<2x3x2xi32>

So dispatch 4, which does the unpack, is overwriting the golden results with zeros.

Looking at the trace, this makes sense, as both dispatch 4 and the copy_buffer are scheduled concurrently and write to the same destination buffer. They write at different offsets that should be safe, but everything is consistent with a bug in dispatch 4 writing zeros past the end of its normal output span.

[module.batch_matmul_narrow_n_2+000003AD]    vm.call.varadic @hal.command_buffer.push_descriptor_set(%r10(!hal.command_buffer/0x0x13f00a800), %r2(!hal.pipeline_layout/0x0x13e904e90), %i9(0), %i9(0), %i9(0), %r0(!hal.buffer/0x0x13e9053c0), %i16(0), %i28(256), %i6(1), %i9(0), %r9(!hal.buffer/0x0x13e905200), %i16(0), %i22(128))
[module.batch_matmul_narrow_n_2+000003DA]    vm.call @hal.command_buffer.dispatch(%r10(!hal.command_buffer/0x0x13f00a800), %r3(!hal.executable/0x0x13e904f30), %i4(4), %i3(2), %i6(1), %i6(1))
[module.batch_matmul_narrow_n_2+000003F0]    vm.br ^000003F8()
[module.batch_matmul_narrow_n_2+000003F9]    vm.call @hal.command_buffer.copy_buffer(%r10(!hal.command_buffer/0x0x13f00a800), %r1(!hal.buffer/0x0x13e904980), %i16(0), %r9(!hal.buffer/0x0x13e905200), %i26(64), %i20(48))

Here we see that both are writing to the buffer at 0x13e905200, which has size 128.

  • dispatch 4 should be writing 48 bytes at offset 0.
  • copy_buffer is writing 48 bytes at offset 64.

So this looks like dispatch 4 is writing zeros past the end of its normal output span.
Specifically:

  • With default codegen, it overwrites the first 24 bytes of the golden values, meaning offsets 64--88, meaning it's writing 40 bytes of zeros past its normal end at offset 48.
  • With ukernels, it overwrites the whole 48 bytes of golden values, meaning offsets 64--112, meaning it's writing 64 bytes past its normal end at offset 48.
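A quick sanity check of this range arithmetic (the overrun byte counts below are just the ones inferred above):

```python
# Half-open byte ranges [start, end) within the 128-byte destination buffer.
def ranges_overlap(a, b):
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end

golden = (64, 64 + 48)   # copy_buffer writes the golden values here
intended = (0, 48)       # what dispatch 4 is supposed to write

assert not ranges_overlap(intended, golden)   # correct behavior: no race
# Default codegen: 40 bytes of zeros past offset 48 clobber [64, 88).
assert ranges_overlap((0, 48 + 40), golden)
# Ukernels: 64 bytes past offset 48 clobber all of [64, 112).
assert ranges_overlap((0, 48 + 64), golden)
```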

(So this is exactly what Ben was saying, and our next step is to look closely at arm64 codegen.)

This is the matmul dispatch, the barrier, and the concurrent unset_encoding dispatch + copy:

[module.batch_matmul_narrow_n_2+0000033F]    vm.call.varadic @hal.command_buffer.push_descriptor_set(%r10(!hal.command_buffer/0x0x13f00a800), %r2(!hal.pipeline_layout/0x0x13e904e90), %i9(0), %i9(0), %i9(0), %r0(!hal.buffer/0x0x13e9053c0), %i16(0), %i28(256), %i6(1), %i9(0), %r0(!hal.buffer/0x0x13e9053c0), %i16(0), %i28(256))
[module.batch_matmul_narrow_n_2+0000036C]    vm.call @hal.command_buffer.dispatch(%r10(!hal.command_buffer/0x0x13f00a800), %r3(!hal.executable/0x0x13e904f30), %i5(3), %i3(2), %i6(1), %i6(1))
[module.batch_matmul_narrow_n_2+00000382]    vm.br ^0000038A()
[module.batch_matmul_narrow_n_2+0000038B]    vm.call @hal.command_buffer.execution_barrier(%r10(!hal.command_buffer/0x0x13f00a800), %i2(28), %i1(13), %i9(0))
[module.batch_matmul_narrow_n_2+0000039C]    vm.cond_br %i7(1), ^000003AC(), ^000003F8()
[module.batch_matmul_narrow_n_2+000003AD]    vm.call.varadic @hal.command_buffer.push_descriptor_set(%r10(!hal.command_buffer/0x0x13f00a800), %r2(!hal.pipeline_layout/0x0x13e904e90), %i9(0), %i9(0), %i9(0), %r0(!hal.buffer/0x0x13e9053c0), %i16(0), %i28(256), %i6(1), %i9(0), %r9(!hal.buffer/0x0x13e905200), %i16(0), %i22(128))
[module.batch_matmul_narrow_n_2+000003DA]    vm.call @hal.command_buffer.dispatch(%r10(!hal.command_buffer/0x0x13f00a800), %r3(!hal.executable/0x0x13e904f30), %i4(4), %i3(2), %i6(1), %i6(1))
[module.batch_matmul_narrow_n_2+000003F0]    vm.br ^000003F8()
[module.batch_matmul_narrow_n_2+000003F9]    vm.call @hal.command_buffer.copy_buffer(%r10(!hal.command_buffer/0x0x13f00a800), %r1(!hal.buffer/0x0x13e904980), %i16(0), %r9(!hal.buffer/0x0x13e905200), %i26(64), %i20(48))
[module.batch_matmul_narrow_n_2+0000040E]    vm.call @hal.command_buffer.execution_barrier(%r10(!hal.command_buffer/0x0x13f00a800), %i2(28), %i1(13), %i9(0))

it reads like the HAL IR just with the values available, so we can see for example the unset_encoding take buffer %r9 range [0, 128) as its binding and then the copy buffer target being %r9 range [64, (64+48)). In this case we had all the values in the IR and there wasn't a risk of that being wrong but it's a useful debugging tool when dynamic values are in play :)

ASan won't show issues by default because the dispatch is likely not writing out of bounds of the allocation but instead over a slice of the buffer it shouldn't touch. I'll make it a flag (like I've meant to for years!) today, but swapping packStaticSlicesGreedily to packSlicesWithNoAliasing here is one useful knob:

offset = packStaticSlicesGreedily(packOp, offset, staticSlices,
                                  resourceConfig, indexSet, builder);
// TODO(benvanik): make this an option; it can be useful for debugging
// this code.
// offset = packSlicesWithNoAliasing(packOp, offset, staticSlices,
//                                   resourceConfig, indexSet, builder);

actually, @bjacob you deleted that function :P d991dc9 well, I'll deal with that later I guess.

In this case I doubt it will change things as nothing is aliasing, but it's another useful tool for explorations like this.
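For illustration, a no-aliasing packing strategy amounts to giving every static slice its own aligned, disjoint range. This is a toy model of the idea, not the real packSlicesWithNoAliasing logic:

```python
# Toy model: assign each slice a disjoint, 64-byte-aligned offset so no two
# slices can ever alias, at the cost of a larger transient allocation.
def pack_slices_no_aliasing(slice_sizes, alignment=64):
    offsets, offset = [], 0
    for size in slice_sizes:
        offsets.append(offset)
        # Round each slice up to the alignment so the next one can't overlap.
        offset += -(-size // alignment) * alignment
    return offsets, offset

# Two 48-byte slices land at offsets 0 and 64 with a 128-byte total -
# the same disjoint layout as the destination buffer in this issue.
assert pack_slices_no_aliasing([48, 48]) == ([0, 64], 128)
```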

While writing this I see your posts above so none of this is likely useful, but thought I'd get it out of my head :)

Here is the disassembly of the actual arm64 module code, for that dispatch, from a run where the bug reproduced. Assuming that nothing outside of the target library call function would be writing zeros to buffers, the faulty zero writes have to be done by this code?

0000000000003898 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32>:
    3898: b9400048     	ldr	w8, [x2]
    389c: 7100051f     	cmp	w8, #0x1
    38a0: 540003a8     	b.hi	0x3914 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32+0x7c>
    38a4: a9bf7bfd     	stp	x29, x30, [sp, #-0x10]!
    38a8: 910003fd     	mov	x29, sp
    38ac: f940102a     	ldr	x10, [x1, #0x20]
    38b0: b9400c29     	ldr	w9, [x1, #0xc]
    38b4: 5280030b     	mov	w11, #0x18              ; =24
    38b8: a940294c     	ldp	x12, x10, [x10]
    38bc: 9bab290d     	umaddl	x13, w8, w11, x10
    38c0: 8b09052b     	add	x11, x9, x9, lsl #1
    38c4: d37ae52a     	lsl	x10, x9, #6
    38c8: 8b08198c     	add	x12, x12, x8, lsl #6
    38cc: d37df16b     	lsl	x11, x11, #3
    38d0: 910081ad     	add	x13, x13, #0x20
    38d4: 9102818c     	add	x12, x12, #0xa0
    38d8: ad7f8182     	ldp	q2, q0, [x12, #-0x10]
    38dc: 8b090108     	add	x8, x8, x9
    38e0: 3cde0181     	ldur	q1, [x12, #-0x20]
    38e4: 3dc00583     	ldr	q3, [x12, #0x10]
    38e8: f100091f     	cmp	x8, #0x2
    38ec: 8b0a018c     	add	x12, x12, x10
    38f0: 4e803824     	zip1.4s	v4, v1, v0
    38f4: 4e807820     	zip2.4s	v0, v1, v0
    38f8: 4e833841     	zip1.4s	v1, v2, v3
    38fc: 4e837842     	zip2.4s	v2, v2, v3
    3900: ad3f01a4     	stp	q4, q0, [x13, #-0x20]
    3904: ad0009a1     	stp	q1, q2, [x13]
    3908: 8b0b01ad     	add	x13, x13, x11
    390c: 54fffe63     	b.lo	0x38d8 <__batch_matmul_narrow_n_2_dispatch_4_unpack_i32+0x40>
    3910: a8c17bfd     	ldp	x29, x30, [sp], #0x10
    3914: 2a1f03e0     	mov	w0, wzr
    3918: d65f03c0     	ret

Given the dispatch is 2x1x1, you know it's called twice and there should be two iree_hal_cmd_dispatch_tile calls.
That function is what sets up the args:
https://github.com/iree-org/iree/blob/main/runtime/src/iree/hal/drivers/local_task/task_command_buffer.c#L842-L879

The dispatch functions take 3 args: environment, dispatch state, and workgroup state:
https://github.com/iree-org/iree/blob/main/runtime/src/iree/hal/local/loaders/embedded_elf_loader.c#L210-L212
on arm64 that should be (I believe) x0, x1, x2.

    3898: b9400048     	ldr	w8, [x2]   // workgroup_id_x
    38ac: f940102a     	ldr	x10, [x1, #0x20]  // binding_ptrs[]
    38b0: b9400c29     	ldr	w9, [x1, #0xc]  // workgroup_count_x
    38b8: a940294c     	ldp	x12, x10, [x10]  // binding_ptrs[0]

... I can't immediately find where binding_ptrs[1] gets retrieved (hooray compilers)?

    3900: ad3f01a4     	stp	q4, q0, [x13, #-0x20]
    3904: ad0009a1     	stp	q1, q2, [x13]

looks a bit suspicious though for a function expected to write only 48 bytes? I don't know arm64, but that reads like two 256-bit stores at -32 and 0, meaning 64 bytes per loop iteration, and we know the function is dispatched twice? (maybe only one loop each, but that's always more than 48?)

(haven't had coffee yet, hopefully that's not all wrong :)

looks a bit suspicious though for a function expected to write only 48 bytes? I don't know arm64, but that reads like two 256-bit stores at -32 and 0, meaning 64 bytes per loop iteration, and we know the function is dispatched twice? (maybe only one loop each, but that's always more than 48?)

(EDIT - In fact my local patch was not taking effect at all so disregard this)

Yes, sorry, that is caused by a local patch that I have in the above-referenced branch that sets the TileMxNxK tile size to 8, 8, 4 to remove one of the ways that things differ based on target details. Funny, this doesn't really change the symptoms, but I'll now try without it / with other values.

So, here is where the specific padding sizes are coming from. This is a i8 x i8 -> i32 matmul on baseline arm64, so our logic in CPUMaterializeEncodingPass returns these possible tile sizes:

    return {
        TileMxNxK{4, 16, 2},
        TileMxNxK{2, 16, 2},
        TileMxNxK{1, 16, 2},
    };

Now, the result shape in the source program is 2(batch) x 3(M) x 2(N) x i32.
This has both narrow M and narrow N, but N is smaller, so it triggers our transpose-narrow-N logic (the PR that triggers the whole thing here) and the result shape becomes:

2(batch) x 2(M) x 3(N) x i32

Looking at the above TileMxNxK, we have one for M==2: TileMxNxK{2, 16, 2}, meaning M=2, N=16, K=2, so we select that one, and in terms of padding our result matrix, that becomes:

2(batch) x 2(M) x 16(N) x i32

So each tile is 2x16xi32 = 128 bytes.
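The byte counts in this analysis check out; here is the arithmetic worked through explicitly:

```python
I32_BYTES = 4
tile_m, tile_n = 2, 16   # selected TileMxNxK{2, 16, 2}: result tile is MxN
valid_m, valid_n = 2, 3  # per-batch valid result data after the transpose
batches = 2

tile_bytes = tile_m * tile_n * I32_BYTES
valid_bytes_total = batches * valid_m * valid_n * I32_BYTES

assert tile_bytes == 128        # "each tile is 2x16xi32 = 128 bytes"
assert valid_bytes_total == 48  # the 48 bytes dispatch 4 should write
```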

    38d8: ad7f8182     	ldp	q2, q0, [x12, #-0x10]
    38dc: 8b090108     	add	x8, x8, x9
    38e0: 3cde0181     	ldur	q1, [x12, #-0x20]
    38e4: 3dc00583     	ldr	q3, [x12, #0x10]

This reads 64 contiguous bytes, so that's half a tile. Our loop should run for 2 iterations to complete the tile.

    3900: ad3f01a4     	stp	q4, q0, [x13, #-0x20]
    3904: ad0009a1     	stp	q1, q2, [x13]

That is storing 64 contiguous bytes: that is wrong. The destination shape is an incomplete tile, so there needs to be some extract_slice codegen.
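In other words, the unpack needs to write only the valid sub-tile of each padded tile. A toy model of the difference, with plain Python lists standing in for the tile memory (this is an illustration, not IREE codegen):

```python
# Each row of the padded tile holds tile_n i32 values; only the first
# valid_n of them are real result data.
def unpack_with_extract_slice(tile_rows, valid_n):
    # Correct: copy just the valid leading slice of each row.
    return [row[:valid_n] for row in tile_rows]

def unpack_whole_rows(tile_rows, valid_n):
    # What the store pattern above suggests: whole padded rows are written,
    # spilling zero padding past the valid output extent.
    return [list(row) for row in tile_rows]

tile = [[1, 2, 3] + [0] * 13,   # 2x16 tile, valid data is 2x3
        [4, 5, 6] + [0] * 13]
assert unpack_with_extract_slice(tile, 3) == [[1, 2, 3], [4, 5, 6]]
# The sliced unpack emits 6 i32s (24 bytes); the whole-row variant emits
# 2*16 i32s (128 bytes), overrunning the destination.
assert sum(len(r) for r in unpack_whole_rows(tile, 3)) == 32
```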

; Function Attrs: <Unknown attribute>
define internal i32 @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32(ptr noalias nonnull align 16 %0, ptr noalias nonnull align 16 %1, ptr noalias nonnull align 16 %2) #0 !dbg !111 {
  %4 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !112
  %5 = extractvalue %iree_hal_executable_dispatch_state_v0_t %4, 10, !dbg !112
  %6 = load ptr, ptr %5, align 8, !dbg !112
  %7 = getelementptr i32, ptr %6, i32 32, !dbg !112
  %8 = ptrtoint ptr %7 to i64, !dbg !112
  %9 = and i64 %8, 63, !dbg !112
  %10 = icmp eq i64 %9, 0, !dbg !112
  call void @llvm.assume(i1 %10), !dbg !112
  %11 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !113
  %12 = extractvalue %iree_hal_executable_dispatch_state_v0_t %11, 10, !dbg !113
  %13 = getelementptr ptr, ptr %12, i32 1, !dbg !113
  %14 = load ptr, ptr %13, align 8, !dbg !113
  %15 = ptrtoint ptr %14 to i64, !dbg !113
  %16 = and i64 %15, 63, !dbg !113
  %17 = icmp eq i64 %16, 0, !dbg !113
  call void @llvm.assume(i1 %17), !dbg !113
  %18 = load %iree_hal_executable_workgroup_state_v0_t, ptr %2, align 8, !dbg !114
  %19 = extractvalue %iree_hal_executable_workgroup_state_v0_t %18, 0, !dbg !114
  %20 = zext i32 %19 to i64, !dbg !114
  %21 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !114
  %22 = extractvalue %iree_hal_executable_dispatch_state_v0_t %21, 4, !dbg !114
  %23 = zext i32 %22 to i64, !dbg !114
  br label %24, !dbg !115

24:                                               ; preds = %27, %3
  %25 = phi i64 [ %77, %27 ], [ %20, %3 ]
  %26 = icmp slt i64 %25, 2, !dbg !115
  br i1 %26, label %27, label %78, !dbg !115

27:                                               ; preds = %24
  %28 = mul i64 %25, 16, !dbg !115
  %29 = add i64 %28, 0, !dbg !115
  %30 = add i64 %29, 0, !dbg !115
  %31 = add i64 %30, 0, !dbg !115
  %32 = add i64 %31, 0, !dbg !115
  %33 = getelementptr i32, ptr %7, i64 %32, !dbg !115
  %34 = load <8 x i32>, ptr %33, align 4, !dbg !115
  %35 = add i64 %30, 8, !dbg !115
  %36 = add i64 %35, 0, !dbg !115
  %37 = getelementptr i32, ptr %7, i64 %36, !dbg !115
  %38 = load <8 x i32>, ptr %37, align 4, !dbg !115
  %39 = shufflevector <8 x i32> %34, <8 x i32> %34, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !115
  %40 = shufflevector <16 x i32> %39, <16 x i32> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>, !dbg !115
  %41 = shufflevector <8 x i32> %38, <8 x i32> %38, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !115
  %42 = shufflevector <16 x i32> %41, <16 x i32> %40, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, !dbg !115
  %43 = shufflevector <16 x i32> %42, <16 x i32> %42, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>, !dbg !115
  %44 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 0, i32 1>, !dbg !115
  %45 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 2, i32 3>, !dbg !115
  %46 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 4, i32 5>, !dbg !115
  %47 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 6, i32 7>, !dbg !115
  %48 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 8, i32 9>, !dbg !115
  %49 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 10, i32 11>, !dbg !115
  %50 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 12, i32 13>, !dbg !115
  %51 = shufflevector <16 x i32> %43, <16 x i32> %43, <2 x i32> <i32 14, i32 15>, !dbg !115
  %52 = mul i64 %25, 6, !dbg !115
  %53 = add i64 %52, 0, !dbg !115
  %54 = add i64 %53, 0, !dbg !115
  %55 = getelementptr i32, ptr %14, i64 %54, !dbg !115
  store <2 x i32> %44, ptr %55, align 4, !dbg !115
  %56 = add i64 %52, 2, !dbg !115
  %57 = add i64 %56, 0, !dbg !115
  %58 = getelementptr i32, ptr %14, i64 %57, !dbg !115
  store <2 x i32> %45, ptr %58, align 4, !dbg !115
  %59 = add i64 %52, 4, !dbg !115
  %60 = add i64 %59, 0, !dbg !115
  %61 = getelementptr i32, ptr %14, i64 %60, !dbg !115
  store <2 x i32> %46, ptr %61, align 4, !dbg !115
  %62 = add i64 %52, 6, !dbg !115
  %63 = add i64 %62, 0, !dbg !115
  %64 = getelementptr i32, ptr %14, i64 %63, !dbg !115
  store <2 x i32> %47, ptr %64, align 4, !dbg !115
  %65 = add i64 %52, 8, !dbg !115
  %66 = add i64 %65, 0, !dbg !115
  %67 = getelementptr i32, ptr %14, i64 %66, !dbg !115
  store <2 x i32> %48, ptr %67, align 4, !dbg !115
  %68 = add i64 %52, 10, !dbg !115
  %69 = add i64 %68, 0, !dbg !115
  %70 = getelementptr i32, ptr %14, i64 %69, !dbg !115
  store <2 x i32> %49, ptr %70, align 4, !dbg !115
  %71 = add i64 %52, 12, !dbg !115
  %72 = add i64 %71, 0, !dbg !115
  %73 = getelementptr i32, ptr %14, i64 %72, !dbg !115
  store <2 x i32> %50, ptr %73, align 4, !dbg !115
  %74 = add i64 %52, 14, !dbg !115
  %75 = add i64 %74, 0, !dbg !115
  %76 = getelementptr i32, ptr %14, i64 %75, !dbg !115
  store <2 x i32> %51, ptr %76, align 4, !dbg !115
  %77 = add i64 %25, %23, !dbg !115
  br label %24, !dbg !115

78:                                               ; preds = %24
  ret i32 0, !dbg !116
}

That is storing 64 contiguous bytes per loop iteration: that is wrong. The destination shape is an incomplete tile (3x2xi32, only 24 bytes), so there needs to be some extract_slice codegen to restrict the store.

We know that the 64 bytes come from the padding. So in the case of padding up, the buffer allocation should account for the padding as well.
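To make the overrun concrete, here is a rough back-of-the-envelope model in Python (the helper names are hypothetical; the offsets are read off the LLVM IR above: a destination row stride of 6 i32s per batch, and 8 stores of <2 x i32> per iteration):

```python
# Sketch of the overlapping writes in the dispatch above (hypothetical
# helper names; offsets taken from the LLVM IR: %52 = %25 * 6 gives the
# per-batch destination offset in i32 units, and the loop body issues
# 8 stores of <2 x i32> = 16 i32s = 64 bytes).
I32 = 4  # bytes per i32

def bytes_written(batch):
    """Byte range each worker actually stores (8 rows of 2 i32s)."""
    start = batch * 6 * I32
    return range(start, start + 16 * I32)  # 64 bytes

def bytes_owned(batch):
    """Byte range the 3x2xi32 destination tile actually occupies."""
    start = batch * 6 * I32
    return range(start, start + 6 * I32)   # 24 bytes

overlap = set(bytes_written(0)) & set(bytes_owned(1))
print(len(overlap))  # bytes of batch 1's tile clobbered by batch 0's worker
```

Batch 0's 64-byte store covers all 24 bytes of batch 1's tile, and also runs 16 bytes past the end of the 48-byte (2x3x2xi32) output buffer; when the two batches are processed by different workers, that is exactly the kind of unsynchronized overlapping access TSan can flag.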

Some key steps from print-ir-after-all. @MaheshRavishankar, @hanhanW, at which step is this going wrong? The wrong behavior we are seeing here is that an 8x2xi32 vector is being written where only 3x2xi32 should be. What are the semantics of the

%7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>

that is introduced by OptimizeTensorInsertExtractSlices? Does it have extract-slice semantics to correctly avoid writing more than 3x2, or is that inherently ill-formed? Next, same question about the

    vector.store %14, %subview_2[%c7, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>

that is created by LLVMCPUVectorTransferLowering? If that is still correct (and should be compiled as a no-op, because 7 is out of bounds in 3x2), then it's down to an LLVM vector lowering bug, and that might explain why we see this only on arm64?

// -----// IR Dump After GenericVectorization (iree-codegen-generic-vectorization) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
  %c0_i32 = arith.constant 0 : i32
  %c2 = arith.constant 2 : index
  %c128 = arith.constant 128 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
    %2 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>> -> tensor<1x3x2xi32>
    %3 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0, 0, 0, 0], sizes = [1, 1, 1, 2, 8], strides = [1, 1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>> -> tensor<1x1x1x2x8xi32>
    %extracted_slice = tensor.extract_slice %3[0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : tensor<1x1x1x2x8xi32> to tensor<2x8xi32>
    %4 = tensor.empty() : tensor<8x2xi32>
    %5 = vector.transfer_read %extracted_slice[%c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<2x8xi32>, vector<2x8xi32>
    %6 = vector.transpose %5, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
    %7 = vector.transfer_write %6, %4[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<8x2xi32>
    %extracted_slice_0 = tensor.extract_slice %7[0, 0] [3, 2] [1, 1] : tensor<8x2xi32> to tensor<3x2xi32>
    %inserted_slice = tensor.insert_slice %extracted_slice_0 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>
    flow.dispatch.tensor.store %inserted_slice, %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : tensor<1x3x2xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
  }
  return
}

// -----// IR Dump After OptimizeTensorInsertExtractSlices (iree-codegen-optimize-tensor-insert-extract-slices) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
  %c0_i32 = arith.constant 0 : i32
  %c2 = arith.constant 2 : index
  %c128 = arith.constant 128 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
    %2 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>> -> tensor<1x3x2xi32>
    %3 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0, 0, 0, 0], sizes = [1, 1, 1, 2, 8], strides = [1, 1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x1x1x2x8xi32>> -> tensor<1x1x1x2x8xi32>
    %4 = vector.transfer_read %3[%c0, %c0, %c0, %c0, %c0], %c0_i32 {in_bounds = [true, true]} : tensor<1x1x1x2x8xi32>, vector<2x8xi32>
    %5 = vector.transpose %4, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
    %6 = tensor.empty() : tensor<3x2xi32>
    %7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>
    %inserted_slice = tensor.insert_slice %7 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>
    flow.dispatch.tensor.store %inserted_slice, %1, offsets = [%arg0, 0, 0], sizes = [1, 3, 2], strides = [1, 1, 1] : tensor<1x3x2xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x3x2xi32>>
  }
  return
}

// -----// IR Dump After LLVMCPUVectorTransferLowering (iree-llvmcpu-vector-transfer-lowering) //----- //
func.func @_batch_matmul_narrow_n_2_dispatch_4_unpack_i32() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
  %c7 = arith.constant 7 : index
  %c6 = arith.constant 6 : index
  %c5 = arith.constant 5 : index
  %c4 = arith.constant 4 : index
  %c3 = arith.constant 3 : index
  %c1 = arith.constant 1 : index
  %cst = arith.constant dense<0> : vector<2x8xi32>
  %c2 = arith.constant 2 : index
  %c128 = arith.constant 128 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %0, 64 : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %1, 64 : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  scf.for %arg0 = %workgroup_id_x to %c2 step %workgroup_count_x {
    %subview = memref.subview %1[%arg0, 0, 0] [1, 3, 2] [1, 1, 1] : memref<2x3x2xi32, #hal.descriptor_type<storage_buffer>> to memref<1x3x2xi32, strided<[6, 2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %subview_0 = memref.subview %0[%arg0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : memref<2x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: 32>, #hal.descriptor_type<storage_buffer>> to memref<1x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %subview_1 = memref.subview %subview_0[0, 0, 0, 0, 0] [1, 1, 1, 2, 8] [1, 1, 1, 1, 1] : memref<1x1x1x2x8xi32, strided<[16, 16, 16, 8, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>
    %2 = vector.load %subview_1[%c0, %c0] : memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>, vector<8xi32>
    %3 = vector.insert %2, %cst [0] : vector<8xi32> into vector<2x8xi32>
    %4 = vector.load %subview_1[%c1, %c0] : memref<2x8xi32, affine_map<(d0, d1)[s0] -> (d0 * 8 + d1 + s0)>, #hal.descriptor_type<storage_buffer>>, vector<8xi32>
    %5 = vector.insert %4, %3 [1] : vector<8xi32> into vector<2x8xi32>
    %6 = vector.transpose %5, [1, 0] : vector<2x8xi32> to vector<8x2xi32>
    %subview_2 = memref.subview %subview[0, 0, 0] [1, 3, 2] [1, 1, 1] : memref<1x3x2xi32, strided<[6, 2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
    %7 = vector.extract %6[0] : vector<2xi32> from vector<8x2xi32>
    vector.store %7, %subview_2[%c0, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %8 = vector.extract %6[1] : vector<2xi32> from vector<8x2xi32>
    vector.store %8, %subview_2[%c1, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %9 = vector.extract %6[2] : vector<2xi32> from vector<8x2xi32>
    vector.store %9, %subview_2[%c2, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %10 = vector.extract %6[3] : vector<2xi32> from vector<8x2xi32>
    vector.store %10, %subview_2[%c3, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %11 = vector.extract %6[4] : vector<2xi32> from vector<8x2xi32>
    vector.store %11, %subview_2[%c4, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %12 = vector.extract %6[5] : vector<2xi32> from vector<8x2xi32>
    vector.store %12, %subview_2[%c5, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %13 = vector.extract %6[6] : vector<2xi32> from vector<8x2xi32>
    vector.store %13, %subview_2[%c6, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
    %14 = vector.extract %6[7] : vector<2xi32> from vector<8x2xi32>
    vector.store %14, %subview_2[%c7, %c0] : memref<3x2xi32, strided<[2, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<2xi32>
  }
  return
}

Note: if I drop OptimizeTensorInsertExtractSlicesPass from the LLVMCPU pipeline, it succeeds.

nice investigation benoit!

We know that the 64 bytes are because of the padding. So in the case of padding up, the allocation of the buffer should consider padding as well.

this is the unset_encoding, which should be going from padded to unpadded, so it's the opposite case (IIUC)

(Unrelated to this issue, but it does highlight it: dispatching two workgroups to handle 64 bytes each is absurd, so this may be something to keep in mind as we try to make these more efficient. There's a lot of instructions between workgroup invocations, including potential thread hops, so we want to be shooting for processing 1000-100000 elements and rarely ever <1000; 128k-256k for a parallel memcpy is kind of the minimum.)

The change from

    %7 = vector.transfer_write %6, %4[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<8x2xi32>
    %extracted_slice_0 = tensor.extract_slice %7[0, 0] [3, 2] [1, 1] : tensor<8x2xi32> to tensor<3x2xi32>
    %inserted_slice = tensor.insert_slice %extracted_slice_0 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>

to

    %7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>
    %inserted_slice = tensor.insert_slice %7 into %2[0, 0, 0] [1, 3, 2] [1, 1, 1] : tensor<3x2xi32> into tensor<1x3x2xi32>

should at the very least change the in_bounds = [true, true] to in_bounds = [false, true]. Without that the IR is wrong. Not sure that is the final solution though.

It seems to me like the issue is that the transfer_write is still marked in bounds after OptimizeTensorInsertExtractSlices. Before that it looks fine to me, so I'm guessing some pattern used by OptimizeTensorInsertExtractSlices is failing to update the in_bounds attribute.

This write is clearly the problem:

%7 = vector.transfer_write %5, %6[%c0, %c0] {in_bounds = [true, true]} : vector<8x2xi32>, tensor<3x2xi32>

I guess we can make that a masked write, but it should at least be marked out of bounds.
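For intuition, here is a rough Python model (hypothetical function, not the actual MLIR lowering) of what an out-of-bounds transfer_write must do: along any dimension marked out of bounds, elements past the destination extent are simply not stored, so an 8x2 vector written into a 3x2 tensor only touches 3 rows.

```python
# Toy model of vector.transfer_write semantics on a 2-D destination:
# in_bounds[d] = False means accesses along dim d may run past the
# destination extent and must be clipped (or masked) instead of stored.
def transfer_write(vec, dest, in_bounds):
    rows = len(vec) if in_bounds[0] else min(len(vec), len(dest))
    cols = len(vec[0]) if in_bounds[1] else min(len(vec[0]), len(dest[0]))
    for i in range(rows):
        for j in range(cols):
            dest[i][j] = vec[i][j]

dest = [[0, 0] for _ in range(3)]                       # tensor<3x2xi32>
vec = [[r * 10 + c for c in range(2)] for r in range(8)]  # vector<8x2xi32>
transfer_write(vec, dest, in_bounds=[False, True])
print(dest)  # only the first 3 of the 8 vector rows land in the tensor
```

With in_bounds = [true, true] this clipping is skipped entirely, which is the 64-byte overrun seen in the generated code.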

Looks like the pattern responsible for this rewrite is conservatively updating the in_bounds attribute:
https://github.com/llvm/llvm-project/blob/7b6a89f346f281e5b7caa593a8c484eaf4264055/mlir/lib/Dialect/Vector/IR/VectorOps.cpp#L4648-L4649

Based on the comment there, I guess this is an issue with the transfer_write folder:
https://github.com/llvm/llvm-project/blob/7b6a89f346f281e5b7caa593a8c484eaf4264055/mlir/lib/Dialect/Vector/IR/VectorOps.cpp#L4479-L4490

Looking at the IR more, fixing the in_bounds attribute should fix the error, but it needs care to not pessimize too much.

An additional action item could be improving the transfer_read/transfer_write verifier: if there are no masks and the shapes are static, we should cross-check the sizes against the in_bounds attribute.
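The proposed static check could look something like this sketch (hypothetical function name; real verification would live in the MLIR op verifier, this just shows the condition):

```python
# Sketch of the proposed verifier rule: with no mask and fully static
# shapes, in_bounds[d] = True is only valid when the vector extent fits
# within the tensor/memref extent along dimension d.
def verify_in_bounds(vector_shape, tensor_shape, in_bounds):
    for d, (v, t, ib) in enumerate(zip(vector_shape, tensor_shape, in_bounds)):
        if ib and v > t:
            return (f"dim {d}: in_bounds is true but vector size {v} "
                    f"exceeds tensor size {t}")
    return None  # verifies OK

# The offending op: vector<8x2xi32> into tensor<3x2xi32>, in_bounds=[true,true]
print(verify_in_bounds([8, 2], [3, 2], [True, True]))
```

Under this rule the op produced by OptimizeTensorInsertExtractSlices would have been rejected at verification time instead of miscompiling.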

As pointed out by @MaheshRavishankar and @hanhanW, I can confirm the problem is caused by this line:

Bug fixed in #17563