ROCm / rocMLIR

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Convolution hang/crash

pfultz2 opened this issue · comments

This is trace from the crash that includes the mlir program and tuning problem and solution keys:

Problem: gfx90a:sramecc+:xnack- 110     conv -F 1 -f GNCHW -I HWNGC -O NGCHW -n 1 -c 512 -H 256 -W 256 -k 512 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -g 1
Benchmarking solution: 128,256,4,64,128,4,1,1
module {
  func.func @mlir_transpose_convolution_add(%arg0: !migraphx.shaped<1x512x256x256xf32, 0x1x0x0>, %arg1: !migraphx.shaped<256x256x1x512xf32, 131072x512x512x1>, %arg2: !migraphx.shaped<512x512x3x3xf32, 4608x9x3x1>) -> !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1> attributes {arch = "", kernel = "mixr", num_cu = 0 : i64} {
    %0 = migraphx.transpose %arg1 {permutation = [2, 3, 0, 1]} : <256x256x1x512xf32, 131072x512x512x1> -> <1x512x256x256xf32, 512x1x131072x512>
    %1 = migraphx.convolution %0, %arg2 {dilation = [1, 1], group = 1 : i64, padding = [1, 1, 1, 1], padding_mode = 0 : i64, stride = [1, 1]} : <1x512x256x256xf32, 512x1x131072x512>, <512x512x3x3xf32, 4608x9x3x1> -> <1x512x256x256xf32, 33554432x65536x256x1>
    %2 = migraphx.add %1, %arg0 : <1x512x256x256xf32, 33554432x65536x256x1>, <1x512x256x256xf32, 0x1x0x0> -> <1x512x256x256xf32, 33554432x65536x256x1>
    return %2 : !migraphx.shaped<1x512x256x256xf32, 33554432x65536x256x1>
  }
}

DoD

  • Need to be backported to ROCm 6.1

I confirm reproducability of a runtime memory fault on a gfx90a system.

(I also observe this is a fairly weird convolution where the input layout's something like HWNC but we've got a KCYX convolution, so we're hitting some very rarely exercised bits of code ... but why runtime memory fault, I haven't a clue)

Note to the future, disabling collapseContiguousMerges() (that is, the following diff

diff --git a/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp b/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
index 9344904be788..948ddc456760 100644
--- a/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
+++ b/mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp
@@ -815,6 +815,7 @@ int64_t mlir::rock::getMaxVectorizationForDatatype(

 ArrayAttr mlir::rock::collapseContiguousMerges(ArrayAttr transforms,
                                                ArrayRef<int64_t> outputShape) {
+  return transforms;
   ContiguousMergesMap contigousMerges =
       findContiguousGroups(transforms, outputShape);
   SmallVector<Attribute> newTransformMaps;

is a resolution. I wouldn't use it as The Solution unless debugging fails massively, but at least we've got a small set of functions to scrutinize for incorrect behavior under edge cases.

Further update (patch is cooking, pending adding a test for this case):

The general shape of the buggy scenario looks like H, W, N = Merge{256, 256, 1}(gemmN, K = gemmM; permute (HKWN) to (NKHW); where the "hey, are we dividing things up only to multiply them back together again" code gets too optimistic about roping in unit dimensions that aren't actually in the middle of the contiguous region, leading to that first Merge being incorrectly rewritten to Merge{1, 1, 65536}, which then launches array accesses into the stratosphere. The correct rewrite, post-patch, is going to Merge{1, 65536, 1}, which does what we were wanting this whole time.