error: 'memref.alloca' op expected no stack allocations with dynamic shapes
silvasean opened this issue
Describe the bug
core-input.mlir:18:12: error: 'memref.alloca' op expected no stack allocations with dynamic shapes
%6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
^
Full error log: https://gist.github.com/f499dc448652054f9eae68f8dfbc1489
To Reproduce
iree-compile --iree-hal-target-backends=dylib repro.mlir
#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, 0)>
module attributes {torch.debug_module_name = "SoftmaxIntModule"} {
  func @forward(%arg0: tensor<?x?x?xf32>) -> tensor<?x?x?xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f64
    %cst_1 = arith.constant -3.40282347E+38 : f32
    %c0_i64 = arith.constant 0 : i64
    %0 = tensor.dim %arg0, %c0 : tensor<?x?x?xf32>
    %1 = tensor.dim %arg0, %c1 : tensor<?x?x?xf32>
    %2 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xi64>
    %3 = linalg.fill ins(%c0_i64 : i64) outs(%2 : tensor<?x?x1xi64>) -> tensor<?x?x1xi64>
    %4 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xf32>
    %5 = linalg.fill ins(%cst_1 : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: i64):
      %16 = linalg.index 2 : index
      %17 = arith.index_cast %16 : index to i64
      %18 = arith.cmpf ogt, %arg1, %arg2 : f32
      %19 = arith.select %18, %arg1, %arg2 : f32
      %20 = arith.select %18, %17, %arg3 : i64
      linalg.yield %19, %20 : f32, i64
    } -> (tensor<?x?x1xf32>, tensor<?x?x1xi64>)
    %7 = tensor.dim %arg0, %c2 : tensor<?x?x?xf32>
    %8 = arith.cmpi eq, %0, %0 : index
    cf.assert %8, "mismatched size for broadcast"
    %9 = arith.cmpi eq, %1, %1 : index
    cf.assert %9, "mismatched size for broadcast"
    %10 = linalg.init_tensor [%0, %1, %7] : tensor<?x?x?xf32>
    %11 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0, %6#0 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.truncf %cst_0 : f64 to f32
      %17 = arith.mulf %arg2, %16 : f32
      %18 = arith.subf %arg1, %17 : f32
      linalg.yield %18 : f32
    } -> tensor<?x?x?xf32>
    %12 = linalg.generic {indexing_maps = [#map0, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%11 : tensor<?x?x?xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = math.exp %arg1 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    %13 = linalg.fill ins(%cst : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %14 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%12 : tensor<?x?x?xf32>) outs(%13 : tensor<?x?x1xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = arith.addf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x1xf32>
    cf.assert %8, "mismatched size for broadcast"
    cf.assert %9, "mismatched size for broadcast"
    %15 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12, %14 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.divf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    return %15 : tensor<?x?x?xf32>
  }
}
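For reference, the repro is the Torch-MLIR lowering of a softmax over the last dimension (per the `SoftmaxIntModule` debug name). A rough NumPy paraphrase of what the IR computes (my reading of the IR, not from the issue; the argmax result `%6#1` is computed but unused downstream):

```python
import numpy as np

def forward(x):
    # x: rank-3 float32 array (tensor<?x?x?xf32>)
    # Max-reduction over the last axis, kept as size 1 (tensor<?x?x1xf32>);
    # the same linalg.generic also tracks the argmax as i64, but only the
    # max is consumed later.
    m = x.max(axis=2, keepdims=True)
    # Subtract the max, scaled by cst_0 = 1.0 (truncf f64 -> f32).
    shifted = x - np.float32(1.0) * m
    # Elementwise exp, then sum-reduction over the last axis.
    e = np.exp(shifted)
    s = e.sum(axis=2, keepdims=True)
    # Final broadcasted divide: rows sum to 1 along axis 2.
    return e / s
```

The dynamic-shape reductions (`keepdims=True` corresponds to the `?x?x1` result shapes) are what end up as unbounded `memref.alloca`s after bufferization.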
Ugh, more stack allocations.
As @MaheshRavishankar predicted, it seems like the Torch-MLIR test suite covers the dynamic shape cases in a way that our existing testing does not. So it would be quite useful to run these tests presubmit -- even just running iree-compile over a snapshot of .mlir files from the test suite should be sufficient to smoke out most of these issues. That should hopefully be able to run very quickly (~2 minutes) in CI.
Agreed. We have a reasonable body of such tests from xla, because it has been there for a while.
My only concern with doing this is that such tests are not guaranteed to be IR-stable and may need manual fixups. We should document how to regenerate them, since in extreme cases a manual fixup may not be feasible.
Hmm, I enabled the check this week. We accept small stack allocations, but not unknown-size stack allocations. I'll take a look at this case. In the meantime, you can compile with the -iree-codegen-check-ir-before-llvm-conversion=false flag if you want to test correctness; it bypasses the check.
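To illustrate what the check distinguishes (a hand-written sketch, not IR from this repro): a statically shaped alloca under the size limit is accepted, while an alloca whose size depends on runtime values is rejected because the stack usage is unbounded.

```mlir
// Accepted: statically shaped, small stack allocation.
%ok = memref.alloca() : memref<4x8xf32>

// Rejected by the check: %d0 and %d1 are runtime values, so the
// allocation size is unknown at compile time.
%bad = memref.alloca(%d0, %d1) : memref<?x?xf32>
```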
This is the same issue as #8592.
We addressed the issue for ARM codegen with static shapes, which reduces the stack allocation to 16 KB. I have a PR in progress that infers the upper bound of the dynamic sizes.
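Conceptually, the upper-bound approach could turn an unbounded alloca into a static one (my sketch of the idea, not the actual PR; the `16` is an illustrative bound):

```mlir
// Before: unbounded stack allocation, rejected by the check.
%a = memref.alloca(%n) : memref<?xf32>

// After: if analysis proves %n <= 16, allocate the static upper
// bound, which falls under the accepted stack-size limit.
%b = memref.alloca() : memref<16xf32>
```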