iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

error: 'memref.alloca' op expected no stack allocations with dynamic shapes

silvasean opened this issue

Describe the bug

core-input.mlir:18:12: error: 'memref.alloca' op expected no stack allocations with dynamic shapes
    %6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
           ^

Full error log: https://gist.github.com/f499dc448652054f9eae68f8dfbc1489
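
For context, the check that fires here rejects stack allocations whose size is not a small static constant. A minimal sketch of the distinction (hypothetical IR, not taken from the reproducer):

func @stack_allocs(%d: index) {
  // Static, small allocation: the size is known at compile time, so the
  // check can verify it fits within the stack budget.
  %ok = memref.alloca() : memref<4x16xf32>
  // Dynamic allocation: the size depends on %d, so the check rejects it.
  %bad = memref.alloca(%d) : memref<?x16xf32>
  return
}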

To Reproduce

iree-compile --iree-hal-target-backends=dylib repro.mlir

#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, 0)>
module attributes {torch.debug_module_name = "SoftmaxIntModule"} {
  func @forward(%arg0: tensor<?x?x?xf32>) -> tensor<?x?x?xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f64
    %cst_1 = arith.constant -3.40282347E+38 : f32
    %c0_i64 = arith.constant 0 : i64
    %0 = tensor.dim %arg0, %c0 : tensor<?x?x?xf32>
    %1 = tensor.dim %arg0, %c1 : tensor<?x?x?xf32>
    %2 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xi64>
    %3 = linalg.fill ins(%c0_i64 : i64) outs(%2 : tensor<?x?x1xi64>) -> tensor<?x?x1xi64>
    %4 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xf32>
    %5 = linalg.fill ins(%cst_1 : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: i64):
      %16 = linalg.index 2 : index
      %17 = arith.index_cast %16 : index to i64
      %18 = arith.cmpf ogt, %arg1, %arg2 : f32
      %19 = arith.select %18, %arg1, %arg2 : f32
      %20 = arith.select %18, %17, %arg3 : i64
      linalg.yield %19, %20 : f32, i64
    } -> (tensor<?x?x1xf32>, tensor<?x?x1xi64>)
    %7 = tensor.dim %arg0, %c2 : tensor<?x?x?xf32>
    %8 = arith.cmpi eq, %0, %0 : index
    cf.assert %8, "mismatched size for broadcast"
    %9 = arith.cmpi eq, %1, %1 : index
    cf.assert %9, "mismatched size for broadcast"
    %10 = linalg.init_tensor [%0, %1, %7] : tensor<?x?x?xf32>
    %11 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0, %6#0 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.truncf %cst_0 : f64 to f32
      %17 = arith.mulf %arg2, %16 : f32
      %18 = arith.subf %arg1, %17 : f32
      linalg.yield %18 : f32
    } -> tensor<?x?x?xf32>
    %12 = linalg.generic {indexing_maps = [#map0, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%11 : tensor<?x?x?xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = math.exp %arg1 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    %13 = linalg.fill ins(%cst : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %14 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%12 : tensor<?x?x?xf32>) outs(%13 : tensor<?x?x1xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = arith.addf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x1xf32>
    cf.assert %8, "mismatched size for broadcast"
    cf.assert %9, "mismatched size for broadcast"
    %15 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12, %14 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.divf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    return %15 : tensor<?x?x?xf32>
  }
}

Ugh, more stack allocations.

As @MaheshRavishankar predicted, it seems like the Torch-MLIR test suite covers the dynamic shape cases in a way that our existing testing does not. So it would be quite useful to run these tests presubmit -- even just running iree-compile over a snapshot of .mlir files from the test suite should be sufficient to smoke out most of these issues. That should hopefully be able to run very quickly (~2 minutes) in CI.
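
A hypothetical smoke-test invocation over one snapshotted file (path and file name illustrative):

iree-compile --iree-hal-target-backends=dylib snapshots/SoftmaxIntModule.mlir -o /dev/null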

Agreed. We have a reasonable body of such tests from XLA, since it has been around for a while.

My only thought in doing this is that such tests are not guaranteed to be IR-stable and may need manual fixups. We should document how to regenerate them, since in some extreme cases it may not be feasible to fix them up manually.

Hmm, I enabled the check this week. We accept small stack allocations, but not unknown-size stack allocations. I'll take a look at this case. In the meantime, you can compile with the -iree-codegen-check-ir-before-llvm-conversion=false flag if you want to test correctness; it bypasses the check.
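
For example, on the reproducer above, something like:

iree-compile --iree-hal-target-backends=dylib -iree-codegen-check-ir-before-llvm-conversion=false repro.mlir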

This is the same issue as #8592.

We addressed this issue for ARM codegen with static shapes, which reduced the stack allocation to 16 KB. I have an improved PR that infers the upper bound of dynamic sizes.
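
As a hypothetical sketch of what that inference enables (bounds and shapes illustrative): once a dynamic size is proven bounded, the allocation can be made static at the bound, which the check accepts.

// Before: size only known at runtime, so the check rejects it.
%a = memref.alloca(%d) : memref<?x16xf32>
// After, assuming analysis proves %d <= 4: a static allocation at the
// inferred bound, which fits within the stack budget.
%b = memref.alloca() : memref<4x16xf32>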