error: 'memref.alloca' op expected no stack allocations with dynamic shapes
silvasean opened this issue
Describe the bug
core-input.mlir:18:12: error: 'memref.alloca' op expected no stack allocations with dynamic shapes
%6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
^
Full error log: https://gist.github.com/f499dc448652054f9eae68f8dfbc1489
To Reproduce
iree-compile --iree-hal-target-backends=dylib repro.mlir
#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, 0)>
module attributes {torch.debug_module_name = "SoftmaxIntModule"} {
  func @forward(%arg0: tensor<?x?x?xf32>) -> tensor<?x?x?xf32> {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c2 = arith.constant 2 : index
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f64
    %cst_1 = arith.constant -3.40282347E+38 : f32
    %c0_i64 = arith.constant 0 : i64
    %0 = tensor.dim %arg0, %c0 : tensor<?x?x?xf32>
    %1 = tensor.dim %arg0, %c1 : tensor<?x?x?xf32>
    %2 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xi64>
    %3 = linalg.fill ins(%c0_i64 : i64) outs(%2 : tensor<?x?x1xi64>) -> tensor<?x?x1xi64>
    %4 = linalg.init_tensor [%0, %1, 1] : tensor<?x?x1xf32>
    %5 = linalg.fill ins(%cst_1 : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %6:2 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0 : tensor<?x?x?xf32>) outs(%5, %3 : tensor<?x?x1xf32>, tensor<?x?x1xi64>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: i64):
      %16 = linalg.index 2 : index
      %17 = arith.index_cast %16 : index to i64
      %18 = arith.cmpf ogt, %arg1, %arg2 : f32
      %19 = arith.select %18, %arg1, %arg2 : f32
      %20 = arith.select %18, %17, %arg3 : i64
      linalg.yield %19, %20 : f32, i64
    } -> (tensor<?x?x1xf32>, tensor<?x?x1xi64>)
    %7 = tensor.dim %arg0, %c2 : tensor<?x?x?xf32>
    %8 = arith.cmpi eq, %0, %0 : index
    cf.assert %8, "mismatched size for broadcast"
    %9 = arith.cmpi eq, %1, %1 : index
    cf.assert %9, "mismatched size for broadcast"
    %10 = linalg.init_tensor [%0, %1, %7] : tensor<?x?x?xf32>
    %11 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0, %6#0 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.truncf %cst_0 : f64 to f32
      %17 = arith.mulf %arg2, %16 : f32
      %18 = arith.subf %arg1, %17 : f32
      linalg.yield %18 : f32
    } -> tensor<?x?x?xf32>
    %12 = linalg.generic {indexing_maps = [#map0, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%11 : tensor<?x?x?xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = math.exp %arg1 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    %13 = linalg.fill ins(%cst : f32) outs(%4 : tensor<?x?x1xf32>) -> tensor<?x?x1xf32>
    %14 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "parallel", "reduction"]} ins(%12 : tensor<?x?x?xf32>) outs(%13 : tensor<?x?x1xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):
      %16 = arith.addf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x1xf32>
    cf.assert %8, "mismatched size for broadcast"
    cf.assert %9, "mismatched size for broadcast"
    %15 = linalg.generic {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12, %14 : tensor<?x?x?xf32>, tensor<?x?x1xf32>) outs(%10 : tensor<?x?x?xf32>) {
    ^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
      %16 = arith.divf %arg1, %arg2 : f32
      linalg.yield %16 : f32
    } -> tensor<?x?x?xf32>
    return %15 : tensor<?x?x?xf32>
  }
}
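For reference, the repro is the Torch-MLIR lowering of a softmax over the last dimension (per the `SoftmaxIntModule` debug name). A rough NumPy paraphrase of what the IR computes (my reading of the IR, not from the issue; the argmax result `%6#1` is computed but unused downstream):

```python
import numpy as np

def forward(x):
    # x: rank-3 float32 array (tensor<?x?x?xf32>)
    # Max-reduction over the last axis, kept as size 1 (tensor<?x?x1xf32>);
    # the same linalg.generic also tracks the argmax as i64, but only the
    # max is consumed later.
    m = x.max(axis=2, keepdims=True)
    # Subtract the max, scaled by cst_0 = 1.0 (truncf f64 -> f32).
    shifted = x - np.float32(1.0) * m
    # Elementwise exp, then sum-reduction over the last axis.
    e = np.exp(shifted)
    s = e.sum(axis=2, keepdims=True)
    # Final broadcasted divide: rows sum to 1 along axis 2.
    return e / s
```

The dynamic-shape reductions (`keepdims=True` corresponds to the `?x?x1` result shapes) are what end up as unbounded `memref.alloca`s after bufferization.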
Ugh, more stack allocations.
As @MaheshRavishankar predicted, it seems like the Torch-MLIR test suite covers the dynamic shape cases in a way that our existing testing does not. So it would be quite useful to run these tests presubmit -- even just running iree-compile over a snapshot of .mlir files from the test suite should be sufficient to smoke out most of these issues. That should hopefully be able to run very quickly (~2 minutes) in CI.
Agreed. We have a reasonable body of such tests from xla, because it has been there for a while.
My only concern with doing this is that such tests are not guaranteed to be IR-stable and may need manual fixups. We should document how to regenerate them, since in extreme cases a manual fixup may not be feasible.
Hmm, I enabled the check this week. We accept small stack allocations, but not unknown-size stack allocations. I'll take a look at this case. In the meantime, you can compile with the -iree-codegen-check-ir-before-llvm-conversion=false flag if you want to test correctness; it bypasses the check.
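To illustrate what the check distinguishes (a hand-written sketch, not IR from this repro): a statically shaped alloca under the size limit is accepted, while an alloca whose size depends on runtime values is rejected because the stack usage is unbounded.

```mlir
// Accepted: statically shaped, small stack allocation.
%ok = memref.alloca() : memref<4x8xf32>

// Rejected by the check: %d0 and %d1 are runtime values, so the
// allocation size is unknown at compile time.
%bad = memref.alloca(%d0, %d1) : memref<?x?xf32>
```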
This is the same issue as #8592.
We addressed the issue for ARM codegen with static shapes, which reduces the stack allocation to 16 KB. I have a PR in progress that infers the upper bound of the dynamic sizes.
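Conceptually, the upper-bound approach could turn an unbounded alloca into a static one (my sketch of the idea, not the actual PR; the `16` is an illustrative bound):

```mlir
// Before: unbounded stack allocation, rejected by the check.
%a = memref.alloca(%n) : memref<?xf32>

// After: if analysis proves %n <= 16, allocate the static upper
// bound, which falls under the accepted stack-size limit.
%b = memref.alloca() : memref<16xf32>
```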