High compilation time is spent on the Canonicalizer for large models
GeorgeARM opened this issue · comments
I have been exploring IREE compilation time over a range of models of varying size and complexity, and noticed high compilation times on "heavy" models with large constant data.
Extracting some timing information showed that the Canonicalizer run after TosaToLinalgNamed consumes an unreasonably large portion of the execution time.
Example timing output on Inception v3:
===-------------------------------------------------------------------------===
... Execution time report ...
===-------------------------------------------------------------------------===
Total Execution Time: 38.0063 seconds
----User Time---- ----Wall Time---- ----Name----
... skipped lines ...
34.6385 ( 54.2%) 34.6385 ( 91.1%) 'func.func' Pipeline
0.0007 ( 0.0%) 0.0007 ( 0.0%) TosaMakeBroadcastable
0.0003 ( 0.0%) 0.0003 ( 0.0%) TosaToArith
0.0002 ( 0.0%) 0.0002 ( 0.0%) TosaToTensor
0.0007 ( 0.0%) 0.0007 ( 0.0%) Canonicalizer
0.0010 ( 0.0%) 0.0010 ( 0.0%) TosaOptionalDecompositions
0.0279 ( 0.0%) 0.0279 ( 0.1%) Canonicalizer
0.0008 ( 0.0%) 0.0008 ( 0.0%) TosaMakeBroadcastable
0.0114 ( 0.0%) 0.0114 ( 0.0%) TosaToLinalgNamed
34.5102 ( 54.0%) 34.5102 ( 90.8%) Canonicalizer
... skipped lines ...
Printing the IR before and after this pass highlights the injection of tosa.transpose operations on the constant weights of each conv2d, which convert the TOSA weight layout (FHWC) into the Linalg-compatible layout (HWCF). For example:
%cst_2 = arith.constant dense<[1, 2, 3, 0]> : tensor<4xi64>
%2 = "tosa.transpose"(%cst, %cst_2) : (tensor<64x3x3x32xf32>, tensor<4xi64>) -> tensor<3x3x32x64xf32>
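Conceptually, the canonicalization in question folds such a transpose of a constant into a new constant at compile time. A minimal numpy sketch of what that fold computes (the shapes mirror the IR above; the variable names are illustrative, not from IREE):

```python
import numpy as np

# Constant conv2d weights in TOSA's FHWC layout: 64 filters, 3x3 kernel, 32 channels.
weights_fhwc = np.arange(64 * 3 * 3 * 32, dtype=np.float32).reshape(64, 3, 3, 32)

# The permutation injected by TosaToLinalgNamed: FHWC -> HWCF.
perm = (1, 2, 3, 0)

# Folding tosa.transpose into the constant: materialize the permuted data once
# at compile time, so no transpose op remains in the IR.
weights_hwcf = np.transpose(weights_fhwc, perm)
assert weights_hwcf.shape == (3, 3, 32, 64)
```

The fold itself is cheap in principle; the compile-time cost reported here comes from how the pattern materializes the permuted elements, not from the fold being performed at all.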
Profiling with callgrind suggests the issue lies in the ConstantTransposeOptimization pattern, which is registered as a canonicalization pattern.
Overall, I am not sure something like this should be part of the Canonicalizer in the first place, for a variety of reasons.
I suppose this issue needs to be migrated/moved to the LLVM repo itself?
Steps to reproduce
- Download https://tfhub.dev/tensorflow/lite-model/inception_v3/1/default/1 for example
- Build IREE like:
cmake -G Ninja .. \
-DCMAKE_INSTALL_PREFIX=./install \
-DCMAKE_BUILD_TYPE=Release \
-DIREE_ENABLE_ASSERTIONS=ON \
-DIREE_BUILD_COMPILER=ON \
-DIREE_BUILD_TESTS=OFF \
-DIREE_BUILD_BENCHMARKS=OFF \
-DIREE_BUILD_SAMPLES=OFF
cmake --build . --target install -- -k 0
- Convert model to MLIR:
iree-import-tflite inception_v3_1_default_1.tflite -o inception_v3.mlir
- Compile model and extract timings:
./iree-translate --iree-mlir-to-vm-bytecode-module --iree-input-type=tosa --iree-hal-target-backends=vulkan-spirv --iree-vulkan-target-triple=valhall-unknown-android11 inception_v3.mlir -o inception_v3.mali-target.vmfb --mlir-timing
Oh, my bad. I added that pattern previously with a naive implementation. Later I introduced a similar pattern in Linalg with a better implementation, but I never got back to improving the TOSA one. I guess the pattern in TOSA can be deleted now, given that we can fold it at the Linalg level; or it can be updated the way it is written in Linalg to improve it.
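For intuition on why the implementation strategy matters here: a pattern that recomputes a multi-dimensional index and fetches the source value one element at a time scales much worse on large weight tensors than a single bulk permute over the raw data, even though both produce the same constant. A hedged Python sketch of the two strategies (illustrative only, not the actual MLIR code):

```python
import numpy as np

def transpose_naive(data, perm):
    """Per-element transpose: compute each destination index individually,
    mimicking a pattern that queries the source attribute value by value."""
    dst_shape = tuple(data.shape[p] for p in perm)
    out = np.empty(dst_shape, dtype=data.dtype)
    for src_idx in np.ndindex(*data.shape):
        dst_idx = tuple(src_idx[p] for p in perm)
        out[dst_idx] = data[src_idx]
    return out

def transpose_bulk(data, perm):
    """Bulk transpose: one library-level permute over the whole buffer."""
    return np.ascontiguousarray(np.transpose(data, perm))

w = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
assert np.array_equal(transpose_naive(w, (1, 2, 3, 0)),
                      transpose_bulk(w, (1, 2, 3, 0)))
```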
Thanks for your prompt response @antiagainst.
Yes, it seems sensible to either remove it completely, or to move it outside the canonicalizer, rework it the way the Linalg one is written, and put it in its own pass.
Will have a look and upstream a fix. Should I leave this open until fix is merged?
SGTM, thanks!
https://reviews.llvm.org/D124685 is landed. Thanks @GeorgeARM! Closing this. Please reopen if you still see issues afterwards.