iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

High compilation time is spent on the Canonicalizer for large models

GeorgeARM opened this issue

I have been exploring IREE compilation time over a range of models of varying size and complexity.
I noticed high compilation times on "heavy" models with large const data.
Extracting some timing information showed that the Canonicalizer run after TosaToLinalgNamed consumes an unreasonably large portion of the execution time.

An example timing report for Inception v3:

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 38.0063 seconds

  ----User Time----  ----Wall Time----  ----Name----
                             ... skipped lines ...
   34.6385 ( 54.2%)   34.6385 ( 91.1%)  'func.func' Pipeline
    0.0007 (  0.0%)    0.0007 (  0.0%)    TosaMakeBroadcastable
    0.0003 (  0.0%)    0.0003 (  0.0%)    TosaToArith
    0.0002 (  0.0%)    0.0002 (  0.0%)    TosaToTensor
    0.0007 (  0.0%)    0.0007 (  0.0%)    Canonicalizer
    0.0010 (  0.0%)    0.0010 (  0.0%)    TosaOptionalDecompositions
    0.0279 (  0.0%)    0.0279 (  0.1%)    Canonicalizer
    0.0008 (  0.0%)    0.0008 (  0.0%)    TosaMakeBroadcastable
    0.0114 (  0.0%)    0.0114 (  0.0%)    TosaToLinalgNamed
   34.5102 ( 54.0%)   34.5102 ( 90.8%)    Canonicalizer
                             ... skipped lines ...

Printing the IR before and after highlights the injection of transpose operations on the constant weights of Conv2d, which bring the TOSA weight layout (FHWC) to the Linalg-compatible one (HWCF), something that can be noted here as well.

e.g.

%cst_2 = arith.constant dense<[1, 2, 3, 0]> : tensor<4xi64>
%2 = "tosa.transpose"(%cst, %cst_2) : (tensor<64x3x3x32xf32>, tensor<4xi64>) -> tensor<3x3x32x64xf32>

Profiling with callgrind seems to reveal that the issue lies in the ConstantTransposeOptimization here, which is registered as a canonicalization pattern.

[callgrind profile screenshot]
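
For illustration, below is a minimal sketch of the kind of linear-pass fold that avoids this cost, restricted to f32 constants; the helper name is an assumption for this sketch, not the upstream implementation (the actual fix is linked at the end of this thread). The idea is to permute the raw values in a single pass over the data instead of materializing per-element attributes:

// Sketch only: fold transpose(constant) by permuting raw f32 data in one
// pass. Helper name and f32 restriction are illustrative assumptions.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "llvm/ADT/SmallVector.h"

static mlir::DenseElementsAttr
transposeDenseF32(mlir::DenseElementsAttr input,
                  llvm::ArrayRef<int64_t> perms,
                  mlir::RankedTensorType resultType) {
  mlir::ShapedType inputType = input.getType();
  llvm::ArrayRef<int64_t> inShape = inputType.getShape();
  int64_t rank = inputType.getRank();

  // Row-major strides of the input tensor.
  llvm::SmallVector<int64_t> inStrides(rank, 1);
  for (int64_t d = rank - 2; d >= 0; --d)
    inStrides[d] = inStrides[d + 1] * inShape[d + 1];

  // Read the source values once, avoiding per-element Attribute handling.
  auto range = input.getValues<float>();
  llvm::SmallVector<float> src(range.begin(), range.end());

  // Walk the output index space in row-major order, gathering from the
  // source through the permutation: srcIndex[perms[d]] == outIndex[d].
  llvm::ArrayRef<int64_t> outShape = resultType.getShape();
  llvm::SmallVector<int64_t> outIndex(rank, 0);
  llvm::SmallVector<float> result;
  result.reserve(src.size());
  for (int64_t i = 0, e = src.size(); i < e; ++i) {
    int64_t srcOffset = 0;
    for (int64_t d = 0; d < rank; ++d)
      srcOffset += outIndex[d] * inStrides[perms[d]];
    result.push_back(src[srcOffset]);
    // Advance the output multi-index.
    for (int64_t d = rank - 1; d >= 0; --d) {
      if (++outIndex[d] < outShape[d])
        break;
      outIndex[d] = 0;
    }
  }
  return mlir::DenseElementsAttr::get(resultType, result);
}

Wrapped in an OpRewritePattern on tosa.transpose, this would replace the transpose with a single arith.constant holding the permuted data, e.g. folding the tensor<3x3x32x64xf32> example above into one constant.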

Overall, I am not sure something like this should be part of the Canonicalizer in the first place, for a variety of reasons.
I suppose this issue needs to be migrated/moved to the LLVM repo itself?

Steps to reproduce

  • Build IREE:
cmake -G Ninja .. \
	-DCMAKE_INSTALL_PREFIX=./install \
	-DCMAKE_BUILD_TYPE=Release \
	-DIREE_ENABLE_ASSERTIONS=ON \
	-DIREE_BUILD_COMPILER=ON \
	-DIREE_BUILD_TESTS=OFF \
	-DIREE_BUILD_BENCHMARKS=OFF \
	-DIREE_BUILD_SAMPLES=OFF
cmake --build . --target install -- -k 0
  • Convert model to MLIR:
iree-import-tflite inception_v3_1_default_1.tflite -o inception_v3.mlir
  • Compile the model and extract timings:
./iree-translate --iree-mlir-to-vm-bytecode-module --iree-input-type=tosa --iree-hal-target-backends=vulkan-spirv --iree-vulkan-target-triple=valhall-unknown-android11 inception_v3.mlir -o inception_v3.mali-target.vmfb --mlir-timing

Oh, my bad. I added that pattern previously with a naive implementation. Later I introduced a similar pattern in Linalg and improved it with a better implementation, but I never got back to improving the TOSA one. I guess the pattern in TOSA can be deleted now, given that we can fold it at the Linalg level, or updated along the lines of the Linalg implementation to improve it.

Thanks for your prompt response @antiagainst.
Yes, it seems sensible to either remove it completely, or to move it out of the canonicalizer, rework it along the lines of the Linalg implementation, and put it in its own pass.
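
As a sketch of the "own pass" option, something like the following could register the fold in a dedicated pass instead of as a canonicalization; the pass name is hypothetical, and FoldConstantTranspose stands in for a pattern built on a helper like the one sketched earlier (declared here only to keep the example self-contained):

// Hypothetical standalone pass; names are assumptions for illustration,
// not the upstream code.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Tosa/IR/TosaOps.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

namespace {
// Assumed pattern wrapping the raw-data fold sketched above; its
// definition is elided here.
struct FoldConstantTranspose
    : mlir::OpRewritePattern<mlir::tosa::TransposeOp> {
  using OpRewritePattern::OpRewritePattern;
  mlir::LogicalResult
  matchAndRewrite(mlir::tosa::TransposeOp op,
                  mlir::PatternRewriter &rewriter) const override;
};

struct FoldTransposeConstantsPass
    : mlir::PassWrapper<FoldTransposeConstantsPass,
                        mlir::OperationPass<mlir::func::FuncOp>> {
  llvm::StringRef getArgument() const override {
    return "fold-transpose-constants";
  }
  void runOnOperation() override {
    mlir::RewritePatternSet patterns(&getContext());
    // The expensive fold now only runs when this pass is scheduled,
    // instead of inside every Canonicalizer invocation.
    patterns.add<FoldConstantTranspose>(&getContext());
    if (mlir::failed(mlir::applyPatternsAndFoldGreedily(
            getOperation(), std::move(patterns))))
      signalPassFailure();
  }
};
} // namespace

This keeps the Canonicalizer cheap on models with large const data while still making the fold available where it pays off.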

I will have a look and upstream a fix. Should I leave this open until the fix is merged?

SGTM, thanks!

https://reviews.llvm.org/D124685 has landed. Thanks @GeorgeARM! Closing this; please reopen if you still see issues afterwards.