iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

Compiler failure: cannot get concrete layout for contraction in vector distribution

IanNod opened this issue · comments

What happened?

Error output when compiling the SDXL UNet model with varying batch sizes. The same/similar error appears for batch sizes 8, 16, and 24 (the pattern seems to be batch sizes divisible by 8).

failed to translate executables
failed to translate executables
<unknown>:0: error: cannot get concrete layout for contraction
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1813:11: error: 'func.func' op failed to distribute
    %40 = torch.aten.silu %39 : !torch.vtensor<[32,1280],f16> -> !torch.vtensor<[32,1280],f16>
          ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1715:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[16,4,128,128],f16>, !torch.vtensor<[32,64,2048],f16>, !torch.vtensor<[32,1280],f16>, !torch.vtensor<[32,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[16,4,128,128],f16>
         ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1813:11: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx940", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>
    %40 = torch.aten.silu %39 : !torch.vtensor<[32,1280],f16> -> !torch.vtensor<[32,1280],f16>
          ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1715:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[16,4,128,128],f16>, !torch.vtensor<[32,64,2048],f16>, !torch.vtensor<[32,1280],f16>, !torch.vtensor<[32,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[16,4,128,128],f16>
         ^
<unknown>:0: error: cannot get concrete layout for contraction
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1929:11: error: 'func.func' op failed to distribute
    %91 = torch.aten.silu %90 : !torch.vtensor<[32,1280],f16> -> !torch.vtensor<[32,1280],f16>
          ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1715:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[16,4,128,128],f16>, !torch.vtensor<[32,64,2048],f16>, !torch.vtensor<[32,1280],f16>, !torch.vtensor<[32,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[16,4,128,128],f16>
         ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1929:11: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx940", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>
    %91 = torch.aten.silu %90 : !torch.vtensor<[32,1280],f16> -> !torch.vtensor<[32,1280],f16>
          ^
../sdxl-scripts/tmp/PNDM_unet_30.mlir:1715:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[16,4,128,128],f16>, !torch.vtensor<[32,64,2048],f16>, !torch.vtensor<[32,1280],f16>, !torch.vtensor<[32,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[16,4,128,128],f16>
         ^

An IR dump after all passes (--mlir-print-ir-after-all) shows that the first failing pass is vector distribution; the full dump can be found here: https://sharkpublic.blob.core.windows.net/sharkpublic/ian/ir_after_all.txt
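
For reference, a dump like that can be regenerated by adding MLIR's IR-printing flags to the repro command below and redirecting stderr. This is only a sketch (the exact invocation used to produce the linked file may differ; --mlir-disable-threading is assumed here to keep the printed IR readable):

./build/tools/iree-compile <same flags as the repro command below> \
  --mlir-print-ir-after-all --mlir-disable-threading \
  "compiled_scheduled_unet_run_forward$async_dispatch_3.mlir" -o test.vmfb 2> ir_after_all.txt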

Steps to reproduce your issue

The failing dispatch can be found here: https://gist.github.com/IanNod/5588ef0263f5b947753879239b6e73d9
The full model this dispatch came from (batch size 16): https://sharkpublic.blob.core.windows.net/sharkpublic/ian/PNDM_unet_30.mlir
The attention spec file used: https://github.com/nod-ai/SHARK-Turbine/blob/main/models/turbine_models/custom_models/sdxl_inference/default_mfma_attn_spec.mlir

Compile command to reproduce the error (duplicate flags removed from the original invocation):
./build/tools/iree-compile \
  --iree-input-type=torch \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=rocm \
  --iree-rocm-target-chip=gfx940 \
  --mlir-print-debuginfo \
  --mlir-print-op-on-diagnostic=false \
  --iree-flow-enable-aggressive-fusion \
  --iree-global-opt-enable-fuse-horizontal-contractions=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-codegen-llvmgpu-use-vector-distribution=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-const-eval=false \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-inline-constants-max-byte-length=1 \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))" \
  --iree-codegen-transform-dialect-library=../sdxl-scripts/tmp/attention_and_matmul_spec_mfma.mlir \
  "compiled_scheduled_unet_run_forward$async_dispatch_3.mlir" -o test.vmfb
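
For context on where the dispatch input file comes from: as a hedged sketch (assuming IREE's executable-source dump flag and a hypothetical dump/ output directory), individual dispatch sources like compiled_scheduled_unet_run_forward$async_dispatch_3.mlir can be dumped while compiling the full model:

./build/tools/iree-compile ../sdxl-scripts/tmp/PNDM_unet_30.mlir <same flags as above> \
  --iree-hal-dump-executable-sources-to=dump/ -o /dev/null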

What component(s) does this issue relate to?

Compiler

Version information

6291224
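
For reproduction, a minimal sketch of building iree-compile at this commit (assuming a Ninja-based CMake build and that the ROCm compiler target option is named IREE_TARGET_BACKEND_ROCM; exact options may differ):

git checkout 6291224
git submodule update --init
cmake -G Ninja -B build -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_TARGET_BACKEND_ROCM=ON
cmake --build build --target iree-compile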

Additional context

No response