iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/


(znver4/cpu) numerics issues/bad results from different compile flags on SDXL VAE

monorimet opened this issue · comments

My success/failure cases here have been a bit inconsistent, so I'll try to be as explicit as possible.

@daveliddell and I are trying to get SDXL VAE on f32 working with AIE offload, and first we need to make sure that without the offload, it is working and giving correct results.

Assume we are operating on the tip of the shared/tresleches-cpu branch (I am actually working from a merge of it and shared/tresleches-united at tresleches-united-cpu-merge; sorry). Feel free to take my results with a grain of salt; I will clarify where it matters.

First, links to artifacts:

MLIR (VAE decode, input size 1x4x64x64, f32, attention decomposed):
weights inlined (~600MB)
weights externalized (~300KB) -- the filename says SDXL base, but don't worry; they're the same.

Example input:
example_input.npy

Golden output (PyTorch CPU result) given the example input above; a reproduction sketch follows after the artifact list:
vae_golden_output.npy

Parameters (if using external weights):
vae_params_f32 (~300MB)
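
For reference, here is a hedged sketch of how a golden output like this can be produced on PyTorch CPU with diffusers. The model ID, dtype handling, and omission of latent scaling are assumptions; the thread only states that the golden output is the PyTorch CPU result for the f32 SDXL VAE decode:

import numpy as np
import torch
from diffusers.models import AutoencoderKL

# Assumed model source; any latent scaling (vae.config.scaling_factor) is
# omitted here and may be needed depending on how example_input.npy was made.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="vae",
    torch_dtype=torch.float32,
)
vae.eval()

latents = torch.from_numpy(np.load("example_input.npy"))  # 1x4x64x64, f32
with torch.no_grad():
    image = vae.decode(latents).sample                    # 1x3x512x512, f32
np.save("vae_golden_output.npy", image.numpy())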

I present a few success/failure cases (compile commands):

(Note: these all fail at runtime for me and don't give any outputs. I may have a broken setup, so I'll at least recount what Dave and I found on a call yesterday.)

  1. Without winograd or bf16 demotion
iree-compile `
 --iree-hal-target-backends=llvm-cpu `
 --iree-llvmcpu-target-cpu=znver4 `
 --iree-flow-enable-fuse-padding-into-linalg-consumer-ops `
 --iree-llvmcpu-enable-ukernels=mmt4d,pack,unpack `
 --iree-flow-collapse-reduction-dims `
 --iree-opt-const-expr-max-size-increase-threshold=1000000000000000 `
 --iree-opt-const-eval=false `
 .\vae_decode.mlir -o ./vae_f32.vmfb

Fails for me at runtime (exits without an output) but works for @daveliddell on what should be the same hardware, giving good numerics (~2e-5 max diff against the PyTorch CPU result). The difference could be the IREE version (see the note about my branch above).

  2. With bf16 demotion
iree-compile `
 --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-demote-contraction-inputs-to-bf16))" `
 --iree-hal-target-backends=llvm-cpu `
 --iree-llvmcpu-target-cpu=znver4 `
 --iree-flow-enable-fuse-padding-into-linalg-consumer-ops `
 --iree-llvmcpu-enable-ukernels=mmt4d,pack,unpack `
 --iree-flow-collapse-reduction-dims `
 --iree-opt-const-expr-max-size-increase-threshold=1000000000000000 `
 --iree-opt-const-eval=false `
 .\vae_decode.mlir -o ./vae_f32.vmfb

Gives worse numerics (on the order of 0.4 max diff, if I remember correctly) -- @daveliddell, if you could follow up with the exact results, I'd appreciate it.

  3. With winograd and bf16 demotion

Naturally it diverges further with winograd; I'll give the compile command anyway.

iree-compile `
 --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-linalg-ext-convert-conv2d-to-winograd{replace-all-convs=true},iree-global-opt-demote-contraction-inputs-to-bf16))" `
 --iree-hal-target-backends=llvm-cpu `
 --iree-llvmcpu-target-cpu=znver4 `
 --iree-flow-enable-fuse-padding-into-linalg-consumer-ops `
 --iree-llvmcpu-enable-ukernels=mmt4d,pack,unpack `
 --iree-flow-collapse-reduction-dims `
 --iree-opt-const-expr-max-size-increase-threshold=1000000000000000 `
 --iree-opt-const-eval=false `
 .\vae_decode.mlir -o ./vae_f32.vmfb

Run:

iree-run-module `
  --device=local-task `
  --module=vae_f32.vmfb `
  --parameters=model=vae_f32.safetensors `
  --function=main `
  --input=1x4x64x64xf32=@example_input.npy `
  --expected_output=@vae_golden_output.npy
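
If you want to regenerate an input of the right shape instead of downloading example_input.npy, a trivial sketch (random latents, so this will not match the golden output above):

import numpy as np

# Random stand-in for example_input.npy; only the shape/dtype matter to
# iree-run-module's --input=1x4x64x64xf32=@example_input.npy flag.
np.save("example_input.npy",
        np.random.randn(1, 4, 64, 64).astype(np.float32))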

The provided IR with external weights has an attention op. This was unintended.

This IR has the attention op decomposed:
https://sharkpublic.blob.core.windows.net/sharkpublic/ean/vae_f32_num/stable_diffusion_xl_base_1_0_bs1_512x512_fp32_vae_decode_decomp.mlir

Okay, I'm able to compile and run the model. Thanks for the update from @monorimet.

This is the script that I used to compare the results. I don't see numeric issues with bf16 demotion only; the results are close with an (atol=0.1, rtol=0.05) config. The compile command is:

iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=znver4 \
  --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu \
  ~/vae_decode_decomp.mlir -o /tmp/vae.vmfb \
  --iree-llvmcpu-enable-ukernels=mmt4d \
  --iree-flow-collapse-reduction-dims \
  --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-demote-contraction-inputs-to-bf16))"
❯ python compare_npy.py -a vae_output.npy -b vae_golden_output.npy
all_close: True. shape: (1, 3, 512, 512)
a[0, 0, 0]: 0.3924916982650757, b[0, 0, 0]: 0.39249464869499207
a[0, 1, 0]: 0.3716752529144287, b[0, 1, 0]: 0.3716789186000824
a[0, 0, 1]: 0.3770805299282074, b[0, 0, 1]: 0.3770836293697357
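
The compare script itself isn't reproduced in this thread; below is a minimal sketch that matches the printed output above. The argument names and indexing are guesses:

import argparse
import numpy as np

# Hypothetical reconstruction of compare_npy.py based on its printed output;
# tolerances follow the (atol=0.1, rtol=0.05) config mentioned above.
parser = argparse.ArgumentParser()
parser.add_argument("-a", required=True)
parser.add_argument("-b", required=True)
args = parser.parse_args()

a = np.load(args.a)  # e.g. vae_output.npy, shape (1, 3, 512, 512)
b = np.load(args.b)  # e.g. vae_golden_output.npy
print(f"all_close: {np.allclose(a, b, atol=0.1, rtol=0.05)}. shape: {a.shape}")
for i, j, k in [(0, 0, 0), (0, 1, 0), (0, 0, 1)]:
    print(f"a[{i}, {j}, {k}]: {a[0, i, j, k]}, b[{i}, {j}, {k}]: {b[0, i, j, k]}")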

I do see numerical issues with winograd only (i.e., without demotion). Unfortunately, that is the tradeoff of using winograd transforms (@Max191 can chime in, as he has more context). I wonder whether this should be judged at the level of integration tests, or at the level of the demo? The vae_golden_output.npy is not a human-friendly picture; if we can still generate a reasonable picture, I think it is fine.

❯ python compare_npy.py -a vae_output.npy -b vae_golden_output.npy
all_close: False. shape: (1, 3, 512, 512)
a[0, 0, 0]: 0.4421127140522003, b[0, 0, 0]: 0.39249464869499207
a[0, 1, 0]: 0.4923735558986664, b[0, 1, 0]: 0.3716789186000824
a[0, 0, 1]: 0.4711473286151886, b[0, 0, 1]: 0.3770836293697357
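
For intuition on where the winograd error comes from, here is a minimal 1-D Winograd F(2,3) sketch in NumPy, using the standard textbook transform matrices (this is an illustration, not IREE's implementation). The transform is exact in real arithmetic, but its extra additions and scaled products round differently in f32 than direct convolution does, and the discrepancy grows with tile size and model depth:

import numpy as np

# Winograd F(2,3) transforms: 2 outputs per tile, 3-tap filter, 4-wide input.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1,    0,   0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0,    0,   1]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

rng = np.random.default_rng(0)
d = rng.standard_normal(4).astype(np.float32)   # input tile
g = rng.standard_normal(3).astype(np.float32)   # filter taps

direct   = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]], dtype=np.float32)
winograd = AT @ ((G @ g) * (BT @ d))

# Tiny but nonzero in f32; larger tiles (e.g. F(6,3)) are worse conditioned.
print("max abs diff:", np.abs(direct - winograd).max())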

> This is the script that I used to compare the results. I don't see numeric issues with bf16 demotion only; the results are close with an (atol=0.1, rtol=0.05) config.

This should be an OK tolerance to go with for VAE, since its results get rounded to the nearest int8.
I mentioned it because we did not get good images with bf16 demotion added: without it, the error is on the order of 2e-5; with it, the error jumps up to around 2e-2.

If we expect that pass to cause such changes in numerics, that's OK, but it does leave us less room for optimizations like Winograd that have a distinct accuracy/perf tradeoff.
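
For intuition on that jump: bf16 keeps only 8 significant bits, so each demoted contraction input carries up to roughly 0.2% relative rounding error before the reduction even starts, versus ~1e-7 for f32. A quick sketch of the effect in plain PyTorch (hand-applied demotion, not the IREE pass itself):

import torch

torch.manual_seed(0)
x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

ref = x @ y  # pure f32 contraction
# Mimic iree-global-opt-demote-contraction-inputs-to-bf16: round the
# contraction inputs to bf16, then accumulate in f32.
demoted = x.bfloat16().float() @ y.bfloat16().float()

rel = ((ref - demoted).abs().max() / ref.abs().max()).item()
print(f"max relative diff: {rel:.2e}")  # typically ~1e-3..1e-2 at this size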