iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

Home Page: http://iree.dev/

ONNX "resize" op test failures

ScottTodd opened this issue

What happened?

#17330 updates our LLVM and torch-mlir commits, pulling in llvm/torch-mlir#3013. Some tests are newly passing, many tests are still failing somewhere (compiler, runtime numerics), and a few tests are hanging on certain platforms.

At least CUDA is hanging on test_resize_downsample_scales_linear:
https://github.com/iree-org/iree/actions/runs/9034897378/job/24828864270?pr=17330#step:9:1813
I can't reproduce that on Windows though.
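When trying to reproduce a hang like this locally, it may help to run the test command under a timeout so that a stuck test is reported instead of blocking the whole run. The following is a minimal sketch, not part of IREE or the test suite: `run_with_timeout` is a hypothetical helper, and the stand-in command would be replaced with the actual pytest invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'pass', 'fail', or 'hang'.

    Hypothetical helper for local triage; not an IREE or SHARK-TestSuite API.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising on timeout.
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


if __name__ == "__main__":
    # Stand-in command; substitute the real test command as a list of args.
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

Only the direct child process is killed on timeout, so a test runner that spawns its own subprocesses may need additional cleanup.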

Steps to reproduce your issue

Generally follow the instructions at https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests and pull the config files from this repo.

For example, to run on CUDA:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_cuda.json \
  --ignore-xfails

or Vulkan:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_vulkan.json \
  --ignore-xfails

Config   Logs
CPU      https://gist.github.com/ScottTodd/0778165b2d31a54bfefbb9fa2b2662d6
CUDA     https://gist.github.com/ScottTodd/dd34be6577da489f3d5b6b0a0a65ed0d
Vulkan   https://gist.github.com/ScottTodd/b2f509585bee804ebd900e2144258241

Note that the Vulkan run reports this error: model.mlir:4:10: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal
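To check whether other logs contain the same kind of failure, diagnostics of this shape can be extracted with a regular expression. This is a rough sketch that assumes each diagnostic sits on its own line in the `<file>:<line>:<col>: error: ...` form quoted above; `find_legalization_failures` is a hypothetical helper, not an IREE tool.

```python
import re

# Matches diagnostics shaped like the one quoted in this issue:
#   <file>:<line>:<col>: error: failed to legalize operation '<op>' ...
LEGALIZE_ERROR = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+):(?P<col>\d+): error: "
    r"failed to legalize operation '(?P<op>[^']+)'"
)


def find_legalization_failures(log_text):
    """Return (file, line, col, op) tuples for each matching log line."""
    failures = []
    for raw in log_text.splitlines():
        match = LEGALIZE_ERROR.search(raw.strip())
        if match:
            failures.append(
                (
                    match["file"],
                    int(match["line"]),
                    int(match["col"]),
                    match["op"],
                )
            )
    return failures


sample = (
    "model.mlir:4:10: error: failed to legalize operation "
    "'arith.fptosi' that was explicitly marked illegal"
)
print(find_legalization_failures(sample))
# prints [('model.mlir', 4, 10, 'arith.fptosi')]
```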

What component(s) does this issue relate to?

Frontends, Compiler, Runtime

Version information

No response

Additional context

No response

FYI @AmosLewis this is the reason why llvm/torch-mlir#3013 was ultimately dropped from the integrate #17330.

Will you start a new PR to bump it next? Do you have any idea whether it is a torch-mlir bug or an IREE bug?

  • I suspect the Vulkan "failed to legalize operation 'arith.fptosi'" error is an upstream MLIR SPIR-V issue (missing lowering)
  • Numerical errors in tests could be issues in the torch-mlir lowerings
  • CUDA hang: no idea. I couldn't get much from the CI logs and couldn't reproduce it on Windows. It may be a miscompile (torch-mlir lowering) or, if compilation succeeded and the hang happened at runtime, a runtime issue (IREE CUDA HAL).

See nod-ai/SHARK-Turbine#616: the model and the failing resize MLIR are listed in its description.

Will you start a new PR to bump it next?

I don't plan to do it myself. We have an integration rotation schedule, and this week's integrates were already done out of schedule :-)

We have a separate rotation for updating torch-mlir (in fact, @AmosLewis is up for next week 🤔). They are usually updated separately but needed to be updated together in this case.