ONNX "resize" op test failures

ScottTodd opened this issue a month ago

What happened?

#17330 updates our LLVM and torch-mlir commits, pulling in llvm/torch-mlir#3013. Some tests are newly passing, many tests are still failing somewhere (compiler, runtime numerics), and a few tests are hanging on certain platforms.

At least CUDA is hanging on test_resize_downsample_scales_linear:
https://github.com/iree-org/iree/actions/runs/9034897378/job/24828864270?pr=17330#step:9:1813
I can't reproduce that on Windows though.

Steps to reproduce your issue

Generally follow the instructions at https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests and pull the config files from this repo.

For example, to run on CUDA:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_cuda.json \
  --ignore-xfails

or Vulkan:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_vulkan.json \
  --ignore-xfails
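Since one of the symptoms is a hang, it can help to wrap the invocation in a hard timeout so a stuck test is reported instead of blocking the shell. The following is a minimal sketch, not part of the test suite: the helper names `build_resize_test_command` and `run_with_timeout` are hypothetical, it launches pytest through `python -m pytest` rather than the bare `pytest` command, and the config path is whatever you pass in.

```python
import subprocess
import sys


def build_resize_test_command(config_file):
    """Return the pytest invocation from the steps above as an argument list.

    Hypothetical helper: the flags mirror the commands in this issue; the
    config file path is supplied by the caller.
    """
    return [
        sys.executable, "-m", "pytest", "onnx/",
        "-k", "test_resize",
        "-rA",
        f"--config-files={config_file}",
        "--ignore-xfails",
    ]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child process when the timeout expires, so a
    hanging test run does not linger in the background.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

Run from the iree_tests directory, something like `run_with_timeout(build_resize_test_command(path_to_config), 600)` would return `None` if the run is still going after ten minutes; the 600 second limit is an arbitrary choice.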

Config	Logs
CPU	https://gist.github.com/ScottTodd/0778165b2d31a54bfefbb9fa2b2662d6
CUDA	https://gist.github.com/ScottTodd/dd34be6577da489f3d5b6b0a0a65ed0d
Vulkan	https://gist.github.com/ScottTodd/b2f509585bee804ebd900e2144258241

Note that the Vulkan run fails with: model.mlir:4:10: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal

What component(s) does this issue relate to?

Frontends, Compiler, Runtime

Version information

No response

Additional context

No response

Chi_Liu commented a month ago

#17358

Benoit Jacob · Answer 1 · Sat May 11 2024 00:47:43 GMT+0800 (China Standard Time)

FYI @AmosLewis this is the reason why llvm/torch-mlir#3013 was ultimately dropped from the integrate #17330.

Chi_Liu · Answer 2 · Sat May 11 2024 01:07:16 GMT+0800 (China Standard Time)

> FYI @AmosLewis this is the reason why llvm/torch-mlir#3013 was ultimately dropped from the integrate #17330.

Will you start a new PR to bump it next? Do you have any idea whether this is a torch-mlir bug or an IREE bug?

Scott Todd · Answer 3 · Sat May 11 2024 01:12:58 GMT+0800 (China Standard Time)

- I suspect the Vulkan "failed to legalize operation 'arith.fptosi'" error comes from upstream MLIR SPIR-V (a missing lowering).
- Numerical errors in tests could be issues in the torch-mlir lowerings.
- CUDA hang: no idea. I couldn't get much from the CI logs and couldn't reproduce on Windows. It may be a miscompile (torch-mlir lowering) or, if compilation succeeded and the hang happened at runtime, a runtime issue (IREE CUDA HAL).
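For triaging the numerical errors, a small reference computation is handy for comparing against what a backend produces. The sketch below is an independent pure-Python approximation of linear resize on a single 2D plane, not the ONNX reference implementation or the torch-mlir lowering. It assumes the `half_pixel` coordinate transformation, an output size of floor(input size * scale), and clamping of source coordinates to the image border. The function name `resize_linear_2d` is made up for illustration.

```python
import math


def resize_linear_2d(image, scale_h, scale_w):
    """Bilinear resize of a 2D list of floats (assumed half_pixel mode)."""
    in_h, in_w = len(image), len(image[0])
    # Assumption: output size is floor(input size * scale).
    out_h = math.floor(in_h * scale_h)
    out_w = math.floor(in_w * scale_w)

    def source_coord(dst, scale, size):
        # half_pixel mapping from output index to input coordinate,
        # clamped to the valid range (assumed border handling).
        src = (dst + 0.5) / scale - 0.5
        return min(max(src, 0.0), size - 1)

    def lerp_indices(src, size):
        # Neighbouring integer indices and the fractional weight of the upper one.
        lo = int(math.floor(src))
        hi = min(lo + 1, size - 1)
        return lo, hi, src - lo

    out = []
    for y in range(out_h):
        y0, y1, fy = lerp_indices(source_coord(y, scale_h, in_h), in_h)
        row = []
        for x in range(out_w):
            x0, x1, fx = lerp_indices(source_coord(x, scale_w, in_w), in_w)
            # Interpolate along the width first, then along the height.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

As a usage example, `resize_linear_2d([[1, 2, 3, 4], [5, 6, 7, 8]], 0.6, 0.6)` yields a 1x2 result of roughly 2.667 and 4.333. That input and scale are my recollection of what test_resize_downsample_scales_linear uses, so treat them as an assumption and check the test's own input files before relying on the comparison.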

Chi_Liu · Answer 4 · Sat May 11 2024 01:28:56 GMT+0800 (China Standard Time)

nod-ai/SHARK-Turbine#616: the model and the failing resize MLIR are listed in the description.

Benoit Jacob · Answer 5 · Sat May 11 2024 01:41:04 GMT+0800 (China Standard Time)

> Will you start a new PR to bump it next?

I don't plan to do it myself. We have an integration rotation schedule and the integrates of this week were already done out-of-schedule :-)

Scott Todd · Answer 6 · Sat May 11 2024 01:43:51 GMT+0800 (China Standard Time)

We have a separate rotation for updating torch-mlir (in fact, @AmosLewis is up for next week 🤔). They are usually updated separately but needed to be updated together in this case.