ONNX "resize" op test failures
ScottTodd opened this issue
What happened?
#17330 updates our LLVM and torch-mlir commits, pulling in llvm/torch-mlir#3013. Some tests are newly passing, many tests are still failing at some stage (compiler errors or runtime numerics), and a few tests are hanging on certain platforms.
At least CUDA is hanging on test_resize_downsample_scales_linear:
https://github.com/iree-org/iree/actions/runs/9034897378/job/24828864270?pr=17330#step:9:1813
I can't reproduce that on Windows though.
Steps to reproduce your issue
Generally follow the instructions at https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests and pull the config files from this repo.
For example, to run on CUDA:
pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_cuda.json \
  --ignore-xfails
or Vulkan:
pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_vulkan.json \
  --ignore-xfails
Note that Vulkan fails to compile with `model.mlir:4:10: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal`.
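To take pytest out of the loop, the compile and run steps can also be invoked directly on a single test case. A rough sketch, assuming the test case's model.mlir from SHARK-TestSuite; the input/output .npy file names and the `main` function name are assumptions, so inspect the actual test directory and .mlir to confirm them:

```shell
# Compile the imported ONNX model; use vulkan-spirv here to reproduce the
# legalization error in isolation, or cuda for the hanging configuration.
iree-compile model.mlir \
  --iree-input-type=onnx \
  --iree-hal-target-backends=cuda \
  -o model.vmfb

# Run with the test's input and compare against the golden output.
# (Function name assumed; check the .mlir for the exported function.)
iree-run-module \
  --device=cuda \
  --module=model.vmfb \
  --function=main \
  --input=@input_0.npy \
  --expected_output=@output_0.npy
```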
What component(s) does this issue relate to?
Frontends, Compiler, Runtime
Version information
No response
Additional context
No response
FYI @AmosLewis this is the reason why llvm/torch-mlir#3013 was ultimately dropped from the integrate #17330.
Will you start a new PR to bump it next? Do you have any idea whether it's a torch-mlir bug or an IREE bug?
- I suspect the Vulkan `failed to legalize operation 'arith.fptosi'` error is in upstream MLIR SPIR-V (a missing lowering)
- Numerical errors in tests could be issues in the torch-mlir lowerings
- CUDA hang ... no idea. I couldn't get much from the CI logs and couldn't reproduce on Windows. If compilation succeeded and the hang happened at runtime, it could be a miscompile (torch-mlir lowering) or a runtime issue (IREE CUDA HAL); a triage sketch follows this list.
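One way to narrow the hang down (a sketch under the same path and function-name assumptions as above, not something I've run on this case): compile ahead of time so a compiler hang shows up on its own, then run the module under a timeout so a runtime hang fails fast instead of stalling.

```shell
# If compilation itself hangs, the problem is in the compiler.
iree-compile model.mlir \
  --iree-input-type=onnx \
  --iree-hal-target-backends=cuda \
  -o model.vmfb

# If this step hangs instead, suspect the runtime (IREE CUDA HAL) or a
# miscompile that produced non-terminating device code.
timeout 60 iree-run-module \
  --device=cuda \
  --module=model.vmfb \
  --function=main \
  --input=@input_0.npy
```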
nod-ai/SHARK-Turbine#616: the model and the failing resize MLIR are listed in the description.
> Will you start a new PR to bump it next?
I don't plan to do it myself. We have an integration rotation schedule, and this week's integrates were already done out of schedule :-)
We have a separate rotation for updating torch-mlir (in fact, @AmosLewis is up for next week 🤔). LLVM and torch-mlir are usually updated separately but needed to be bumped together in this case.