Runtime hangs during inference of a YOLOv5 model compiled from ONNX
maximiliankir opened this issue · comments
What happened?
When running inference of a YOLOv5 model that has been imported from ONNX and compiled for the CUDA backend, the execution never terminates; it hangs on the first input image.
CPU usage is ~0%, so I suspect the execution is stuck somewhere rather than busy.
Steps to reproduce your issue
Imported with iree-import-onnx yolov5s.onnx -o yolov5_onnx.mlir
Compiled with iree-compile --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_87 yolov5_onnx.mlir -o yolov5_onnx_cuda.vmfb
Used the runtime in Python with:
...
# Load flatbuffer of yolo model from file
with open(iree_fb_path, "rb") as f:
    flatbuffer = f.read()
gpu_device = ireert.get_device("cuda")
config = ireert.Config(device=gpu_device)
# TODO Gives warning about unsafe copy of unaligned VmModule buffer
yolo_module = ireert.VmModule.from_flatbuffer(config.vm_instance, flatbuffer)
modules = config.default_vm_modules + (yolo_module,)
context = ireert.SystemContext(vm_modules=modules, config=config)
invoker = context.modules.module["torch_jit"]
batch = ireert.asdevicearray(gpu_device, preprocessed_img)
result = invoker(batch)
...
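`preprocessed_img` isn't shown in the report; for completeness, here is a minimal NumPy-only sketch of typical YOLOv5 preprocessing (letterbox to a 640×640 square, scale to [0, 1], HWC→NCHW with a batch dimension). The input shape and the 640 target size are assumptions, not taken from the original report:

```python
import numpy as np

def preprocess(img: np.ndarray, size: int = 640) -> np.ndarray:
    """Minimal YOLOv5-style preprocessing: letterbox to a square,
    normalize to [0, 1], and reorder HWC -> NCHW with a batch dim."""
    h, w, _ = img.shape
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # gray padding
    scale = min(size / h, size / w)
    nh, nw = int(h * scale), int(w * scale)
    # Nearest-neighbor resize via index arrays (avoids an OpenCV dependency).
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    canvas[:nh, :nw] = img[ys][:, xs]
    x = canvas.astype(np.float32) / 255.0   # normalize to [0, 1]
    return x.transpose(2, 0, 1)[None, ...]  # HWC -> 1x3xHxW

preprocessed_img = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(preprocessed_img.shape)  # (1, 3, 640, 640)
```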
What component(s) does this issue relate to?
No response
Version information
candidate-20240605.915
Additional context
It works when the model is compiled via the tf-importer from a TensorFlow SavedModel. The TF model uses FP32, the ONNX model FP16. That's one of the reasons why I want to import from the ONNX model.
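One thing worth double-checking given the FP32/FP16 difference: the host-side buffer passed to `ireert.asdevicearray` has to be float16 for the ONNX path, or the input element type won't match the compiled function's signature. A minimal NumPy sketch (the variable names are placeholders, not from the original snippet):

```python
import numpy as np

# FP32 output of the preprocessing stage (placeholder data).
preprocessed_img = np.zeros((1, 3, 640, 640), dtype=np.float32)

# Cast to FP16 to match the ONNX model's expected input element type
# before uploading the buffer to the CUDA device.
fp16_input = preprocessed_img.astype(np.float16)
print(fp16_input.dtype)  # float16
```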
I attached a snippet of the MLIR file produced by the ONNX importer (without weights).
I need some advice on how to debug this further. How can I find the part where the execution gets stuck?
It seems to be the combination of ONNX and CUDA. ONNX frontend with CPU works. TF frontend and CUDA works too.
Could be related to #17376. @ScottTodd Can you tell me what you learned there?
Sounds like #16666 is showing up outside of the individual op tests. Not sure, needs further debugging through the stack.
I ran the model with --trace_execution. It seems to get stuck on dealloca of the CUDA device.
The trace is attached.
onnx_cuda_trace.log