PINTO0309 / PINTO_model_zoo

A repository for storing models that have been inter-converted between various frameworks. Supported frameworks are TensorFlow, PyTorch, ONNX, OpenVINO, TFJS, TFTRT, TensorFlowLite (Float32/16/INT8), EdgeTPU, CoreML.

Home Page: https://qiita.com/PINTO

BodyPix on MacOS - Dilation not supported for AutoPadType::SAME_UPPER or AutoPadType::SAME_LOWER

cansik opened this issue

Issue Type

Bug

OS

Mac OS

OS architecture

aarch64

Programming Language

Python

Framework

ONNX

Model name and Weights/Checkpoints URL

https://github.com/PINTO0309/PINTO_model_zoo/tree/main/035_BodyPix

Description

When running the demo code for the BodyPix model (bodypix_resnet50_stride16_1x3x480x640.onnx), I receive the error message below. I had already tried to get the previously converted models working, but I wasn't able to post-process the part map correctly, so I am looking forward to using the new ONNX-based ones.

ONNX Versions:

onnx==1.15.0
onnxruntime==1.16.3

Relevant Log Output

2024-01-26 10:54:34.340251 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running FusedConv node. Name:'resnet_v1_50/block4/unit_1/bottleneck_v1/conv2/BatchNorm/batchnorm_1/add_1' Status Message: Dilation not supported for AutoPadType::SAME_UPPER or AutoPadType::SAME_LOWER.
Traceback (most recent call last):
  File "/temp/bodypix-single-demo.py", line 754, in <module>
    main()
  File "/temp/bodypix-single-demo.py", line 687, in main
    foreground_mask_zero_or_255, colored_mask_classid, keypoints_classidscorexy = model_bodypix(debug_image)
  File "/temp/bodypix-single-demo.py", line 421, in __call__
    outputs = super().__call__(input_datas=[inferece_image])
  File "/temp/bodypix-single-demo.py", line 162, in __call__
    self._model(
  File "/temp/venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running FusedConv node. Name:'resnet_v1_50/block4/unit_1/bottleneck_v1/conv2/BatchNorm/batchnorm_1/add_1' Status Message: Dilation not supported for AutoPadType::SAME_UPPER or AutoPadType::SAME_LOWER.

URL or source code for simple inference testing code

https://github.com/PINTO0309/PINTO_model_zoo/blob/main/035_BodyPix/demo/demo_bodypix_single_onnx.py

I know; it doesn't work on Linux either, not just macOS.

It doesn't seem to be a problem with the OS, but with the execution provider. I tried it on Windows with the CPU provider and the same error happened. With DirectML or CUDA, there is no problem.

You're right. I had already confirmed that in advance as well. So far, except for the TensorRT provider, the runtime seems to be buggy.
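
One untested idea in the meantime: FusedConv is created by ONNX Runtime's own graph fusions, so disabling graph optimizations might keep the original Conv/Add nodes and sidestep the broken kernel, at some speed cost. A minimal sketch, not verified against this model:

import onnxruntime as ort

# Disabling graph optimizations should prevent the Conv+Add fusion that
# produces the FusedConv node, so the CPU provider falls back to the
# plain Conv kernel instead.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

sess = ort.InferenceSession(
    "bodypix_resnet50_stride16_1x3x480x640.onnx",
    sess_options=so,
    providers=["CPUExecutionProvider"],
)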

OK, I will check whether the OpenVINO runtime has the same issue (it should be possible to drop the ONNX file directly into it); maybe it works with an alternative backend. Anyway, thanks a lot already for the model conversion and the example script you've provided!

It would be really helpful to be able to run BodyPix on various machines, since it's one of the only pretrained body-part segmentation models. The python-tf-bodypix package has become quite difficult to install lately, which is why I wanted to do a clean rewrite based on ONNX or OpenVINO.

That is a great initiative. I've done some miscellaneous ONNX conversions as a hobby and will implement a way to eliminate the above error when I get around to it. It's not very difficult.

I was able to add OpenVINO as an additional runtime. The output seems correct as far as I can tell (somewhat oddly translated, but I experienced that with DirectML as well). It's of course not as fast, but at least it's a solution that runs on any OS (on CPU).

Would you be open to a PR with the OpenVINO runtime and some additional cleanup of the demo script?
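
Roughly, the OpenVINO path looks like the sketch below (file and variable names are placeholders, and the real demo of course keeps its own pre- and post-processing):

import cv2
import numpy as np
from openvino.runtime import Core

# Compile the ONNX file directly for the CPU device; no explicit
# conversion to OpenVINO IR is needed.
core = Core()
compiled = core.compile_model("bodypix_resnet50_stride16_1x3x480x640.onnx", "CPU")

image = cv2.imread("person.jpg")  # placeholder input
blob = cv2.resize(image, (640, 480)).transpose(2, 0, 1)[np.newaxis].astype(np.float32)

# Returns the output tensors in the same order as the ONNX model.
outputs = compiled([blob])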


Affine Transformation and Resize to Original Size

OK, the translation problem was because I did not resize the output maps to the original image size before applying the affine transform. It seems to work now, but I am not sure if it's the correct way. For poses detected further away from the center, I am still getting masks that do not match the original image (they are shifted a bit to the left or right).


I am applying the affine transformation (# Fine-tune position of mask image) and then resizing the output to the original image size. Am I missing something?

I've already added padding for the input image, but the offset is still visible.
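
For reference, this is roughly the order I ended up with (a simplified sketch; all variables are placeholders for what the demo actually computes):

import cv2
import numpy as np

# Placeholder values standing in for what the demo computes.
input_w, input_h = 640, 480            # network input size
original_w, original_h = 1920, 1080    # original image size
affine_matrix = np.float32([[1, 0, 16], [0, 1, 8]])  # example letterbox transform
mask_small = np.zeros((30, 40), np.float32)          # raw stride-16 output map

# 1) Scale the raw output map up to the network input size first ...
mask_input_size = cv2.resize(mask_small, (input_w, input_h), interpolation=cv2.INTER_LINEAR)

# 2) ... then map it back to the original image geometry by inverting
#    the affine transform that was applied to the input image.
inverse_affine = cv2.invertAffineTransform(affine_matrix)
mask_original = cv2.warpAffine(mask_input_size, inverse_affine, (original_w, original_h))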

Unique Keypoints

I also noticed that sometimes too many keypoints are returned from extract_max_score_points_unique. I've added the following lines to extract only the unique class indices, so that a valid pose can always be created.

# Keep only the first occurrence of each keypoint class id
_, unique_indices = np.unique(keypoints_classidscorexy[:, 0], return_index=True)
keypoints_classidscorexy = keypoints_classidscorexy[unique_indices]

Thresholds / Constants

I am just wondering whether it would make sense not to fix the thresholds inside the graph, but to expose them as inputs instead.


Of course it would be possible to use onnx to get the specific node and adjust the value in code, but wouldn't an input make more sense?
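
For completeness, patching a baked-in constant from code would look roughly like this (a sketch; score_threshold is a hypothetical initializer name that would have to be looked up in Netron first):

import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("bodypix_resnet50_stride16_1x3x480x640.onnx")

TARGET = "score_threshold"  # hypothetical name of the threshold initializer

for init in model.graph.initializer:
    if init.name == TARGET:
        # Overwrite the baked-in scalar with a new threshold value.
        init.CopyFrom(numpy_helper.from_array(np.array(0.5, dtype=np.float32), TARGET))

onnx.save(model, "bodypix_custom_threshold.onnx")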

Part Color Overlapping

Maybe related to the threshold issue, but it seems that the colored part map overlaps at the edges and creates a rainbow of parts. Do you have an idea why this is happening?


I am applying the affine transformation (# Fine-tune position of mask image) and then resizing the output to the original image size. Am I missing something?

I have made significant changes to the processing flow to keep the processing to the minimum necessary to optimize the model. What this optimization means is that every computational graph that did not have a fixed shape was recalculated to have a fixed shape. (This is my own tuning technique and may be difficult to understand.)
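
As a rough illustration of the idea (this is not the actual tool chain I use), pinning a dynamic input dimension with plain onnx and re-running shape inference looks like this:

import onnx

model = onnx.load("model_dynamic.onnx")  # placeholder file name

# Pin each dynamic dimension of the first input to a concrete value,
# e.g. 1x3x480x640.
dims = model.graph.input[0].type.tensor_type.shape.dim
for dim, value in zip(dims, [1, 3, 480, 640]):
    dim.ClearField("dim_param")
    dim.dim_value = value

# Re-run shape inference so downstream tensor shapes also become static.
model = onnx.shape_inference.infer_shapes(model)
onnx.save(model, "model_static.onnx")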

I also noticed that sometimes too many keypoints are returned from extract_max_score_points_unique. I've added the following lines to extract only the unique class indices, so that a valid pose can always be created.

I don't think you are wrong. I think my implementation is pretty messy because I was writing the post-processing while optimizing the model and testing single-person inference, and at the same time thinking of ideas to efficiently perform multi-person detection. I agree with your implementation.

Of course it would be possible to use onnx to get the specific node and adjust the value in code, but wouldn't an input make more sense?

Yes, it is. That's the part I was quite torn about, too. The models I have committed are only rarely used by skilled engineers like yourself; it seems much more likely that people are looking for a demo that works quickly and without much thought. That was the only deciding factor. In fact, all of the models I use in my own research are processed so that the thresholds can be supplied externally.

Maybe related to the threshold issue, but it seems that the colored part map overlaps at the edges and creates a rainbow of parts. Do you have an idea why this is happening?

I think the Resize (Bilinear) and Sigmoid near the last layer, and the part of the mask generation that forces the division by 255 and the resulting 0 or 1 to be processed as a boolean, may be the cause of such an error. I haven't looked into it in much detail, but I can't deny that such discrepancies exist, especially since the post-processing part is implemented in a very forced manner. With a stride of 16, a very small vertical and horizontal ROI is linearly stretched by a factor of 16; the design is that this is compensated for by an offset, but that does not seem to be working. In Google's design this offset was an output of the model, but in my design it is calculated and embedded in the model.
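
A toy example of how bilinear upscaling alone can already produce such a band between two parts (an illustration only, not the model's exact post-processing):

import cv2
import numpy as np

# Three part channels over a tiny 1x2 score map: the left pixel belongs
# to part 0, the right pixel to part 2, and part 1 has a uniform mid
# score of 0.6 everywhere.
scores = np.array([
    [[1.0, 0.0]],  # part 0
    [[0.6, 0.6]],  # part 1
    [[0.0, 1.0]],  # part 2
], dtype=np.float32)

# Stretch each channel 16x horizontally with bilinear interpolation,
# mimicking how a stride-16 ROI is linearly upscaled.
up = np.stack([cv2.resize(c, (32, 1), interpolation=cv2.INTER_LINEAR) for c in scores])

# In the middle the interpolated scores of parts 0 and 2 both fall below
# 0.6, so part 1 wins the argmax and a spurious band appears between them.
print(np.argmax(up, axis=0)[0])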

Both Google's official implementation from five years ago and the tf-bodypix repository I cited originally handled all of the post-processing programmatically. However, that post-processing was designed in a way that prevented GPUs and accelerators from being used effectively. In other words, my post-processing is designed to be as hardware-efficient as possible in exchange for allowing a small amount of accuracy degradation.