[YOLOX-TI] ERROR: onnx_op_name: /head/ScatterND

Question

mikel-brostrom opened this issue a year ago · comments

Mike commented a year ago

Issue Type

Others

onnx2tf version number

1.8.1

onnx version number

1.13.1

tensorflow version number

2.12.0

Download URL for ONNX

yolox_nano_ti_lite_26p1_41p8.zip

Parameter Replacement JSON

{
    "format_version": 1,
    "operations": [
        {
            "op_name": "/head/ScatterND",
            "param_target": "inputs",
            "param_name": "/head/Concat_1_output_0",
            "values": [1,85,52,52]
        }
    ]
}

Description

Hi @PINTO0309. After our lengthy discussion regarding INT8 YOLOX export I decided to try out TI's version of these models (https://github.com/TexasInstruments/edgeai-yolox/tree/main/pretrained_models). It looked to me that you managed to INT8-export those, so maybe you could provide some hints 😄. I just downloaded the ONNX version of YOLOX-nano. For this model, the following fails:

onnx2tf -i ./yolox_nano.onnx -o yolox_nano_saved_model

The error I get:

ERROR: input_onnx_file_path: /datadrive/mikel/edgeai-yolox/yolox_nano.onnx
ERROR: onnx_op_name: /head/ScatterND
ERROR: Read this and deal with it. https://github.com/PINTO0309/onnx2tf#parameter-replacement
ERROR: Alternatively, if the input OP has a dynamic dimension, use the -b or -ois option to rewrite it to a static shape and try again.
ERROR: If the input OP of ONNX before conversion is NHWC or an irregular channel arrangement other than NCHW, use the -kt or -kat option.
ERROR: Also, for models that include NonMaxSuppression in the post-processing, try the -onwdt option.

Research
Export error
I tried to override the parameter values with the replacement JSON provided above, with no luck
Project need
Operation that fails can be found in the image below:

Mike commented a year ago

Get it!

Katsuya Hyodo · Answer 1 · Fri Mar 24 2023 18:03:15 GMT+0800 (China Standard Time)

Knowing that TI's model is rather verbose, I optimized it independently and created a script to replace all ScatterND with Slice.

https://github.com/PINTO0309/PINTO_model_zoo/tree/main/363_YOLO-6D-Pose

Mike · Answer 2 · Fri Mar 24 2023 18:04:38 GMT+0800 (China Standard Time)

Thank you for your quick response

Katsuya Hyodo · Answer 3 · Fri Mar 24 2023 18:08:11 GMT+0800 (China Standard Time)

I will be home with my parents today, tomorrow, and the day after, so I will not be able to provide detailed testing or assistance.

Mike · Answer 4 · Fri Mar 24 2023 18:11:10 GMT+0800 (China Standard Time)

Thanks for the heads up! Testing this on my own on a detection model, not on pose. Let's see if I manage to get it working. The eval result on both models is as follows:

	YOLOX nano ONNX	YOLOX-Ti nano ONNX
mAP@0.5:0.95	0.256	0.261
mAP@0.5	0.411	0.418

Mike · Answer 5 · Fri Mar 24 2023 19:10:18 GMT+0800 (China Standard Time)

Ok. As I didn't see ScatterND in the original model, I checked what the differences were. I found out that this

def meshgrid(*tensors):
    if _TORCH_VER >= [1, 10]:
        return torch.meshgrid(*tensors, indexing="ij")
    else:
        return torch.meshgrid(*tensors)
 

def decode_outputs(self, outputs, dtype):
        grids = []
        strides = []
        for (hsize, wsize), stride in zip(self.hw, self.strides):
            yv, xv = meshgrid([torch.arange(hsize), torch.arange(wsize)])
            grid = torch.stack((xv, yv), 2).view(1, -1, 2)
            grids.append(grid)
            shape = grid.shape[:2]
            strides.append(torch.full((*shape, 1), stride))
 
        grids = torch.cat(grids, dim=1).type(dtype)
        strides = torch.cat(strides, dim=1).type(dtype)
 
        outputs = torch.cat([
            (outputs[..., 0:2] + grids) * strides,
            torch.exp(outputs[..., 2:4]) * strides,
            outputs[..., 4:]
        ], dim=-1)
        return outputs

gives:

While this:

def decode_outputs(self, outputs, dtype):
        grids = []
        strides = []
        for (hsize, wsize), stride in zip(self.hw, self.strides):
            yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
            grid = torch.stack((xv, yv), 2).view(1, -1, 2)
            grids.append(grid)
            shape = grid.shape[:2]
            strides.append(torch.full((*shape, 1), stride))
 
        grids = torch.cat(grids, dim=1).type(dtype)
        strides = torch.cat(strides, dim=1).type(dtype)
 
        outputs[..., :2] = (outputs[..., :2] + grids) * strides
        outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides
        return outputs

gives:

This as well as some other minor fixes make it possible to get rid of ScatterND completely.
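To make the comparison above concrete, here is a minimal sketch that checks the two decode variants are numerically equivalent. It uses NumPy as a stand-in for the PyTorch code (an assumption made so the snippet runs without torch); the helper names make_grids, decode_inplace and decode_concat are hypothetical. It says nothing about the ONNX export itself, only that swapping the in-place slice assignment for a concatenation should not change the decoded values.

```python
import numpy as np

def make_grids(hw, strides):
    """Build flattened (x, y) grid offsets and a per-anchor stride column."""
    grids, stride_list = [], []
    for (hsize, wsize), stride in zip(hw, strides):
        yv, xv = np.meshgrid(np.arange(hsize), np.arange(wsize), indexing="ij")
        grid = np.stack((xv, yv), axis=2).reshape(1, -1, 2)
        grids.append(grid)
        stride_list.append(np.full((1, grid.shape[1], 1), stride))
    grids = np.concatenate(grids, axis=1).astype(np.float32)
    stride_t = np.concatenate(stride_list, axis=1).astype(np.float32)
    return grids, stride_t

def decode_inplace(outputs, grids, strides):
    """Variant with slice assignment (the pattern that showed up as ScatterND)."""
    out = outputs.copy()
    out[..., :2] = (out[..., :2] + grids) * strides
    out[..., 2:4] = np.exp(out[..., 2:4]) * strides
    return out

def decode_concat(outputs, grids, strides):
    """Variant that builds a new tensor by concatenating the decoded slices."""
    return np.concatenate([
        (outputs[..., 0:2] + grids) * strides,
        np.exp(outputs[..., 2:4]) * strides,
        outputs[..., 4:],
    ], axis=-1)

# Feature map sizes and strides for a 416x416 input (assumed values).
hw = [(52, 52), (26, 26), (13, 13)]
strides = [8, 16, 32]
grids, stride_t = make_grids(hw, strides)

rng = np.random.default_rng(0)
raw = rng.normal(size=(1, grids.shape[1], 85)).astype(np.float32)

a = decode_inplace(raw, grids, stride_t)
b = decode_concat(raw, grids, stride_t)
print("anchors:", grids.shape[1], "equal:", np.allclose(a, b))
```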

Katsuya Hyodo · Answer 6 · Fri Mar 24 2023 19:59:24 GMT+0800 (China Standard Time)

Excellent.

Perhaps the overall size of the model should be significantly smaller. 64-bit index values are almost always overly precise. However, since the computational efficiency of Gather and Scatter is supposed to be high to begin with, I am concerned about how much the inference performance will deteriorate after the change to Slice.

Mike · Answer 7 · Fri Mar 24 2023 20:36:26 GMT+0800 (China Standard Time)

The model performance did not decrease after the changes and for the first time I got results on one of the quantized models (dynamic_range_quant).

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size
YOLOX-TI-nano ONNX (original model)	416	0.261	0.418	8.7M
YOLOX-TI-nano ONNX (no ScatterND)	416	0.261	0.418	8.7M
YOLOX-nano TFLite FP16	416	0.261	0.418	4.4M
YOLOX-nano TFLite FP32	416	0.261	0.418	8.7M
YOLOX-nano TFLite full_integer_quant	416	0	0	2.3M
YOLOX-nano TFLite dynamic_range_quant	416	0.249	0.410	2.3M
YOLOX-nano TFLite integer_quant	416	0	0	2.3M

But still nothing for the INT ones though...

Mike · Answer 8 · Fri Mar 24 2023 20:37:20 GMT+0800 (China Standard Time)

Feel free to play around with it

yolox_nano_no_scatternd.zip

😄

Katsuya Hyodo · Answer 9 · Fri Mar 24 2023 20:42:28 GMT+0800 (China Standard Time)

I can't see the structure of the model today, but I believe there were a couple of Sigmoid at the beginning of the post-processing.

What if the model transformation is stopped just before post-processing? However, it is difficult to measure mAP.

e.g.

onnx2tf -i resnet18-v1-7.onnx \
-onimc resnetv15_stage2_conv1_fwd resnetv15_stage2_conv2_fwd

It's an interesting topic and I'd like to try it myself, but I can't easily try it right now.

Mike · Answer 10 · Fri Mar 24 2023 20:57:15 GMT+0800 (China Standard Time)

You are right @PINTO0309 . I missed this:

output = torch.cat(
    [reg_output, obj_output.sigmoid(), cls_output.sigmoid()], 1
)

which in the ONNX model is represented as:

then in the TFLite models these Sigmoid ops are converted into Logistic:

But why is the dynamic range quantized model working and not the rest of the quantized models?

Katsuya Hyodo · Answer 11 · Fri Mar 24 2023 21:08:27 GMT+0800 (China Standard Time)

If I remember correctly, dynamic range is less prone to accuracy degradation because it recalculates the quantization range each time; compared to INT8 full quantization, the inference speed would have been very slow in exchange for maintaining accuracy.

I may be wrong because I do not have an accurate grasp of recent quantization specifications.

By the way,
Sigmoid = Logistic

Mike · Answer 12 · Fri Mar 24 2023 23:33:59 GMT+0800 (China Standard Time)

Maybe a bit off topic. Anyway, I am using the official TFLite benchmark tool for the exported models, and on the specific Android device I am running this on, the Float32 model is much faster than the dynamically quantized one.

Mike · Answer 13 · Fri Mar 24 2023 23:38:38 GMT+0800 (China Standard Time)

People are getting the same quantization problems with YOLOv8 ultralytics/ultralytics#1447:
full_integer_quant and integer_quant do not work. dynamic_range_quant works but it is very slow

Mike · Answer 14 · Sat Mar 25 2023 00:12:32 GMT+0800 (China Standard Time)

But then I guess that the only option we have is to perform the sigmoid operation outside the model...

Motoki Kimura · Answer 15 · Sat Mar 25 2023 00:34:04 GMT+0800 (China Standard Time)

@mikel-brostrom
As for the accuracy degradation of YOLOX integer quantization, I think it may be due to the distribution mismatch of xywh and score values.

Just before the last Concat, xywh seems to have a distribution of (min, max)~(0.0, 416.0). On the other hand, scores have a much narrower distribution of (min, max) = (0.0, 1.0) because of sigmoid.

In TFLite quantization, activations are quantized per tensor. That is, after the Concat, the combined range of xywh and scores, (min, max) = (0.0, 416.0), is mapped to integer values (min, max) = (0, 255). As a result, even a score of 1.0 is mapped after quantization to int(1.0 / 416 * 255) = int(0.61) = 0, so all scores become zero!

A possible solution is to divide xywh tensors by the image size (416) to keep it in the range (min, max) ~ (0.0, 1.0) and then concat with the score tensor so that scores are not "collapsed" due to the per-tensor quantization.

The same workaround is done in YOLOv5:
https://github.com/ultralytics/yolov5/blob/b96f35ce75effc96f1a20efddd836fa17501b4f5/models/tf.py#L307-L310
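The arithmetic above can be reproduced with a small NumPy sketch. It is a simplified illustration, not TFLite's actual quantizer: it assumes a single shared range per tensor and truncation, as in the int(...) expression above. The helper quantize_per_tensor and the sample values are hypothetical.

```python
import numpy as np

IMG_SIZE = 416  # input resolution used in this thread
LEVELS = 255    # number of steps in an 8-bit range

def quantize_per_tensor(x, x_min, x_max):
    """Map [x_min, x_max] onto integers 0..255 using one range for the whole tensor."""
    return np.floor((x - x_min) * LEVELS / (x_max - x_min)).astype(np.int32)

xywh = np.array([208.0, 104.0, 50.0, 416.0], dtype=np.float32)   # pixel units
scores = np.array([0.1, 0.5, 0.9, 1.0], dtype=np.float32)        # after sigmoid

# Case 1: concat raw xywh with scores, shared range (0.0, 416.0).
merged_raw = np.concatenate([xywh, scores])
q_raw = quantize_per_tensor(merged_raw, 0.0, float(IMG_SIZE))
scores_q_raw = q_raw[4:]

# Case 2: divide xywh by the image size first, shared range (0.0, 1.0).
merged_norm = np.concatenate([xywh / IMG_SIZE, scores])
q_norm = quantize_per_tensor(merged_norm, 0.0, 1.0)
scores_q_norm = q_norm[4:]

# At prediction time, xywh has to be scaled back to pixels.
xywh_restored = q_norm[:4] / LEVELS * IMG_SIZE

print("scores with raw xywh:       ", scores_q_raw)
print("scores with normalized xywh:", scores_q_norm)
print("restored xywh (pixels):     ", xywh_restored)
```

With the raw concat every score falls into the same bin, while the normalized concat keeps the scores distinguishable at the cost of a coordinate error of at most about 416 / 255 pixels.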

Mike · Answer 16 · Sat Mar 25 2023 00:38:47 GMT+0800 (China Standard Time)

This was super helpful @motokimura! Will try this out

Motoki Kimura · Answer 17 · Sat Mar 25 2023 00:40:57 GMT+0800 (China Standard Time)

I hope this helps..
When you try this workaround, do not forget to multiply xywh tensors by 416 in the prediction phase!

Mike · Answer 18 · Sat Mar 25 2023 01:31:39 GMT+0800 (China Standard Time)

No change on the INT8 models @motokimura after implementing what you suggested... Still the same results for all the TFLite models, so the problem may primarily be in an operation or set of operations

Motoki Kimura · Answer 19 · Sat Mar 25 2023 01:56:21 GMT+0800 (China Standard Time)

hmm..
As PINTO pointed out, it may be better to compare int8 and float model activations before the decoder part.

#269 (comment)

It may be helpful to export onnx without '--export-det' option and compare the int8 and float outputs.

Katsuya Hyodo · Answer 20 · Sat Mar 25 2023 09:34:16 GMT+0800 (China Standard Time)

Anyway, I am using the official TFLite benchmark tool for the exported models, and on the specific Android device I am running this on, the Float32 model is much faster than the dynamically quantized one.

First, let me tell you that your results will vary greatly depending on the architecture of the CPU you are using for your verification. If you are using an Intel x64 (x86) or AMD x64 (x86) CPU, the Float32 model should run inference about 10 times faster than the INT8 model. INT8 models are very slow on the x64 architecture. On a Raspberry Pi's ARM64 CPU with 4 threads, the INT8 model would perhaps be about 10 times faster. The keyword XNNPACK is a good way to search for information. In the case of Intel's x64 architecture, CPUs of the 10th generation or later differ from CPUs of the 9th generation or earlier in whether they have an optimization mechanism for integer processing. If you are using a 10th generation or later CPU, it should run about 20% faster.

Therefore, when benchmarking using benchmarking tools, it is recommended to try to do so on ARM64 devices.

The benchmarking in the discussion on the ultralytics thread is not appropriate.

Next, let's look at dynamic range quantization.
My tool does per-channel quantization by default. This is due to the TFLiteConverter specification. Per-channel quantization calculates the quantization range for each channel of the tensor, which reduces the accuracy degradation but also increases the cost of calculating the quantization range, which slows down inference a little. Also, most current edge devices are not optimized for per-channel quantization. For example, EdgeTPU only supports per-tensor quantization. Therefore, if you quantize with the assumption that the model will be put to practical use in the future, it is recommended to use per-tensor quantization during the conversion, as follows.

onnx2tf -i xxxx.onnx -oiqt -qt per-tensor

per-channel quant
per-tensor quant
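As a rough illustration of the accuracy trade-off between the two modes, the sketch below quantizes a synthetic weight matrix with one scale for the whole tensor and with one scale per output channel. This is a generic symmetric int8 example under assumed data, not what TFLiteConverter or onnx2tf does internally; quantize_symmetric is a hypothetical helper.

```python
import numpy as np

def quantize_symmetric(w, scale):
    """Round to int8 steps of size `scale`, clip to [-127, 127], map back to float."""
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)

# Four output channels whose weights differ in magnitude by orders of magnitude.
magnitudes = np.array([0.01, 0.1, 1.0, 10.0]).reshape(4, 1)
weights = rng.uniform(-1.0, 1.0, size=(4, 64)) * magnitudes

# Per-tensor: a single scale derived from the largest absolute value overall.
per_tensor_scale = np.abs(weights).max() / 127.0
# Per-channel: one scale per output channel (row).
per_channel_scale = np.abs(weights).max(axis=1, keepdims=True) / 127.0

err_tensor = np.abs(quantize_symmetric(weights, per_tensor_scale) - weights).max(axis=1)
err_channel = np.abs(quantize_symmetric(weights, per_channel_scale) - weights).max(axis=1)

for ch in range(4):
    print(f"channel {ch}: per-tensor err={err_tensor[ch]:.6f}  per-channel err={err_channel[ch]:.6f}")
```

The small-magnitude channels lose almost all resolution under the shared scale, which is the accuracy cost that per-tensor quantization accepts in exchange for wider hardware support.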

Next, we discuss post-quantization accuracy degradation. I think motoki's point is mostly correct. I think you should first try to split the model at the red line and see how the accuracy changes.

If the Sigmoid in this position does not affect the accuracy, it should work. It is better to think about complex problems by breaking them down into smaller problems without being too hasty.

Mike · Answer 21 · Sat Mar 25 2023 23:20:46 GMT+0800 (China Standard Time)

I just cut the model at the point you suggested by:

onnx2tf -i /datadrive/mikel/yolox_tflite_export/yolox_nano.onnx -b 1 -cotof -cotoa 1e-1 -onimc /head/Concat_6_output_0

But I get the following error:

File "/datadrive/mikel/yolox_tflite_export/env/lib/python3.8/site-packages/onnx2tf/utils/common_functions.py", line 3071, in onnx_tf_tensor_validation
    onnx_tensor_shape = onnx_tensor.shape
AttributeError: 'NoneType' object has no attribute 'shape'

I couldn't find a similar issue and I had the same problem when I tried to cut YOLOX in our previous discussion. I probably misinterpreted how the tool is supposed to be used...

Mike · Answer 22 · Sat Mar 25 2023 23:34:53 GMT+0800 (China Standard Time)

First, let me tell you that your results will vary greatly depending on the architecture of the CPU you are using for your verification. If you are using an Intel x64 (x86) or AMD x64 (x86) CPU, the Float32 model should run inference about 10 times faster than the INT8 model. INT8 models are very slow on the x64 architecture. On a Raspberry Pi's ARM64 CPU with 4 threads, the INT8 model would perhaps be about 10 times faster. The keyword XNNPACK is a good way to search for information. In the case of Intel's x64 architecture, CPUs of the 10th generation or later differ from CPUs of the 9th generation or earlier in whether they have an optimization mechanism for integer processing. If you are using a 10th generation or later CPU, it should run about 20% faster.

Therefore, when benchmarking using benchmarking tools, it is recommended to try to do so on ARM64 devices.

I compiled the benchmark binary for android_arm64. The device has an Exynos 9810, which is 64-bit ARM. It contains a Mali-G72MP18 GPU. However, I am running the model without GPU accelerators, so the INT8 model must be running on the CPU. The CPU was released in 2018, so that may explain why the quantized model is that slow...

Katsuya Hyodo · Answer 23 · Sun Mar 26 2023 14:33:52 GMT+0800 (China Standard Time)

But I get the following error:

I came home and tried the same conversion as you.

The following command did not generate an error. It is a little strange that the situation is different in your environment and mine. Since scatternd requires a very complex modification at the moment, would the same error occur in ONNX with scatternd replaced with slice?

onnx2tf -i yolox_nano_no_scatternd.onnx -cotof -cotoa 1e-4 -onimc /head/Concat_6_output_0

tflite
yolox_nano_no_scatternd_float32.tflite.zip

INFO: onnx_output_name: /head/stems.2/act/Relu_output_0 tf_output_name: tf.nn.relu_70/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.0/conv/Conv_output_0 tf_output_name: tf.math.add_84/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.0/conv/Conv_output_0 tf_output_name: tf.math.add_85/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.0/act/Relu_output_0 tf_output_name: tf.nn.relu_71/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.0/act/Relu_output_0 tf_output_name: tf.nn.relu_72/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.1/conv/Conv_output_0 tf_output_name: tf.math.add_86/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.1/conv/Conv_output_0 tf_output_name: tf.math.add_87/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.1/act/Relu_output_0 tf_output_name: tf.nn.relu_73/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.1/act/Relu_output_0 tf_output_name: tf.nn.relu_74/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_preds.2/Conv_output_0 tf_output_name: tf.math.add_88/Add:0 shape: (1, 80, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_preds.2/Conv_output_0 tf_output_name: tf.math.add_89/Add:0 shape: (1, 4, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/obj_preds.2/Conv_output_0 tf_output_name: tf.math.add_90/Add:0 shape: (1, 1, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Sigmoid_4_output_0 tf_output_name: tf.math.sigmoid_4/Sigmoid:0 shape: (1, 1, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Sigmoid_5_output_0 tf_output_name: tf.math.sigmoid_5/Sigmoid:0 shape: (1, 80, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_2_output_0 tf_output_name: tf.concat_15/concat:0 shape: (1, 85, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Reshape_2_output_0 tf_output_name: tf.reshape_2/Reshape:0 shape: (1, 85, 169) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_6_output_0 tf_output_name: tf.concat_16/concat:0 shape: (1, 85, 3549) dtype: float32 validate_result:  Matches

onnx2tf -i yolox_nano_no_scatternd.onnx -oiqt -cotof -cotoa 1e-4 -onimc /head/Concat_6_output_0

INT8 tflite
yolox_nano_no_scatternd_integer_quant.tflite.zip

Katsuya Hyodo · Answer 24 · Sun Mar 26 2023 14:46:55 GMT+0800 (China Standard Time)

I compiled the benchmark binary for android_arm64. The device has an Exynos 9810, which is 64-bit ARM. It contains a Mali-G72MP18 GPU. However, I am running the model without GPU accelerators, so the INT8 model must be running on the CPU. The CPU was released in 2018, so that may explain why the quantized model is that slow...

Cortex-A55 may be a somewhat old architecture. I am not very familiar with the details of the CPU architecture, but I think Cortex-A7x may give faster inference because it implements faster operations with Neon instructions. Performance seems to vary considerably depending on whether Arm NN can be called from TFLite.

Katsuya Hyodo · Answer 25 · Sun Mar 26 2023 14:54:05 GMT+0800 (China Standard Time)

Here is a video of me running an INT8 quantized SSD on a RaspberryPi4 CPU (Debian 64bit) alone in 2020.
https://www.youtube.com/watch?v=bd3lTBAYIq4
RaspberryPi4 (CPU only) + Python3.7 + Tensorflow Lite + MobileNetV2-SSDLite + Sync + MP4 640x360
15FPS (about 66ms/pred)

Mike · Answer 26 · Mon Mar 27 2023 02:30:24 GMT+0800 (China Standard Time)

Sorry, I have no idea what I did wrong last time, when I run:

onnx2tf -i yolox_nano_no_scatternd.onnx -cotof -cotoa 1e-4 -onimc /head/Concat_6_output_0

But you are right @PINTO0309, everything looks alright up to that operation:

ONNX and TF output value validation started =========================================
...
INFO: onnx_output_name: /head/Sigmoid_4_output_0 tf_output_name: tf.math.sigmoid_4/Sigmoid:0 shape: (1, 1, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Sigmoid_5_output_0 tf_output_name: tf.math.sigmoid_5/Sigmoid:0 shape: (1, 80, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_2_output_0 tf_output_name: tf.concat_15/concat:0 shape: (1, 85, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Reshape_2_output_0 tf_output_name: tf.reshape_2/Reshape:0 shape: (1, 85, 169) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_6_output_0 tf_output_name: tf.concat_16/concat:0 shape: (1, 85, 3549) dtype: float32 validate_result:  Matches

Mike · Answer 27 · Mon Mar 27 2023 02:35:50 GMT+0800 (China Standard Time)

What seems to differ are the outputs of the Mul operations in the head, for some reason. Not sure if these errors are large enough to break the model completely? But given that I tried @motokimura's suggestion and it didn't work, I guess so...

INFO: onnx_output_name: /head/Mul_output_0 tf_output_name: tf.math.multiply_9/Mul:0 shape: (1, 3549, 2) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.000156402587890625
INFO: onnx_output_name: /head/Mul_1_output_0 tf_output_name: tf.math.multiply_11/Mul:0 shape: (1, 3549, 2) dtype: float32 validate_result:  Unmatched  max_abs_error: 0.000579833984375
INFO: onnx_output_name: /head/Div_output_0 tf_output_name: tf.math.divide/truediv:0 shape: (1, 3549, 2) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Div_1_output_0 tf_output_name: tf.math.divide_1/truediv:0 shape: (1, 3549, 2) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: output tf_output_name: tf.concat_17/concat:0 shape: (1, 3549, 85) dtype: float32 validate_result:  Matches

Mike · Answer 28 · Mon Mar 27 2023 02:48:13 GMT+0800 (China Standard Time)

I compiled the benchmark binary for android_arm64. The device has an Exynos 9810, which is 64-bit ARM. It contains a Mali-G72MP18 GPU. However, I am running the model without GPU accelerators, so the INT8 model must be running on the CPU. The CPU was released in 2018, so that may explain why the quantized model is that slow...

Cortex-A55 may be a somewhat old architecture. I am not very familiar with the details of the CPU architecture, but I think Cortex-A7x may give faster inference because it implements faster operations with Neon instructions. Performance seems to vary considerably depending on whether Arm NN can be called from TFLite.

Apparently the benchmark binary can be run with nnapi delegate by --use_nnapi=true and with GPU delegate by --use_gpu=true (source). This will give a better understanding of how this model actually performs with hardware accelerators. If anybody is interested I can upload those results as well 😄

Katsuya Hyodo · Answer 29 · Mon Mar 27 2023 10:01:07 GMT+0800 (China Standard Time)

I am very interested. Probably other engineers besides myself as well.

Today and tomorrow will involve travel to distant places for work, which will slow down research and work.

Incidentally, Motoki seems to have succeeded in maintaining accuracy with INT8 quantization.

Motoki Kimura · Answer 30 · Mon Mar 27 2023 14:52:30 GMT+0800 (China Standard Time)

I'm going to share how I quantized the nano model tonight. I’ve not yet done qualitative evaluation of the quantized model, but the detection result looks OK.

Motoki Kimura · Answer 31 · Mon Mar 27 2023 17:51:00 GMT+0800 (China Standard Time)

@mikel-brostrom
This repository explains how I quantized the nano model. I hope you find this helpful!
https://github.com/motokimura/yolox-ti-lite_tflite

Note that my model doesn’t include post-process (ONNX model was exported without --export-det).

I compared the inference results from yolox_nano_ti_lite_integer_quant.tflite and ONNX models for some sample images, and confirmed the errors are acceptably small.

Motoki Kimura · Answer 32 · Mon Mar 27 2023 17:57:23 GMT+0800 (China Standard Time)

@mikel-brostrom
As for the accuracy degradation of your static quantized int8 model, I'm concerned your calibration setting might not be correct.

In calibration, representative images called calibration data are fed to the model in order to observe the range of activation values at each layer. Based on the observed ranges, the quantization parameters (scale and offset) used to map fp32 activations to int8 are computed for each layer (onnx2tf does all of this).
So, if the calibration data is not correct, these quantization parameters are not computed properly, resulting in catastrophic accuracy degradation of the quantized model.

Since YOLOX models expect unnormalized pixel values from 0 to 255 as input, I generated calibration data from COCO train images without normalization [code link]. Then I passed it to onnx2tf with the -qcind option, without normalization, as described in the README:

onnx2tf -i yolox_nano_ti_lite.onnx -oiqt -qcind images calib_data_416x416_n200.npy "[[[[0,0,0]]]]" "[[[[1,1,1]]]]"

Did you pass calibration data to onnx2tf like I did?
If -qcind is not specified, onnx2tf seems to use sample calibration data as described here. This sample calibration data seems to be normalized so that the pixel values range from 0 to 1, as written here, and then further normalized with the ImageNet mean and std. As YOLOX models do not expect such normalized pixel values, this causes a problem in the calibration.
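For readers who want to build such a file themselves, here is a minimal sketch of packing unnormalized images into one array. It is not the code behind the [code link] above. It assumes the calibration array is laid out as [N, H, W, 3] float32 with raw 0 to 255 values, uses random arrays in place of real COCO images, and uses a crude nearest-neighbour resize so that only NumPy is needed. The helper names and the output file name are hypothetical, and the channel order the model expects should be checked separately.

```python
import os
import tempfile

import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize of an HxWx3 image to size x size (NumPy only)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def build_calibration_array(images, size=416):
    """Stack images into [N, size, size, 3] float32 WITHOUT any normalization."""
    batch = [resize_nearest(img, size).astype(np.float32) for img in images]
    return np.stack(batch, axis=0)

# Stand-in for real training images: random uint8 arrays (assumption for the demo).
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8) for _ in range(4)]

calib = build_calibration_array(fake_images)

# Hypothetical output location; a real run would use many more images.
out_path = os.path.join(tempfile.gettempdir(), "calib_data_416x416_n4.npy")
np.save(out_path, calib)
print(calib.shape, calib.dtype, calib.min(), calib.max())
```

Because the values are left in the 0 to 255 range, the mean of 0 and std of 1 passed on the command line above leave them unchanged.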

Motoki Kimura · Answer 33 · Mon Mar 27 2023 17:59:33 GMT+0800 (China Standard Time)

Btw, the reason dynamic range quantization worked is that it does not use any calibration data; the quantization parameters are adjusted dynamically for each input (hence the name dynamic quantization, in contrast to static quantization), as PINTO explained above:

If I remember correctly, dynamic range is less prone to accuracy degradation because it recalculates the quantization range each time; compared to INT8 full quantization, the inference speed would have been very slow in exchange for maintaining accuracy.
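A toy NumPy sketch of that difference, under simplifying assumptions: one activation tensor, a range taken directly from min and max, and calibration data deliberately normalized to [0, 1] while the real input spans 0 to 255. It is not how TFLite implements either mode; fake_quant is a hypothetical helper.

```python
import numpy as np

def fake_quant(x, x_min, x_max, levels=255):
    """Quantize to 8-bit steps over [x_min, x_max], clip, and map back to float."""
    scale = (x_max - x_min) / levels
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min

rng = np.random.default_rng(0)
activation = rng.uniform(0.0, 255.0, size=1000)  # values seen at inference time
bad_calib = rng.uniform(0.0, 1.0, size=1000)     # calibration data normalized to [0, 1]

# Static: the range is fixed in advance from the (mismatched) calibration data.
static_out = fake_quant(activation, bad_calib.min(), bad_calib.max())
# Dynamic: the range is recomputed from the actual input.
dynamic_out = fake_quant(activation, activation.min(), activation.max())

static_err = np.abs(static_out - activation).mean()
dynamic_err = np.abs(dynamic_out - activation).mean()
print(f"static (wrong calibration) mean abs error: {static_err:.3f}")
print(f"dynamic mean abs error:                    {dynamic_err:.3f}")
```

With the mismatched range almost every value saturates at the top of the calibrated interval, which is one way a statically quantized model can end up with near-zero accuracy while the dynamic one still works.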

Mike · Answer 34 · Mon Mar 27 2023 23:03:39 GMT+0800 (China Standard Time)

Sorry for my late reply. I spent most of the day creating the benchmark result plot for yolox on the specific hardware I am using. I added delegate results as well. hexagon is skipped as the target device has no qualcomm chip. INT8 models don't get a boost on this chip due to the lack of an INT8 ISA. GPU boosts make sense as the EXYNOS9810 contains a Mali-G72MP18 GPU, but inference speed is quite similar to using XNNPACK with 4 threads.

Any idea why the memory footprint for the GPU delegate is so big compared to the others? Especially for the quantized one?

Exynos 9810 (ARM Mali-G72MP18 GPU). Released: March 01, 2018

Exynos 7870 (ARM Mali-T830 MP2 GPU). Released: February 17, 2016

Mike · Answer 35 · Mon Mar 27 2023 23:08:43 GMT+0800 (China Standard Time)

@mikel-brostrom As for the accuracy degradation of your static quantized int8 model, I'm concerned your calibration setting might not be correct.

In calibration, representative images called calibration data are fed to the model in order to observe the range of activation values at each layer. Based on the observed ranges, the quantization parameters (scale and offset) used to map fp32 activations to int8 are computed for each layer (onnx2tf does all of this). So, if the calibration data is not correct, these quantization parameters are not computed properly, resulting in catastrophic accuracy degradation of the quantized model.

Since YOLOX models expect unnormalized pixel values from 0 to 255 as input, I generated calibration data from COCO train images without normalization [code link]. Then I passed it to onnx2tf with the -qcind option, without normalization, as described in the README:

onnx2tf -i yolox_nano_ti_lite.onnx -oiqt -qcind images calib_data_416x416_n200.npy "[[[[0,0,0]]]]" "[[[[1,1,1]]]]"

Did you pass calibration data to onnx2tf like I did? If -qcind is not specified, onnx2tf seems to use sample calibration data as described here. This sample calibration data seems to be normalized so that the pixel values range from 0 to 1, as written here, and then further normalized with the ImageNet mean and std. As YOLOX models do not expect such normalized pixel values, this causes a problem in the calibration.

I won't have time to check this out today @motokimura. But will report back tomorrow with my findings 😄. Thanks again for your time and guidance

Mike · Answer 36 · Tue Mar 28 2023 22:10:34 GMT+0800 (China Standard Time)

I tried a complete model export (including --export-det) following @motokimura's instructions. I am aware that the post-processing step induces large errors on INT quantized models, as shown here: #269 (comment). Despite all this, I decided to check what performance I would get, as I want to do as little post-processing outside of the model as possible. These are my results:

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	xywh output	calibration images
YOLOX-TI-nano ONNX (original model)	416	0.261	0.418	8.7M	[0, 416]	N/A
YOLOX-TI-nano ONNX (no ScatterND)	416	0.261	0.418	8.7M	[0, 416]	N/A
YOLOX-nano TFLite FP32	416	0.261	0.418	8.7M	[0, 416]	N/A
YOLOX-nano TFLite FP16	416	0.261	0.418	4.4M	[0, 416]	N/A
YOLOX-nano TFLite full_integer_quant	416	0	0	2.4M	[0, 1]	0
YOLOX-nano TFLite full_integer_quant	416	0.039	0.115	2.4M	[0, 1]	200
YOLOX-nano TFLite full_integer_quant	416	0.033	0.098	2.4M	[0, 1]	600
YOLOX-nano TFLite dynamic_range_quant	416	0.259	0.416	2.4M	[0, 1]	200
YOLOX-nano TFLite dynamic_range_quant	416	0.259	0.416	2.4M	[0, 1]	600
YOLOX-nano TFLite integer_quant	416	0.039	0.115	2.4M	[0, 1]	200
YOLOX-nano TFLite integer_quant	416	0.033	0.098	2.4M	[0, 1]	600
YOLOX-nano TFLite integer_quant	416	0	0	2.4M	[0, 416]	200

Sorry for all the experiment results I am dropping here. I hope they can help somebody going through a similar kind of processes. Without the --export-det I get the same results as @motokimura 😄

Mike · Answer 37 · Tue Mar 28 2023 22:26:37 GMT+0800 (China Standard Time)

Why would a multiplication operation be problematic when INT quantizing? #269 (comment)

Katsuya Hyodo · Answer 38 · Tue Mar 28 2023 22:46:40 GMT+0800 (China Standard Time)

Errors of less than 1e-3 hardly make any difference to the accuracy of the model. Errors introduced by Mul can be caused by slight differences in fraction handling between ONNX and TensorFlow. Ignoring it will only cause a difference that is not noticeable to the human eye.

Mike · Answer 39 · Tue Mar 28 2023 22:47:53 GMT+0800 (China Standard Time)

Then something else must be wrong in what I am doing... Will double check tomorrow

Katsuya Hyodo · Answer 40 · Tue Mar 28 2023 22:59:09 GMT+0800 (China Standard Time)

I explained it in a very simplified manner because it would be very complicated to explain in detail. You need to understand how onnx2tf checks the final and intermediate outputs.

Once you understand the principles of the accuracy checker, you will realize that minor errors can always occur, even if the model transformation is perfectly normal.

ONNX is NCHW and TensorFlow is NHWC.
Therefore, the tensor shapes of the intermediate outputs will always differ between the two models.
When comparing the outputs of ONNX and TensorFlow, the TensorFlow tensor is forced to conform to the ONNX tensor shape, and then the absolute error is measured.
Errors below 1e-4 can occur in almost any model due to differences in rounding, truncation, and rounding up criteria between ONNX's internal processing and TensorFlow's internal processing.

Therefore, when comparing model accuracy, it is best to make sure that the final output is Matches.

Final output

INFO: onnx_output_name: output tf_output_name: tf.concat_17/concat:0 shape: (1, 3549, 85) dtype: float32 validate_result:  Matches
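The comparison described above can be pictured with the following sketch. It is not onnx2tf's actual checker, only an illustration of the idea under stated assumptions: the TensorFlow-side tensor is NHWC, it is transposed back to the ONNX NCHW layout, and the maximum absolute error is compared against a tolerance (the -cotoa values used in this thread). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def max_abs_error_nchw_vs_nhwc(onnx_out, tf_out):
    """Transpose the NHWC tensor to NCHW and return the max absolute difference."""
    tf_as_nchw = np.transpose(tf_out, (0, 3, 1, 2))
    return float(np.abs(onnx_out - tf_as_nchw).max())

rng = np.random.default_rng(0)

# Synthetic ONNX-side output in NCHW layout.
onnx_out = rng.normal(size=(1, 85, 13, 13)).astype(np.float32)
# Synthetic TF-side output: same values in NHWC plus a tiny numerical offset.
tf_out = np.transpose(onnx_out, (0, 2, 3, 1)) + np.float32(5e-5)

err = max_abs_error_nchw_vs_nhwc(onnx_out, tf_out)
for atol in (1e-4, 1e-5):
    verdict = "Matches" if err <= atol else "Unmatched"
    print(f"atol={atol:g}: max_abs_error={err:.2e} -> {verdict}")
```

The same tiny discrepancy is reported as a match or a mismatch depending only on the tolerance, which is why an Unmatched line on an intermediate tensor is not by itself evidence of a broken conversion.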

The reason onnx2tf compares the errors of every operation at all is that it sometimes makes mistakes in the way it transposes from NCHW to NHWC. The per-operation check is an auxiliary function for quickly finding where unacceptable errors occur, so that the tool's conversion results can then be given a final check by visual inspection.
Also, this tool does not have the ability to check INT8 accuracy, only Float32 accuracy. Therefore, it should be noted that whether or not Unmatched appears is the result of the precision check in Float32, regardless of whether the model was quantized to INT8 or not.
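A minimal sketch of the comparison described above, assuming a 4D tensor and a tolerance of 1e-4 taken from the figure quoted in this comment. The function name `compare_outputs` and the random test data are illustrative only; this is not onnx2tf's actual implementation.

```python
import numpy as np

def compare_outputs(onnx_out, tf_out, atol=1e-4):
    """Compare an NCHW ONNX tensor with the matching NHWC TensorFlow tensor."""
    # Force the TensorFlow tensor back into the ONNX layout: NHWC -> NCHW.
    tf_as_nchw = np.transpose(tf_out, (0, 3, 1, 2))
    # Measure the largest absolute difference between the two tensors.
    max_abs_err = float(np.max(np.abs(onnx_out - tf_as_nchw)))
    return max_abs_err, max_abs_err <= atol

rng = np.random.default_rng(0)
onnx_out = rng.standard_normal((1, 85, 52, 52)).astype(np.float32)

# A correct conversion: same values in NHWC plus tiny numerical noise.
tf_good = np.transpose(onnx_out, (0, 2, 3, 1)) + np.float32(1e-6)
# A faulty conversion: height and width were swapped during transposition.
tf_bad = np.transpose(onnx_out, (0, 3, 2, 1))

good_err, good_ok = compare_outputs(onnx_out, tf_good)
bad_err, bad_ok = compare_outputs(onnx_out, tf_bad)
print(f"good: max_abs_err={good_err:.2e} matches={good_ok}")
print(f"bad:  max_abs_err={bad_err:.2e} matches={bad_ok}")
```

Small noise stays under the tolerance, while a wrong transposition produces errors on the order of the tensor values themselves, which is the kind of failure the per-operation check is meant to surface.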

However, I am very concerned about the zero mAP in the last benchmark result. 👀

Mike · Answer 41 · Tue Mar 28 2023 23:26:38 GMT+0800 (China Standard Time)

Errors below 1e-4 can occur in almost any model due to differences in rounding, truncation, and rounding up criteria between ONNX's internal processing and TensorFlow's internal processing.

Get it!

Also, this tool does not have the ability to check INT8 accuracy, only Float32 accuracy. Therefore, it should be noted that whether or not Unmatched appears is the result of the precision check in Float32, regardless of whether it was quantized to INT8 or not.

Good to know 😄

However, I am very concerned about the zero mAP in the last benchmark result. 👀

Will double check everything tomorrow just to make sure there are no errors on my side

Katsuya Hyodo · Answer 42 · Wed Mar 29 2023 11:37:45 GMT+0800 (China Standard Time)

A workaround has been implemented to avoid ScatterND shape mismatch errors as much as possible. In v1.8.3, the conversion succeeds as is even if the model includes ScatterND, and the accuracy check now passes without problems.

However, since NMS is included in the post-processing, accuracy verification with random data does not give very good results. For a reliable accuracy check, it is better to use a still image of the kind expected at inference time, because with random data the final output may end up containing zero elements.

https://github.com/PINTO0309/onnx2tf/releases/tag/1.8.3

onnx2tf -i xxx.onnx

In any case, ScatterND converts to a very verbose OP, so it is still better to create a model that replaces it with Slice as much as possible.

Mike · Answer 43 · Wed Mar 29 2023 16:35:35 GMT+0800 (China Standard Time)

In any case, ScatterND converts to a very verbose OP, so it is still better to create a model that replaces it with Slice as much as possible.

I may try this out later today 😄

Mike · Answer 44 · Wed Mar 29 2023 16:52:46 GMT+0800 (China Standard Time)

Also, this tool does not have the ability to check INT8 accuracy, only Float32 accuracy. Therefore, it should be noted that whether or not Unmatched appears is the result of the precision check in Float32, regardless of whether it was quantized to INT8 or not.

I guess I have no option but to wait for #258 to get a better understanding of whether the INT8 model is the problem at all. The only difference between @motokimura's INT8 model and mine is:

In the model, just before the output, I do:

if self.int8:
    xy = torch.div(xy, 416)
    wh = torch.div(wh, 416)
        
outputs = torch.cat([xy, wh, outputs[..., 4:]], dim=-1)

My inference looks like this:

im_batch = im_batch.cpu().numpy()
input = self.input_details[0]
int8 = (input['dtype'] == np.int8 or input['dtype'] == np.uint8)  # is TFLite quantized uint8 model
if int8:
    print('True')
    scale, zero_point = input['quantization']
    im_batch = (im_batch / scale + zero_point).astype(np.int8)  # de-scale
self.interpreter.set_tensor(input['index'], im_batch)
self.interpreter.invoke()
y = []
for output in self.output_details:
    x = self.interpreter.get_tensor(output['index'])
    if int8:
        scale, zero_point = output['quantization']
        x = ((x.astype(np.float32) - zero_point) * scale)   # re-scale
    x[0:4] = x[0:4] * 416 # notice xywh in the model is divided by 416
    y.append(x)  # de-normalize output

Mike · Answer 45 · Wed Mar 29 2023 17:34:43 GMT+0800 (China Standard Time)

Some eval results on first 8 COCO images, just to speed up the comparison process

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	xywh model output	calibration images
YOLOX-TI-nano TFLite FP32	416	0.390	0.653	8.7M	[0, 1]	N/A
YOLOX-TI-nano TFLite FP16	416	0.390	0.653	4.4M	[0, 1]	N/A
YOLOX-TI-nano TFLite full_integer_quant	416	0.135	0.356	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite full_integer_quant_with_int16_act	416	0	0	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite dynamic_range_quant	416	0.389	0.652	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite integer_quant	416	0.135	0.356	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite integer_quant_with_int16_act	416	0.389	0.672	2.4M	[0, 1]	200

full_integer_quant_with_int16_act gives me ValueError: Cannot set tensor: Got value of type FLOAT32 but expected type INT16 for input 0, name: serving_default_images:0. This is not the case for integer_quant_with_int16_act. Casting the input with .astype(np.int16) makes the error go away, but the result is 0 mAP.
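A plain `.astype(np.int16)` truncates the float values instead of applying the input tensor's scale and zero point. Whether that is the actual cause of the 0 mAP above is not confirmed in this thread; the sketch below only shows the difference. The helper `quantize_input` is hypothetical, the dictionary mimics one entry of the interpreter's input details, and the scale and zero point are made-up values.

```python
import numpy as np

def quantize_input(im_batch, input_detail):
    """Map a float batch to the integer dtype the model input expects."""
    dtype = input_detail["dtype"]
    if np.issubdtype(dtype, np.floating):
        # Float model: nothing to quantize.
        return im_batch.astype(dtype)
    scale, zero_point = input_detail["quantization"]
    info = np.iinfo(dtype)
    # Affine quantization: q = round(x / scale + zero_point), clipped to dtype range.
    q = np.round(im_batch / scale + zero_point)
    return np.clip(q, info.min, info.max).astype(dtype)

# Made-up input details for an INT16 input (values chosen for readability).
detail = {"dtype": np.int16, "quantization": (1.0 / 256, 0)}
im_batch = np.array([0.0, 0.5, 1.0], dtype=np.float32)

naive = im_batch.astype(np.int16)             # truncation only
quantized = quantize_input(im_batch, detail)  # scale and zero point applied
print(naive, quantized)
```

With an image normalized to [0, 1], the naive cast collapses almost every pixel to 0, whereas the scaled version keeps the information.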

Motoki Kimura · Answer 46 · Wed Mar 29 2023 18:24:49 GMT+0800 (China Standard Time)

@mikel-brostrom
Thanks for sharing your results! #269 (comment)
The accuracy degradation because of the decoder is interesting..

You may find something if you compare the fp32/int8 TFLite final outputs.
Even without onnx2tf's new feature, you can do it by saving output arrays into npy files and then compare them.

The figure below is the one when I quantized YOLOv3.
Left shows the distribution of x channel, and right shows the distribution of w channel.
Orange is float, and blue is quantized.

In YOLOv3 case above, w channel has large quantization error.
If you can visualize the output distribution like this, we may find which channel (x, y, w, h, and/or class) causes this accuracy degradation.
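The comparison suggested above can also be done numerically before plotting anything. This is a rough sketch that assumes both outputs have already been dequantized to float and have shape (batch, boxes, 85) with the channel order x, y, w, h, objectness, 80 class scores. The function name and the synthetic data are illustrative; in practice the two arrays would come from `np.load` on the saved npy files.

```python
import numpy as np

def per_channel_error(float_out, int8_out):
    """Mean absolute error per channel group between float and quantized outputs."""
    abs_err = np.abs(float_out - int8_out)
    groups = {
        "x": slice(0, 1), "y": slice(1, 2),
        "w": slice(2, 3), "h": slice(3, 4),
        "obj": slice(4, 5), "cls": slice(5, None),
    }
    return {name: float(abs_err[..., s].mean()) for name, s in groups.items()}

# Synthetic stand-ins for the saved outputs of the fp32 and int8 models.
rng = np.random.default_rng(0)
float_out = rng.random((1, 3549, 85)).astype(np.float32)
int8_out = float_out.copy()
int8_out[..., 2] += 0.5  # simulate a large quantization error on the w channel

errors = per_channel_error(float_out, int8_out)
worst = max(errors, key=errors.get)
print(errors, "worst:", worst)
```

The channel group with the largest error is the first candidate for a histogram like the one described above.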

Katsuya Hyodo · Answer 47 · Wed Mar 29 2023 18:35:07 GMT+0800 (China Standard Time)

Just a hunch on my part, but if you do not Concat at the end, maybe there will be no accuracy degradation. I will have to try it out to find out. In the first place, I feel that the difference in value ranges is too large. Then Concat may not be relevant.

Ref: #269 (comment)

By the way, _int16_act seems to be an experimental implementation of TFLite, so there are still many bugs or unsupported OPs.
https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8

TensorFlow Lite now supports converting activations to 16-bit integer values
and weights to 8-bit integer values during model conversion from TensorFlow 
to TensorFlow Lite's flat buffer format. We refer to this mode as the "16x8 quantization mode".
This mode can improve accuracy of the quantized model significantly, 
when activations are sensitive to the quantization, while still achieving almost 3-4x reduction 
in model size. Moreover, this fully quantized model can be consumed by integer-only hardware accelerators.

Mike · Answer 48 · Wed Mar 29 2023 19:16:22 GMT+0800 (China Standard Time)

Just a hunch on my part, but if you do not Concat at the end, maybe there will be no accuracy degradation.

Trying this out right away

Mike · Answer 49 · Wed Mar 29 2023 19:24:37 GMT+0800 (China Standard Time)

@PINTO0309 🚀 ! I just implemented what you explained here: #269 (comment). What is the rationale behind this?

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	xywh model output	calibration images
YOLOX-TI-nano TFLite FP32	416	0.390	0.653	8.7M	[0, 1]	N/A
YOLOX-TI-nano TFLite FP16	416	0.390	0.653	4.4M	[0, 1]	N/A
YOLOX-TI-nano TFLite full_integer_quant	416	0.362	0.641	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite full_integer_quant_with_int16_act	416	0	0	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite dynamic_range_quant	416	0.389	0.652	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite integer_quant	416	0.362	0.641	2.4M	[0, 1]	200
YOLOX-TI-nano TFLite integer_quant_with_int16_act	416	0.389	0.672	2.4M	[0, 1]	200

Katsuya Hyodo · Answer 50 · Wed Mar 29 2023 19:34:08 GMT+0800 (China Standard Time)

I was looking at the table over here. #269 (comment)

INT8 can only hold values in the range 0 to 255 (or -128 to +127). Therefore, if we merge a flow that wants to express values in the range 0 to 1 with a flow that wants to express values in the range 0 to 416, I feel that almost all elements in the one that wants to express the range 0 to 1 will end up at approximately 0.

So this may well be the problem, and I suspect that the same problem can occur at every earlier point where Concat goes to the trouble of merging flows into 85 channels. I have a feeling that it would work if flows with significantly different value ranges were kept as separate flows all the way through, without ever merging them.

All of this is only speculation, as I have not actually tested it myself.
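The effect described above can be sketched with a simplified 256-level quantizer. This is an illustration of per-tensor quantization in general, not TFLite's exact scheme, and the score and box values are made up.

```python
import numpy as np

def fake_quantize(x, x_min, x_max, num_levels=256):
    """Quantize x to num_levels steps over [x_min, x_max], then dequantize."""
    scale = (x_max - x_min) / (num_levels - 1)
    q = np.clip(np.round((x - x_min) / scale), 0, num_levels - 1)
    return q * scale + x_min

scores = np.array([0.05, 0.3, 0.6, 0.95])  # values living in 0..1
boxes = np.array([12.0, 208.0, 416.0])     # values living in 0..416

# After Concat, a single range 0..416 has to cover both flows.
merged = fake_quantize(np.concatenate([boxes, scores]), 0.0, 416.0)
scores_shared = merged[len(boxes):]

# Kept separate, the scores get the full 256 levels for 0..1.
scores_separate = fake_quantize(scores, 0.0, 1.0)

print("shared range:  ", scores_shared)
print("separate range:", scores_separate)
```

With the shared range one step is about 1.63, so every score below roughly 0.8 comes back as exactly 0, while the separate range reproduces the scores to within about 0.002.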

Mike · Answer 51 · Wed Mar 29 2023 19:42:23 GMT+0800 (China Standard Time)

Output looks like this now:

Katsuya Hyodo · Answer 52 · Wed Mar 29 2023 19:43:37 GMT+0800 (China Standard Time)

The position of Dequantize has obviously changed.

I am also interested in the quantization range for this area.

Mike · Answer 53 · Wed Mar 29 2023 19:52:07 GMT+0800 (China Standard Time)

In/out quantization from top-left to bottom-right of the operations you pointed at:

quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148
quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148

quantization: -2.3114538192749023 ≤ 0.00010453650611452758 * q ≤ 3.4253478050231934
quantization: 0.00014265520439948887 * q

quantization: -2.2470905780792236 ≤ 0.00011867172725033015 * q ≤ 3.888516426086426
quantization: 0.00014265520439948887 * q

quantization: 0.00014265520439948887 * q
quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148

Katsuya Hyodo · Answer 54 · Wed Mar 29 2023 19:54:24 GMT+0800 (China Standard Time)

It looks fine to me.

Mike · Answer 55 · Wed Mar 29 2023 19:54:54 GMT+0800 (China Standard Time)

Going for a full COCO eval now 🚀

Motoki Kimura · Answer 56 · Wed Mar 29 2023 20:08:56 GMT+0800 (China Standard Time)

Great! 🚀🚀

Mike · Answer 57 · Wed Mar 29 2023 20:15:42 GMT+0800 (China Standard Time)

Great that we get this into YOLOv8 as well @motokimura! Thank you both for this joint effort ❤️

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	calibration images
YOLOX-TI-nano TFLite FP32	416	0.261	0.418	8.7M	N/A
YOLOX-TI-nano TFLite INT8	416	0.242	0.408	2.4M	200
YOLOX-TI-nano TFLite INT8	416	0.243	0.408	2.4M	800

Katsuya Hyodo · Answer 58 · Wed Mar 29 2023 20:23:19 GMT+0800 (China Standard Time)

congratulations! 👍

Katsuya Hyodo · Answer 59 · Thu Mar 30 2023 09:30:46 GMT+0800 (China Standard Time)

I will close this issue, since the original problem has been solved and the INT8 quantization problem also seems to have been resolved.

Mike · Answer 60 · Fri Mar 31 2023 17:09:34 GMT+0800 (China Standard Time)

Sorry for bothering you again but one thing is still unclear to me. Even when bringing the xy, wh, probs values to [0, 1] and then quantizing the model with a single output:

results are much worse than using separate xy, wh, probs outputs like this:

From our lengthy discussion I recall this:

Therefore, if we merge a flow that wants to express values in the range 0 to 1 with a flow that wants to express values in the range 0 to 416, I feel that almost all elements in the one that wants to express the range 0 to 1 will end up at approximately 0.

and this:

In TFLite quantization, activation is quantized in per-tensor manner. That is, the OR distribution of xywh and scores, (min, max) = (0.0, 416.0), is mapped to integer values of (min, max) = (0, 255) after the Concat. As a result, even if the score is 1.0, after quantization it is mapped to: int(1.0 / 416 * 255) = int(0.61) = 0, resulting in all scores being zero!

Which makes total sense to me, especially given the disparity between the ranges within the same output. But why are the quantization results so much worse for the model with a single output, given that all values now share the same range? Does this make sense to you?

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	calibration images
YOLOX-TI-nano SINGLE OUTPUT	416	0.064	0.240	2.4M	8
YOLOX-TI-nano TFLite XY, WH, PROBS OUTPUT	416	0.242	0.408	2.4M	8

Katsuya Hyodo · Answer 61 · Fri Mar 31 2023 17:39:22 GMT+0800 (China Standard Time)

There is no part of the model left to explain in more detail than Motoki's explanation, but again, take a good look at the quantization parameters around the final output of the model. I think you can see why Concat is a bad idea.

All 1.7974882125854492 * (q + 128)

The values diverge when inverse quantization (Dequantize) is performed.

onnx2tf -i yolox_nano_no_scatternd.onnx -oiqt -qt per-tensor

Perhaps that is why TI used ScatterND.
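To see what the quoted parameters imply, the sketch below applies the scale 1.7974882125854492 and the zero point -128 from this comment to a few made-up values in the range 0 to 1. The rounding and clipping follow the usual affine INT8 formula and are an assumption about how the runtime behaves.

```python
import numpy as np

SCALE = 1.7974882125854492  # quoted above
ZERO_POINT = -128           # quoted above as scale * (q + 128)

def quantize(x):
    """Float -> INT8 with the quoted output parameters."""
    q = np.round(np.asarray(x, dtype=np.float32) / SCALE) + ZERO_POINT
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q):
    """INT8 -> float with the quoted output parameters."""
    return SCALE * (np.asarray(q, dtype=np.float32) - ZERO_POINT)

# Largest value the output tensor can represent (q = 127).
representable_max = float(dequantize(127))

# Made-up values in 0..1 pushed through quantize and dequantize.
roundtrip = dequantize(quantize([0.2, 0.5, 0.8, 1.0]))
print(representable_max, roundtrip)
```

The representable range tops out near 458 and the step size is about 1.8, so values between 0 and 1 can only come back as 0 or about 1.8.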

Motoki Kimura · Answer 62 · Fri Mar 31 2023 19:06:59 GMT+0800 (China Standard Time)

In your inference code posted in this comment,

x[0:4] = x[0:4] * 416 # notice xywh in the model is divided by 416

The first dim of x should be batch dim, I think.

However, this should decrease the accuracy of float models as well..

Mike · Answer 63 · Fri Mar 31 2023 19:23:43 GMT+0800 (China Standard Time)

Yup, sorry @motokimura, that's a typo. It is

outputs[:, :, 0:4] = outputs[:, :, 0:4] * 416
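For reference, the difference between the two indexing forms on an output of shape (1, 3549, 85), shown on a dummy array of ones:

```python
import numpy as np

outputs = np.ones((1, 3549, 85), dtype=np.float32)

# x[0:4] slices the first (batch) axis, so every channel gets scaled.
wrong = outputs.copy()
wrong[0:4] = wrong[0:4] * 416

# [:, :, 0:4] slices the last axis, so only x, y, w, h get scaled.
right = outputs.copy()
right[:, :, 0:4] = right[:, :, 0:4] * 416

print(wrong[0, 0, :6])  # all six values are 416
print(right[0, 0, :6])  # first four are 416, the rest stay 1
```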

Motoki Kimura · Answer 64 · Fri Mar 31 2023 19:40:31 GMT+0800 (China Standard Time)

I have no idea what is happening in Concat..

As I posted, you may find something if you compare the distribution of outputs from float/int8 models.

Motoki Kimura · Answer 65 · Fri Mar 31 2023 20:53:31 GMT+0800 (China Standard Time)

@mikel-brostrom
Can you check what happens if you apply clipping to xy and wh before Concat?

if self.int8:
    xy = torch.div(xy, 416)
    wh = torch.div(wh, 416)
    # clipping
    xy = torch.clamp(xy, min=0, max=1)
    wh = torch.clamp(wh, min=0, max=1)
        
outputs = torch.cat([xy, wh, outputs[..., 4:]], dim=-1)

Assumption: xy and/or wh may have a few outliers which make quantization range much wider than we expected. Especially wh can have such outliers because Exp is used as activation function.
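The assumption above can be illustrated with a small sketch: one outlier is enough to widen the per-tensor range, and clamping restores it. The data and the outlier value of 20.0 are made up, and `int8_step` is a hypothetical helper. This shows only the mechanism behind the hypothesis, not what actually happens inside this model.

```python
import numpy as np

def int8_step(values):
    """Step size of a 256-level per-tensor quantizer covering min..max of values."""
    return (values.max() - values.min()) / 255.0

rng = np.random.default_rng(0)
wh = rng.random(1000)             # normalized widths/heights in [0, 1)
wh_outlier = np.append(wh, 20.0)  # one made-up outlier, e.g. from a large Exp output

step_clean = int8_step(wh)
step_outlier = int8_step(wh_outlier)
step_clamped = int8_step(np.clip(wh_outlier, 0.0, 1.0))

print(f"clean:   {step_clean:.5f}")
print(f"outlier: {step_outlier:.5f}")
print(f"clamped: {step_clamped:.5f}")
```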

Mike · Answer 66 · Sun Apr 02 2023 02:57:49 GMT+0800 (China Standard Time)

Good point @motokimura. Reporting back on Monday 😊

Mike · Answer 67 · Mon Apr 03 2023 21:25:35 GMT+0800 (China Standard Time)

Interesting. It actually made it worse...

Model	input size	mAP^val 0.5:0.95	mAP^val 0.5	model size	calibration images
YOLOX-TI-nano TFLite XY, WH, PROBS OUTPUT	416	0.242	0.408	2.4M	8
YOLOX-TI-nano SINGLE OUTPUT	416	0.062	0.229	2.4M	8
YOLOX-TI-nano SINGLE OUTPUT (Clamped xywh)	416	0.028	0.103	2.4M	8

Motoki Kimura · Answer 68 · Tue Apr 04 2023 09:08:46 GMT+0800 (China Standard Time)

At this point I have nothing to add beyond this comment about the quantization of Concat, and I don't know what kind of quantization errors are actually happening inside. This Concat is not necessary by nature and has no benefit for the model quantization, so I think we don't need to go any deeper with this.

All I can say at this point is that tensors with very different value ranges should not be concatenated, especially in post-processing of the model.

Thank you for doing the experiment and sharing your results!

Mike · Answer 69 · Tue Apr 04 2023 14:23:42 GMT+0800 (China Standard Time)

This Concat is not necessary by nature and has no benefit for the model quantization, so I think we don't need to go any deeper with this.

Agreed, let's close this. Enough experimentation on this topic 😄. Again, thank you both @motokimura, @PINTO0309 for your time and guidance during this quantization journey. I learnt a lot, and hopefully you got something out of the experiment results posted here as well 🙏