ONNX model cannot be simplified or pass onnx.checker, and produces weird output

Question

lucasjinreal opened this issue 3 years ago

MagicSource commented 3 years ago

The exported ONNX model has very weird dimensions, so it cannot be simplified or pass onnx.checker.check_model.

This is the verbose output of exporting DETR:

This is the verbose output of exporting AnchorDETR:

Both are the last several layers. As you can see, for DETR the strides seem very small,

but for AnchorDETR they are something like Float(1, 1, 900, 91, strides=[81900, 81900, 91, 1], requires_grad=1, device=cuda:0), a giant value.

And this causes the following error when I try to check or simplify the model:

ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 3714028571

Any idea?

wym · Answer 1 · Thu Oct 14 2021 18:36:13 GMT+0800 (China Standard Time)

Hi, @jinfagang

Both are the last several layers. As you can see, for DETR the strides seem very small,

but for AnchorDETR they are something like Float(1, 1, 900, 91, strides=[81900, 81900, 91, 1], requires_grad=1, device=cuda:0), a giant value.

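I think the stride values are normal. For a contiguous tensor the strides follow from the shape: if the shape is [a, b, c, d], then the strides are [b*c*d, c*d, d, 1]. For the shape (1, 1, 900, 91) that gives [81900, 81900, 91, 1], which matches the log.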
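As a quick check of that rule, here is a minimal sketch in plain Python. The helper name contiguous_strides is made up for illustration, and it assumes a contiguous row-major tensor:

```python
def contiguous_strides(shape):
    """Return the element strides of a contiguous (row-major) tensor of this shape."""
    strides = []
    step = 1
    # Walk the dimensions from last to first: each stride is the product
    # of all dimension sizes to its right.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


# Shape of the AnchorDETR output quoted in the issue.
print(contiguous_strides((1, 1, 900, 91)))  # (81900, 81900, 91, 1)
```

So the large numbers only reflect the 900 x 91 trailing dimensions, not a corrupted tensor.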

And this causes the following error when I try to check or simplify the model:

The onnx.checker failure is caused by onnx_simplifier, not by the exported ONNX model. The exported ONNX model passes onnx.checker.check_model, and I will push code that runs the checker to export_onnx.py.

The model size is still normal before the function eliminate_const_nodes in onnx_simplifier runs. For this problem, I suggest you open an issue in the onnx-simplifier repo.
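To see on which side of the limit a model falls, a rough sketch like the following may help. The function names and the 2 GiB threshold are assumptions for illustration (the exact cutoff used by the checker may differ slightly), and check_model_file needs the onnx package installed:

```python
# Approximate protobuf message limit; the exact threshold is an assumption.
PROTOBUF_LIMIT_BYTES = 2 * 1024**3


def exceeds_protobuf_limit(num_bytes):
    """True if a serialized model of this size is over the ~2 GB limit."""
    return num_bytes > PROTOBUF_LIMIT_BYTES


def check_model_file(path):
    """Run the ONNX checker on a model file (requires the onnx package)."""
    import onnx  # imported lazily so the size helper works without onnx

    # Passing the file path instead of a loaded ModelProto lets the checker
    # handle models that are larger than 2 GB.
    onnx.checker.check_model(path)


# Size reported in the ValueError quoted in this thread.
print(exceeds_protobuf_limit(3714028571))  # True
```

Running check_model_file on the exported file and again on the simplified file would show which step produces the oversized graph.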

MagicSource · Answer 2 · Thu Oct 14 2021 19:23:33 GMT+0800 (China Standard Time)

@tangjiuqi097 Yes, I already have, but the issue is not only there.

onnx-simplifier eliminates constant values and simplifies the whole graph by calling ONNX optimizer functions. In other words, without it a transformer model cannot be converted to any other framework, or cannot be converted in an optimized way, which makes the export meaningless.

I just don't know why AnchorDETR cannot pass the simplifier while DETR can.

I tried DETR: the AnchorDETR and DETR model files are about the same size, but the DETR model can be simplified and, once simplified, run with ONNX Runtime.

If you try to run inference with the AnchorDETR ONNX model, you will find that the results are all NaN. This means the model cannot be run correctly with ONNX Runtime (even though no error may be thrown). And I think that even if it can run on CPU with ONNX Runtime, that doesn't mean it can run on GPU.

wym · Answer 3 · Thu Oct 14 2021 21:38:17 GMT+0800 (China Standard Time)

@jinfagang Hi,

If you try to run inference with the AnchorDETR ONNX model, you will find that the results are all NaN.

This problem may be the same as issue #49 in DETR, and I will fix it the way they did. But I am not sure whether the simplifier problem is related to it.

MagicSource · Answer 4 · Thu Oct 14 2021 21:47:42 GMT+0800 (China Standard Time)

@tangjiuqi097 Thanks for noticing this. Given that your ONNX export keeps nested_tensor_list outside the whole trace scope, it might not be closely related to that problem, but it is worth a try. I am still puzzled that the model cannot be simplified, because without simplification it is hard to deploy on TensorRT or TVM.

wym · Answer 5 · Fri Oct 15 2021 13:56:57 GMT+0800 (China Standard Time)

@jinfagang Now the problems are fixed by following #173 in DETR.

The cause is that ONNX does not support the slice assignment in nested_tensor_from_tensor_list. As a result, all regions end up masked, which leads to NaN in the feature positions and attention weights.

The size increase with onnx_simplifier disappeared after fixing the bug in nested_tensor_from_tensor_list. You can open an issue in onnx-simplifier if you are interested in the reason.
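To illustrate why a fully masked input can produce NaN attention weights, here is a small sketch in plain Python. It is not the model's code: masked_softmax is a hypothetical stand-in for the softmax inside an attention layer, with masked positions set to negative infinity.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores; mask[i] is True for padded (ignored) positions."""
    neg_inf = float("-inf")
    masked = [neg_inf if m else s for s, m in zip(scores, mask)]
    peak = max(masked)
    # Subtract the maximum for numerical stability.
    # If every position is masked, peak is -inf and (-inf) - (-inf) is NaN.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]
# One padded position: the weights are finite and the padded one is 0.
print(masked_softmax(scores, [False, False, True]))
# Everything masked, as happens when the slice assignment is lost on export.
print(masked_softmax(scores, [True, True, True]))  # [nan, nan, nan]
```

Once any attention weight is NaN it propagates through the remaining layers, which is consistent with every output being NaN.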

MagicSource · Answer 6 · Fri Oct 15 2021 14:24:19 GMT+0800 (China Standard Time)

@tangjiuqi097 Nice, let me have a try.

wym · Answer 7 · Fri Oct 15 2021 17:19:29 GMT+0800 (China Standard Time)

Hi,
1.

assert (np.abs(res1[0].cpu().numpy()-res2[0]).max() < 1e-5) and (np.abs(res1[1].cpu().numpy()-res2[1]).max() < 1e-5), "inaccurate results"

AssertionError: inaccurate results

This is because the 1e-5 tolerance is too strict for pred_logits. The results are actually fine, and I have updated the code.
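A minimal sketch of such a tolerance check, in plain Python. The numbers and the looser 1e-4 threshold are made up for illustration; the thread does not say which value the updated code uses.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def outputs_match(reference, candidate, atol):
    """True if every element differs by less than atol."""
    return max_abs_diff(reference, candidate) < atol


# Made-up logits: the two runs differ by about 3e-5 at most.
pytorch_logits = [-4.20000, 1.30000, 0.75000]
onnx_logits = [-4.20003, 1.30001, 0.74998]

print(outputs_match(pytorch_logits, onnx_logits, atol=1e-5))  # False
print(outputs_match(pytorch_logits, onnx_logits, atol=1e-4))  # True
```

Differences of this size are typical float32 rounding noise between runtimes, so a slightly looser threshold still catches real export bugs.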

And I can't get any detections using this onnx inference script:

Did you load the checkpoint before exporting the ONNX model?

MagicSource · Answer 8 · Fri Oct 15 2021 18:29:21 GMT+0800 (China Standard Time)

It looks normal now. I will do TensorRT acceleration later.

wym · Answer 9 · Fri Oct 15 2021 19:58:55 GMT+0800 (China Standard Time)

@jinfagang BTW, as we use the focal loss for the category loss, it should be better to use sigmoid instead of softmax to get the confidence score.
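A rough sketch of the difference, in plain Python. The helper names and logits are hypothetical, and the real post-processing in the repo may differ:

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_scores(logits):
    """Independent per-class confidences; they need not sum to 1."""
    return [sigmoid(z) for z in logits]


def softmax_scores(logits):
    """Confidences normalized across classes; they always sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one query where the model sees no confident class.
logits = [-3.0, -2.5, -4.0]
print(max(sigmoid_scores(logits)))  # stays low
print(max(softmax_scores(logits)))  # inflated by normalization
```

With focal loss each class is trained as an independent binary problem, so the sigmoid scores keep a low-confidence query low, while softmax forces the classes to share a total of 1.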

MagicSource · Answer 10 · Fri Oct 15 2021 21:45:52 GMT+0800 (China Standard Time)

@tangjiuqi097 Thanks for the advice.