Zero recall value while evaluating on LMO dataset

Question

supriya-gdptl opened this issue 2 years ago · comments

Supriya Gadi Patil commented 2 years ago

I tried to evaluate the GDR-Net model on LMO dataset using the pretrained models you shared on OneDrive.
I used the following command to run the evaluation:

python core/gdrn_modeling/main_gdrn.py \
    --config-file configs/gdrn/lmo/a6_cPnP_AugAAETrunc_BG0.5_lmo_real_pbr0.1_40e.py \
    --num-gpus 1 \
    --eval-only \
    --opts MODEL.WEIGHTS=output/gdrn/lmo/a6_cPnP_AugAAETrunc_BG0.5_lmo_real_pbr0.1_40e/gdrn_lmo_real_pbr.pth

However, it is showing zero recall values. Please see the screenshot below.
Could you please help?

Thank you,
Supriya

Gu Wang · Answer 1 · Sun Dec 04 2022 21:28:10 GMT+0800 (China Standard Time)

Maybe you should check your full running log to see where the problem is.

Supriya Gadi Patil · Answer 2 · Mon Dec 05 2022 04:59:20 GMT+0800 (China Standard Time)

The features from the backbone are a tensor of zeros (line 121 in GDRN.py). Because of this, all subsequent steps also output zero tensors.

features = 
tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')

The log says all weights (backbone, pnp_net and rot_head) from the checkpoint are loaded correctly, yet the output of the backbone is still a zero tensor.
Have you encountered such an error before? Do you know what might be causing it?
See the log below:

20221204_124725|fvcore.common.checkpoint@152: [Checkpointer] Loading from D:/research/data/gdrnet_data/gdrn/lmo/a6_cPnP_AugAAETrunc_BG0.5_lmo_real_blender_160e/model_final_wo_optim.pth ...
20221204_124728|d2.checkpoint.c2_model_loading@324: Following weights matched with model:
| Names in Model                        | Names in Checkpoint                                                                       | Shapes                         |
|:--------------------------------------|:------------------------------------------------------------------------------------------|:-------------------------------|
| backbone.bn1.*                        | backbone.bn1.{bias,num_batches_tracked,running_mean,running_var,weight}                   | (64,) () (64,) (64,) (64,)     |
| backbone.conv1.weight                 | backbone.conv1.weight                                                                     | (64, 3, 7, 7)                  |
| backbone.layer1.0.bn1.*               | backbone.layer1.0.bn1.{bias,num_batches_tracked,running_mean,running_var,weight}          | (64,) () (64,) (64,) (64,)     |
| backbone.layer1.0.bn2.*               | backbone.layer1.0.bn2.{bias,num_batches_tracked,running_mean,running_var,weight}          | (64,) () (64,) (64,) (64,)     |
| backbone.layer1.0.conv1.weight        | backbone.layer1.0.conv1.weight                                                            | (64, 64, 3, 3)                 |
| backbone.layer1.0.conv2.weight        | backbone.layer1.0.conv2.weight                                                            | (64, 64, 3, 3)                 |
| backbone.layer1.1.bn1.*               | backbone.layer1.1.bn1.{bias,num_batches_tracked,running_mean,running_var,weight}          | (64,) () (64,) (64,) (64,)     |
| backbone.layer1.1.bn2.*               | backbone.layer1.1.bn2.{bias,num_batches_tracked,running_mean,running_var,weight}          | (64,) () (64,) (64,) (64,)     |
| backbone.layer1.1.conv1.weight        | backbone.layer1.1.conv1.weight                                                            | (64, 64, 3, 3)                 |
....
| pnp_net.fc1.*                         | pnp_net.fc1.{bias,weight}                                                                 | (1024,) (1024,8192)            |
| pnp_net.fc2.*                         | pnp_net.fc2.{bias,weight}                                                                 | (256,) (256,1024)              |
| pnp_net.fc_r.*                        | pnp_net.fc_r.{bias,weight}                                                                | (6,) (6,256)                   |
| pnp_net.fc_t.*                        | pnp_net.fc_t.{bias,weight}                                                                | (3,) (3,256)                   |
.....
| rot_head_net.features.0.weight        | rot_head_net.features.0.weight                                                            | (512, 256, 3, 3)               |
| rot_head_net.features.1.*             | rot_head_net.features.1.{bias,num_batches_tracked,running_mean,running_var,weight}        | (256,) () (256,) (256,) (256,) |
| rot_head_net.features.10.weight       | rot_head_net.features.10.weight                                                           | (256, 256, 3, 3)               |

Thank you,
Supriya
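
The "weights matched" table in the log lists names and shapes, not values, so it cannot by itself rule out zeroed weights. One way to narrow this down is to scan the model for tensors that are entirely zero, once right after the checkpoint is loaded and again just before inference. The following is a minimal sketch, assuming a standard PyTorch `nn.Module`; `find_all_zero_tensors` is a hypothetical helper and is not part of GDR-Net or detectron2.

```python
from typing import List

import torch
from torch import nn


def find_all_zero_tensors(model: nn.Module) -> List[str]:
    """Return the names of parameters and buffers whose entries are all zero.

    Hypothetical debugging helper, not part of GDR-Net or detectron2.
    """
    zero_names = []
    # state_dict() covers learnable parameters as well as buffers such as
    # BatchNorm running_mean / running_var.
    for name, tensor in model.state_dict().items():
        if tensor.numel() > 0 and torch.count_nonzero(tensor).item() == 0:
            zero_names.append(name)
    return zero_names
```

A few all-zero entries can be legitimate (for example, a BatchNorm bias that was never updated), but all-zero convolution weights in the backbone would point to the weights being overwritten after loading.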

Supriya Gadi Patil · Answer 3 · Mon Dec 05 2022 10:01:59 GMT+0800 (China Standard Time)

Hi @wangg12,

Could you please tell me which version of detectron2 you used?
The detectron2 website link that you shared in README.md is for detectron2 version 0.6.

Circled in red in the image below

Gu Wang · Answer 4 · Mon Dec 05 2022 11:33:13 GMT+0800 (China Standard Time)

Yes, but I installed from source. It seems you are running on Windows; could you run the code on Ubuntu?

Supriya Gadi Patil · Answer 5 · Mon Dec 05 2022 13:33:06 GMT+0800 (China Standard Time)

Thank you for the suggestion @wangg12.

I figured out the issue.
The features from the backbone were zero because the backbone weights were zero.
The checkpoint was loaded correctly, but for some unknown reason, line 550 in gdrn_evaluator.py was resetting the weights to zero.

I resolved this issue by loading the checkpoint again after line 550.
I got the following result.

Could you please tell me what each metric in the first column stands for, i.e. ad_2, rete_2, re_2, te_2, proj_2, re, and te?

Thank you,
Supriya
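
The workaround described above reloads the checkpoint from disk after the step that zeroes the weights. An equivalent idea that avoids a second read from disk is to snapshot the weights in memory before the suspect step and restore them afterwards. The following is a minimal sketch under that assumption; `call_preserving_weights` and the `step` callable are hypothetical names, and this is not the change that was actually made in gdrn_evaluator.py.

```python
import copy
from typing import Any, Callable

from torch import nn


def call_preserving_weights(model: nn.Module, step: Callable[[nn.Module], Any]) -> Any:
    """Run `step(model)` and then restore the weights the model had before.

    Hypothetical helper: `step` stands in for whatever call resets the weights.
    """
    # Deep-copy so that in-place changes made by `step` do not touch the snapshot.
    snapshot = copy.deepcopy(model.state_dict())
    result = step(model)
    # load_state_dict copies the saved values back into the existing tensors.
    model.load_state_dict(snapshot)
    return result
```

Note that the snapshot temporarily doubles the memory used by the weights on whichever device the model lives on. It also only masks the symptom; finding out why the weights are reset remains the proper fix.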

Gu Wang · Answer 6 · Mon Dec 05 2022 14:20:24 GMT+0800 (China Standard Time)

You can find what those metrics mean here: https://github.com/THU-DA-6D-Pose-Group/GDR-Net/blob/main/core/gdrn_modeling/gdrn_custom_evaluator.py#L772