microsoft / RegionCLIP

[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"

Could you tell me the meaning of these performance metrics?

JiuqingDong opened this issue

I ran the code with the default settings and got the following results:

```
python3 ./tools/train_net.py \
    --num-gpus 3 \
    --config-file ./configs/COCO-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_ovd.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_coco_48.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_48_base_cls_emb.pth \
    MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth
```

```
[03/28 10:32:16 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[03/28 10:32:17 d2.evaluation.coco_evaluation]: Saving results to ./output/inference/coco_instances_results.json
...
[03/28 10:32:31 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
|   AP   |  AP50  |  AP75  |  APs   |  APm   |  APl   |
| 31.625 | 49.910 | 33.474 | 15.806 | 34.558 | 42.920 |
[03/28 10:32:31 d2.evaluation.coco_evaluation]: AP50_split_target AP: 0.2994920059113552
[03/28 10:32:31 d2.evaluation.coco_evaluation]: AP50_split_base AP: 0.5697890511639334
[03/28 10:32:31 d2.evaluation.coco_evaluation]: AP50_split_all AP: 0.49909597779018217
[03/28 10:32:31 d2.evaluation.coco_evaluation]: Per-category bbox AP:
| category | AP     | category   | AP     | category | AP     |
| person   | 53.623 | bicycle    | 28.973 | car      | 37.887 |
...
| scissors | 9.270  | toothbrush | 14.776 |          |        |
[03/28 10:32:32 d2.engine.defaults]: Evaluation results for coco_2017_ovd_all_test in csv format:
[03/28 10:32:32 d2.evaluation.testing]: copypaste: Task: bbox
[03/28 10:32:32 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[03/28 10:32:32 d2.evaluation.testing]: copypaste: 31.6248,49.9096,33.4735,15.8060,34.5576,42.9197
```

I want to know the meaning of AP50_split_target, AP50_split_base, AP50_split_all, and the AP and AP50 in the last line.
I am not sure whether AP50_split_target corresponds to Novel_17 or Generalized_novel.
What is the difference between the Novel setting and the novel categories in the Generalized setting?

All metrics in the log are for the Generalized setting; AP50_split_target is Novel_generalized. To get the results of the Novel setting, the concept embeddings are limited to the novel categories. In other words, the detector will only predict novel categories.
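For context, the AP, AP50, AP75, APs, APm, and APl values in the table and the copypaste line are the standard COCO metrics: AP averages over IoU thresholds 0.50:0.95, AP50 and AP75 fix the IoU threshold at 0.50 and 0.75, and APs/APm/APl restrict evaluation to small, medium, and large objects. The split metrics appear to be the per-category AP50 averaged over each subset of the 48 base and 17 novel categories; note that AP50_split_all (0.4991) matches the overall AP50 (49.91). A minimal sketch of that computation, where the category lists and the `per_category_ap50` dict are illustrative placeholders rather than RegionCLIP's actual variables:

```python
import numpy as np

# Illustrative subsets: the COCO OVD benchmark uses 48 base (seen)
# and 17 novel (target/unseen) categories.
BASE_CATEGORIES = ["person", "bicycle", "car"]   # ... 48 names in total
NOVEL_CATEGORIES = ["airplane", "bus", "cat"]    # ... 17 names in total

def split_ap50(per_category_ap50):
    """Average the per-category AP50 over the base/novel/all splits."""
    base = [per_category_ap50[c] for c in BASE_CATEGORIES]
    novel = [per_category_ap50[c] for c in NOVEL_CATEGORIES]
    return {
        "AP50_split_base": float(np.mean(base)),         # seen categories
        "AP50_split_target": float(np.mean(novel)),      # unseen categories
        "AP50_split_all": float(np.mean(base + novel)),  # equals overall AP50
    }
```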

Thank you for your reply! I still have another question.
You said you use an off-the-shelf object localizer (e.g., an RPN), and that the RPN was trained on the base categories of the LVIS dataset. Is it frozen during the pretraining process? Do you train the RPN at any stage?

Yes, once the RPN is trained (e.g., on LVIS), it is frozen and will not be updated. You can consider it an independent module that focuses on localization, while RegionCLIP focuses on region recognition.
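For intuition, here is a minimal sketch of what freezing an RPN typically looks like in a detectron2-style model. It assumes the standard GeneralizedRCNN layout, where `model.proposal_generator` is the RPN; RegionCLIP's actual code may wire this differently:

```python
import torch

def freeze_rpn(model: torch.nn.Module) -> None:
    """Disable gradient updates so training never changes the RPN."""
    for param in model.proposal_generator.parameters():
        param.requires_grad = False
    # Keep BatchNorm statistics fixed as well.
    model.proposal_generator.eval()
```

With `requires_grad` disabled, an optimizer built from only the trainable parameters never touches the RPN weights.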

Hi, I have another question. As you mentioned, the RPN is frozen and will not be updated. Then you obtain region-text pairs via the CLIP model. That means you can classify a region with CLIP, and you can locate an object using CLIP. In other words, you have already solved the two problems of object detection. So what is the purpose of your visual encoder?

CLIP itself doesn't support localizing objects; that's why we use external localizers such as an RPN. Even when combining a localizer with CLIP, the region recognition performance is still unsatisfactory (see the 2nd paragraph of the Introduction of the paper). That's the main motivation for this work: can we build a region-level vision-language recognition model?
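To make the gap concrete, here is an illustrative sketch of the naive baseline the reply alludes to: crop each RPN proposal and classify it with an off-the-shelf CLIP model (via the openai/CLIP package). The prompt template and the (x0, y0, x1, y1) box format are assumptions for this example; the paper's observation is that an image-level model used this way recognizes cropped regions poorly, which motivates pretraining a region-level visual encoder:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def classify_regions(image: Image.Image, boxes, class_names):
    """Crop each proposal box and score it against class-name prompts."""
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(crops).float()
        txt_emb = model.encode_text(prompts).float()
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
        logits = 100.0 * img_emb @ txt_emb.T  # cosine similarity per crop
    return logits.softmax(dim=-1)  # per-region class probabilities
```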