microsoft / RegionCLIP

[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"


CLIP performs poorly when classifying cropped regions

nhw649 opened this issue

Hello, I'm very interested in your work, but I've run into a problem I'd like to ask you about. I tried cropping regions out of the original image (using GT boxes), feeding them into CLIP, and computing similarities against COCO's 80 categories, but the results were disappointing. For example, for a person eating a hot dog, CLIP classifies him as a hot dog (with very high probability), while the actual label is person, and this kind of misclassification happens very often. Have you run into this problem? If so, is there a good way to fix it?
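For reference, a minimal sketch of the crop-and-encode baseline described above, assuming the openai/clip package; the image path, the box coordinates, and the shortened class list are placeholders:

```python
# Minimal crop-and-encode sketch (RCNN-style): crop a GT box from the image,
# encode it with CLIP, and score it against COCO class prompts.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

coco_classes = ["person", "bicycle", "car", "hot dog"]  # placeholder; 80 classes in full
text = clip.tokenize([f"a photo of a {c}" for c in coco_classes]).to(device)

image = Image.open("example.jpg")   # placeholder image
x1, y1, x2, y2 = 120, 40, 360, 420  # hypothetical GT box
region = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(region)
    txt_feat = model.encode_text(text)
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)
# The tightly cropped "person eating a hot dog" region may well score
# highest for "hot dog", as described above.
print(coco_classes[probs.argmax().item()])
```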

@nhw649 Thanks for your interest. Your finding is exactly what we demonstrate in Figure 1 of the paper: cropping image regions and feeding them to CLIP yields unsatisfactory performance (RCNN style). That's one of the major motivations of RegionCLIP.

If I initialize the CLIP encoder directly with the weights you provide, will that work?

@nhw649 You can refer to region feature extraction. This script extracts features for image regions. Note that it operates at the feature level (the context information is already encoded into the region features), rather than at the raw image level (which does not include the context information).
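To make the feature-level vs. image-level distinction concrete, here is a hedged sketch using torchvision's roi_align as a stand-in for the pooling the script performs; the feature map, stride, and box below are hypothetical:

```python
# Encode first, then pool: regions are cut from the backbone feature map,
# so each pooled vector already carries context from the whole image.
import torch
from torchvision.ops import roi_align

feats = torch.randn(1, 1024, 32, 32)                  # hypothetical backbone feature map
boxes = [torch.tensor([[64.0, 64.0, 256.0, 256.0]])]  # box in input-image coordinates
# spatial_scale maps image coordinates onto the feature map (stride 16 -> 1/16)
region = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
embedding = region.mean(dim=(2, 3))                   # (num_boxes, 1024) region features
```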

OK, I'll give it a try.

So for image region classification with CLIP given GT boxes, which one is better: crop-and-encode, or encode-and-RoIAlign? Is there any direct evidence? We can see from Fig. 1 that RCNN-style classification leads to unsatisfactory results (19.1% accuracy on LVIS), and from Tab. 4 that the mAP of CLIP on LVIS (using RoIAlign and GT boxes) is 42.2. But I don't think these two metrics (accuracy vs. mAP) measure the same thing; is there any further ablation on this? Thank you in advance for your time!

I think the effect is almost the same; you can try it.

@lcxrocks @nhw649 The performance of "crop and encode" highly depends on how you crop the image regions. For example, cropping a larger box than the object region and ensembling multiple enlarged boxes can improve performance significantly (ViLD also reports this observation; see the sketch below). RoIAlign, in comparison, is designed for learning region representations and encourages the network to encode local features and context information into the pooled feature area. In my experiments, the two had similar performance.
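As an illustration of that enlarged-crop ensembling (an approximation, not ViLD's exact recipe), one can average the CLIP embeddings of a region cropped at several scales:

```python
# Hedged sketch: ensemble CLIP embeddings of a region cropped at multiple
# scales; the scales (1.0x, 1.5x) are illustrative choices, not ViLD's exact ones.
import torch

def enlarge(box, scale, img_w, img_h):
    """Scale a box (x1, y1, x2, y2) around its center, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) * scale / 2, (y2 - y1) * scale / 2
    return (max(0, cx - hw), max(0, cy - hh), min(img_w, cx + hw), min(img_h, cy + hh))

def region_embedding(model, preprocess, image, box, scales=(1.0, 1.5), device="cpu"):
    """Average the L2-normalized CLIP embeddings of a region cropped at several scales."""
    crops = [preprocess(image.crop(enlarge(box, s, *image.size))) for s in scales]
    with torch.no_grad():
        feats = model.encode_image(torch.stack(crops).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)
```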

Hi, I used the extract-region-features script to extract region features and want to use them in a downstream task. However, the region features I get from the example image, of shape (100, 1024), are almost identical in every dimension. What causes this? Do the 100 features come from the highest-scoring region recognition results? And why are the 100 features all the same? Thanks for your time.

You might want to print the bounding-box coordinates. In this case, all 100 regions might be identical, which happens when the localizer is not set up properly. Please double-check your RPN and NMS parameters.
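A quick way to verify this is to compare the saved boxes against the saved features; below is a minimal diagnostic sketch, where the output file name and the "boxes"/"feats" keys are assumptions about what your run writes out:

```python
# Diagnostic sketch: count distinct boxes vs. distinct feature rows.
# The file name and dict keys are assumptions; adapt them to your output.
import torch

saved = torch.load("./output/region_feats/sample.pth")
boxes, feats = saved["boxes"], saved["feats"]  # e.g. (100, 4) and (100, 1024)
print("unique boxes:", torch.unique(boxes, dim=0).shape[0])
print("unique feats:", torch.unique(feats.round(decimals=4), dim=0).shape[0])
# If the boxes are all identical, inspect the RPN/NMS settings; if the boxes
# differ but the features collapse to one row, inspect the pooling stage.
```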

I got 100 different box coordinates, but the features are still the same. I used the default config in extract_feature_extraction.sh and got this result:

[screenshot of the extracted features omitted]

I used the following script:

```
# RN50, LVIS 1203 concepts
python3 ./tools/extract_region_features.py \
    --config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_zsinf.yaml \
    MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
    MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb.pth \
    MODEL.CLIP.CROP_REGION_TYPE RPN \
    MODEL.CLIP.MULTIPLY_RPN_SCORE True \
    MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
    MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_lvis_866.pth \
    INPUT_DIR ./datasets/custom_images \
    OUTPUT_DIR ./output/region_feats \
    TEST.DETECTIONS_PER_IMAGE 100
```

I can't figure out where the problem is.