mmaaz60 / mvits_for_class_agnostic_od

[ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer".

aligning image-text pairs

nikky4D opened this issue

I have a question on the paper: you train on aligned image-text pairs. How do you create this alignment? Is it done the same way as in MDETR? I could not fully understand this from the paper, especially for non-natural images such as satellite or medical images.

Hi @nikky4D,

Thank you for your interest in our work. Yes, the image-text aligned training is performed in the same way as in MDETR. However, please note that the models have not been trained on the out-of-domain datasets (such as DOTA, KITTI, Clipart, Comic, and Watercolor) on which class-agnostic object detection is evaluated (see Table 2 in the paper).

Thank you for the response. So, to make sure I understand: you trained on the datasets used in MDETR (Flickr, VQA, COCO), and evaluated on DOTA, KITTI, etc. with text queries like "all objects" and "all small objects"?

Yes, you are right. The boxes detected for the different queries are then combined for the evaluation.
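
For anyone else reading this thread, here is a minimal sketch of that combination step. It is not the authors' exact evaluation code: `run_inference(image, query)` is a hypothetical stand-in for the repo's model call (assumed to return `xyxy` boxes and per-box scores), the query list below only includes the two prompts mentioned above (the full list is in Appendix A.2), and suppressing near-duplicates with NMS after pooling is an assumption rather than something confirmed in this thread.

```python
# Hedged sketch: merge class-agnostic detections from multiple text queries.
# `run_inference` is a hypothetical wrapper around the model, assumed to
# return (boxes [N, 4] in xyxy format, scores [N]) for one image and query.

import torch
from torchvision.ops import nms

# Two prompts mentioned in this thread; the full set is listed in Appendix A.2.
TEXT_QUERIES = ["all objects", "all small objects"]

def combine_query_detections(image, run_inference, iou_thresh=0.5):
    """Run the detector once per text query and pool all resulting boxes.

    Because the task is class-agnostic, boxes from different queries are
    simply concatenated; NMS (an assumption here) then removes
    near-duplicate detections before evaluation.
    """
    all_boxes, all_scores = [], []
    for query in TEXT_QUERIES:
        boxes, scores = run_inference(image, query)  # hypothetical signature
        all_boxes.append(boxes)
        all_scores.append(scores)

    boxes = torch.cat(all_boxes, dim=0)
    scores = torch.cat(all_scores, dim=0)

    keep = nms(boxes, scores, iou_thresh)  # class-agnostic NMS over the pool
    return boxes[keep], scores[keep]
```

The key point is that the merge is class-agnostic: no labels are compared, so boxes from different prompts can be pooled directly and scored against the ground truth as a single set.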

Thank you. And am I right that the queries used for the out-of-domain datasets are only those listed in Appendix A.2?

Yes, your understanding is correct.

Thank you again for the response, the code and the paper.