dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Visual Grounding

prashan opened this issue

Is this model useful for visual grounding? If so, how should I modify it?

Hi @prashan

Since ViLT does not use regional features, it is not directly applicable to visual grounding tasks.
However, you can use a method like Grad-CAM to perform weakly-supervised visual grounding.
For example, ALBEF performed weakly-supervised visual grounding on RefCOCO+ with Grad-CAM, even though, like ViLT, it does not use regional features.
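For reference, here is a minimal sketch of that Grad-CAM idea applied to ViLT's own attention maps, using the Hugging Face `dandelin/vilt-b32-finetuned-coco` image-text matching checkpoint. This is not the official ALBEF recipe: the layer choice, the token-layout assumptions, and the patch-grid reshape are illustrative and should be verified against your `transformers` version.

```python
# Sketch: Grad-CAM-style weakly-supervised grounding with ViLT.
# Assumptions (not from the paper): last-layer attention, mean over heads,
# and the HF sequence layout [text tokens][image cls token][image patches].
import requests
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForImageAndTextRetrieval

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat lying on a couch"

inputs = processor(image, text, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# ViLT is single-stream, so text tokens attend to image patches inside
# the same self-attention map. Pick a late layer (illustrative choice).
attn = outputs.attentions[-1]          # (batch, heads, seq, seq)
attn.retain_grad()                     # keep the gradient on this non-leaf tensor
outputs.logits[0, 0].backward()        # backprop the image-text matching score

# Grad-CAM: weight the attention by its positive gradient, average over heads.
cam = (attn * attn.grad.clamp(min=0)).mean(dim=1)[0]   # (seq, seq)

# Rows = text tokens, columns = image patch tokens (skip the image cls token).
num_text = inputs.input_ids.shape[1]
patch_cam = cam[:num_text, num_text + 1:].mean(dim=0)  # per-patch relevance

# Reshape to the 32x32-pixel patch grid of the (resized) input image.
_, _, h, w = inputs.pixel_values.shape
heatmap = patch_cam.reshape(h // 32, w // 32)
print(heatmap)
```

Upsampling `heatmap` to the image size and thresholding it gives a rough localization of the phrase; for a RefCOCO+-style evaluation you would convert it to a box, as ALBEF does.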