Visual Grounding
prashan opened this issue · comments
prashan commented
Is this model useful for visual grounding? If so, how should I change it?
Wonjae Kim commented
Hi @prashan
Since we do not have regional features, ViLT is not directly applicable for visual grounding tasks.
However, you can use a method such as Grad-CAM to perform visual grounding.
For example, ALBEF performed weakly-supervised visual grounding on RefCOCO+ with Grad-CAM even though, like ViLT, it does not use regional features.
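To make the Grad-CAM suggestion concrete, here is a minimal, framework-agnostic sketch of the generic Grad-CAM computation applied to image-patch tokens. It is an illustration under assumptions, not ViLT's or ALBEF's actual code: the function names `grad_cam_patch_map` and `heatmap_to_box` are hypothetical, and it assumes you have already extracted the patch-token activations at some layer and the gradients of an image-text score with respect to them (for example via hooks in your framework). The variant ALBEF used may differ in detail.

```python
from typing import List, Optional, Sequence, Tuple


def grad_cam_patch_map(
    activations: Sequence[Sequence[float]],
    gradients: Sequence[Sequence[float]],
    grid_hw: Tuple[int, int],
) -> List[List[float]]:
    """Generic Grad-CAM over image-patch tokens (hypothetical helper).

    activations[i][k]: feature k of patch token i at the chosen layer.
    gradients[i][k]:   d(score) / d(activations[i][k]).
    grid_hw:           (rows, cols) of the patch grid; rows * cols == num tokens.
    """
    rows, cols = grid_hw
    num_tokens = len(activations)
    if num_tokens != rows * cols or len(gradients) != num_tokens:
        raise ValueError("token count must match the patch grid and the gradients")
    num_channels = len(activations[0])

    # Channel weights: gradient averaged over all patch positions.
    weights = [
        sum(gradients[i][k] for i in range(num_tokens)) / num_tokens
        for k in range(num_channels)
    ]

    # Weighted sum of channels per patch, followed by ReLU.
    cam = [
        max(0.0, sum(w * a for w, a in zip(weights, activations[i])))
        for i in range(num_tokens)
    ]

    # Scale to [0, 1] so the map can be thresholded or overlaid on the image.
    peak = max(cam)
    if peak > 0:
        cam = [v / peak for v in cam]

    # Reshape the flat token list into the 2D patch grid.
    return [cam[r * cols:(r + 1) * cols] for r in range(rows)]


def heatmap_to_box(
    heatmap: Sequence[Sequence[float]], threshold: float = 0.5
) -> Optional[Tuple[int, int, int, int]]:
    """Tightest (row_min, col_min, row_max, col_max) box around cells >= threshold."""
    cells = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not cells:
        return None
    row_ids = [r for r, _ in cells]
    col_ids = [c for _, c in cells]
    return (min(row_ids), min(col_ids), max(row_ids), max(col_ids))
```

The resulting box is in patch-grid coordinates; multiplying by the patch size would give pixel coordinates. For a benchmark such as RefCOCO+, the heatmap would more likely be used to rank candidate boxes than to produce a box by thresholding, so treat `heatmap_to_box` as the simplest possible readout.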