dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Visual Grounding

prashan opened this issue

Is this model useful for visual grounding? If so, how should I modify it?

Hi @prashan

Since ViLT does not use regional features, it is not directly applicable to visual grounding tasks.
However, you can use a method like Grad-CAM to perform weakly-supervised visual grounding.
For example, ALBEF performed weakly-supervised visual grounding on RefCOCO+ with Grad-CAM, even though, like ViLT, it does not use regional features.
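For reference, here is a minimal sketch of that Grad-CAM idea applied to ViLT's own attention maps, using the Hugging Face `dandelin/vilt-b32-finetuned-coco` image-text matching checkpoint. This is not the official ALBEF recipe: the layer choice, the token-layout assumptions, and the patch-grid reshape are illustrative and should be verified against your `transformers` version.

```python
# Sketch: Grad-CAM-style weakly-supervised grounding with ViLT.
# Assumptions (not from the paper): last-layer attention, mean over heads,
# and the HF sequence layout [text tokens][image cls token][image patches].
import requests
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForImageAndTextRetrieval

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat lying on a couch"

inputs = processor(image, text, return_tensors="pt")
outputs = model(**inputs, output_attentions=True)

# ViLT is single-stream, so text tokens attend to image patches inside
# the same self-attention map. Pick a late layer (illustrative choice).
attn = outputs.attentions[-1]          # (batch, heads, seq, seq)
attn.retain_grad()                     # keep the gradient on this non-leaf tensor
outputs.logits[0, 0].backward()        # backprop the image-text matching score

# Grad-CAM: weight the attention by its positive gradient, average over heads.
cam = (attn * attn.grad.clamp(min=0)).mean(dim=1)[0]   # (seq, seq)

# Rows = text tokens, columns = image patch tokens (skip the image cls token).
num_text = inputs.input_ids.shape[1]
patch_cam = cam[:num_text, num_text + 1:].mean(dim=0)  # per-patch relevance

# Reshape to the 32x32-pixel patch grid of the (resized) input image.
_, _, h, w = inputs.pixel_values.shape
heatmap = patch_cam.reshape(h // 32, w // 32)
print(heatmap)
```

Upsampling `heatmap` to the image size and thresholding it gives a rough localization of the phrase; for a RefCOCO+-style evaluation you would convert it to a box, as ALBEF does.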