sangminwoo / Explore-And-Match

Official pytorch implementation of "Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos"

LVTR-CLIP

aries-young opened this issue · comments

Hello, I'm very interested in your LVTR-CLIP model. The features extracted by CLIP only include image information, whereas the features extracted by C3D include both image information and video temporal information. So why does LVTR-CLIP outperform LVTR-C3D? Or, in other words, does the cross-modal encoder have the capability to model the temporal relations both between frame and frame, and between frame and text?

@aries-young, we appreciate your interest in our work.

Since we employed a pre-trained CLIP, the image-text features extracted from LVTR-CLIP are already somewhat aligned.
Then, those features are concatenated and given as an input sequence to the cross-modal encoder.
The cross-modal encoder models pair-wise relations between every input token, which are 1) frame-frame (temporal), 2) frame-text (cross-modality), and 3) text-text (temporal).
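Here is a minimal sketch of that idea, not the repository's actual code: the shapes, layer sizes, and module names are illustrative assumptions. It only shows how concatenating frame and text tokens lets a standard Transformer encoder attend over all pair-wise relations in one pass.

```python
import torch
import torch.nn as nn

num_frames, num_words, d_model = 64, 16, 256  # hypothetical sizes

# Hypothetical CLIP-extracted features, already roughly aligned in a shared space.
frame_feats = torch.randn(1, num_frames, d_model)  # (batch, frames, dim)
text_feats = torch.randn(1, num_words, d_model)    # (batch, words, dim)

# Concatenate frame and text tokens into a single input sequence.
tokens = torch.cat([frame_feats, text_feats], dim=1)  # (1, frames + words, dim)

# Self-attention over the concatenated sequence covers all pair-wise relations:
# frame-frame, frame-text, and text-text.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
cross_modal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

out = cross_modal_encoder(tokens)  # (1, frames + words, dim)
```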

Thank you very much. This is a very encouraging discovery.