sangminwoo / Explore-And-Match

Official pytorch implementation of "Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos"

LVTR-CLIP

aries-young opened this issue · comments

Hello, I'm very interested in your LVTR-CLIP model. The features extracted by CLIP only include image information, whereas the features extracted by C3D include both image information and video temporal information. So why does LVTR-CLIP outperform LVTR-C3D? Or, in other words, does the cross-modal encoder have the capability to model the temporal relations both between frame and frame, and between frame and text?

@aries-young, we appreciate your interest in our work.

Since we employed a pre-trained CLIP, the image-text features extracted from LVTR-CLIP are already somewhat aligned.
Then, those features are concatenated and given as an input sequence to the cross-modal encoder.
The cross-modal encoder models pair-wise relations between every input token, which are 1) frame-frame (temporal), 2) frame-text (cross-modality), and 3) text-text (temporal).
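Here is a minimal sketch of that idea, not the repository's actual code: the shapes, layer sizes, and module names are illustrative assumptions. It only shows how concatenating frame and text tokens lets a standard Transformer encoder attend over all pair-wise relations in one pass.

```python
import torch
import torch.nn as nn

num_frames, num_words, d_model = 64, 16, 256  # hypothetical sizes

# Hypothetical CLIP-extracted features, already roughly aligned in a shared space.
frame_feats = torch.randn(1, num_frames, d_model)  # (batch, frames, dim)
text_feats = torch.randn(1, num_words, d_model)    # (batch, words, dim)

# Concatenate frame and text tokens into a single input sequence.
tokens = torch.cat([frame_feats, text_feats], dim=1)  # (1, frames + words, dim)

# Self-attention over the concatenated sequence covers all pair-wise relations:
# frame-frame, frame-text, and text-text.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
cross_modal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

out = cross_modal_encoder(tokens)  # (1, frames + words, dim)
```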

Thank you very much. This is a very encouraging discovery.