microsoft / XPretrain

Multi-modality pre-training

About the zero-shot performance

LiuRicky opened this issue · comments

Thanks for your interesting work.

I am curious about the zero-shot performance of your CLIP-ViP on MSR-VTT.

I find that models pre-trained on video-text pairs (e.g. VideoCLIP, SimVLP) do not perform satisfactorily compared with their image-language counterparts (e.g. CLIP, BLIP) on zero-shot transfer to video retrieval tasks. What is the zero-shot performance of CLIP-ViP?
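
(For context, a common way to run such zero-shot transfer is to treat a video as a bag of sampled frames and mean-pool the per-frame CLIP embeddings into a single video embedding. Below is a minimal sketch of that generic frame-pooling baseline using the Hugging Face `transformers` CLIP API; the helper names `encode_video`/`encode_text` are hypothetical, and this is not CLIP-ViP's own architecture.)

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Generic frame-pooling baseline (not CLIP-ViP): encode each sampled frame
# with CLIP's image tower and mean-pool into a single video embedding.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_video(frames):
    """frames: list of PIL images sampled uniformly from one video (hypothetical helper)."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize per frame
    video = feats.mean(dim=0)                         # mean-pool over frames
    return video / video.norm()                       # re-normalize pooled vector

@torch.no_grad()
def encode_text(caption):
    """caption: a single query string (hypothetical helper)."""
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).squeeze(0)

# Retrieval score is the cosine similarity of the normalized embeddings:
# score = encode_text("a man is cooking") @ encode_video(frames)
```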

What do you think causes this phenomenon?

I have the ZS result of the 1-epoch post-pretrained CLIP-ViP (B/32): R@1: 31.5, R@5: 53.9, R@10: 63.4.
The result is close to CLIP's. One reason is that MSR-VTT captions are very similar in form to image captions: both are short descriptive text. Another reason is that a wide range of video-language benchmarks do not heavily rely on understanding temporality [1]. As a result, the zero-shot performance of an image-language model is already strong. However, our results show that post-pretraining improves the fine-tuning results by a large margin, benefiting from the representations learned from video-language data.
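
(For readers reproducing these numbers: on the standard MSR-VTT 1k-A split, text query i is paired with video i, and R@K is the percentage of queries whose ground-truth video ranks in the top K of the text-video similarity matrix. A minimal sketch, assuming that square text-by-video setup:)

```python
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (num_texts, num_videos) similarity matrix, ground truth on the
    diagonal (text i matches video i), as in the MSR-VTT 1k-A protocol."""
    num_texts = sim.size(0)
    # For each text query, rank position (0 = best) of its ground-truth video.
    order = sim.argsort(dim=1, descending=True)  # (num_texts, num_videos)
    ranks = (order == torch.arange(num_texts).unsqueeze(1)).float().argmax(dim=1)
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

if __name__ == "__main__":
    torch.manual_seed(0)
    sim = torch.randn(1000, 1000)  # stand-in for a 1k-A text-video similarity matrix
    print(recall_at_k(sim))        # random baseline: R@1 ~ 0.1, R@5 ~ 0.5, R@10 ~ 1.0
```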

[1] Buch, Shyamal, et al. "Revisiting the 'Video' in Video-Language Understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Could you provide a possible ZS result for CLIP-ViP (B/16)? I would be very grateful.