Repositories under the video-text-retrieval topic:
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.
An official implementation for " UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across QA, captioning, and retrieval tasks.
[CVPR'2023 Highlight & TPAMI] Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Story-Based Retrieval with Contextual Embeddings. Largest freely available movie video dataset. [ACCV'20]
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
An official implementation for "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval"
Research Code for Multimodal-Cognition Team in Ant Group
CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021
[Pattern Recognition 2025] Cross-Modal Adapter for Vision-Language Retrieval
[EMNLP 2023] TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Extracts text from a video and saves it as notes in a .docx file.
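Several of the repositories above (CLIP4Clip, X-CLIP, Cap4Video) rank videos against a text query by comparing a text embedding to aggregated frame embeddings. Below is a minimal sketch of that retrieval step, assuming the embeddings have already been produced by some encoder (the arrays here are random stand-ins, not real CLIP features); mean pooling of frame features is the simplest aggregation, as studied in CLIP4Clip.

```python
import numpy as np

def rank_videos(text_emb, video_frame_embs):
    """Rank candidate videos for one text query.

    text_emb:          (d,) L2-normalized text embedding.
    video_frame_embs:  list of (n_frames, d) frame embedding arrays.
    Returns (ranking, scores): indices sorted best-first, and the
    cosine score of each video.
    """
    scores = []
    for frames in video_frame_embs:
        video_emb = frames.mean(axis=0)              # mean-pool frames into one vector
        video_emb /= np.linalg.norm(video_emb)       # normalize so dot product = cosine
        scores.append(float(text_emb @ video_emb))
    ranking = np.argsort(scores)[::-1]               # highest similarity first
    return ranking, scores

# Toy usage with random embeddings (placeholders for real encoder output).
rng = np.random.default_rng(0)
text = rng.normal(size=512)
text /= np.linalg.norm(text)
videos = [rng.normal(size=(8, 512)) for _ in range(3)]
ranking, scores = rank_videos(text, videos)
```

Finer-grained approaches (e.g. X-CLIP's multi-grained contrast) replace the mean pooling with per-frame and per-word similarity matrices, but the ranking step stays the same.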