moment-retrieval natural-language-queries temporal-grounding video-grounding video-moment-retrieval

Awesome-video-moment-retrieval

A personal paper list on Video Moment Retrieval (VMR), also known as Natural Language Video Localization (NLVL), Temporal Sentence Grounding in Videos (TSGV), or Natural Language Query (NLQ).

Keywords: moment retrieval, temporal grounding, video/language/moment grounding/localization, sentence grounding, etc.

1 Papers List

Summarized by,

2 Quick references

Survey

A Survey of Video Moment Retrieval Research (视频片段检索研究综述), Journal of Software (软件学报), 2020
A Survey on Temporal Sentence Grounding in Videos, arXiv, 2021
The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions, arXiv, 2022

Datasets

Dataset	Video Source	Domain
TACoS	Kitchen	Cooking
Charades-STA	Homes	Indoor Activity
ActivityNet Captions	YouTube	Open
DiDeMo	Flickr	Open
MAD (CVPR 2022)	Movie	Open

More statistics, referring to this paper:

Dataset	# Videos	# VL pairs (train)	# VL pairs (val)	# VL pairs (test)	Vocab size
ActivityNet Captions	14926	37421	17505	17031	15406
TACoS	127	10146	4589	4083	2255
DiDeMo	10642	33005	4180	4021	7523
Charades-STA	6670	12404	-	3720	1289
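The split sizes in the table above can be tallied to compare how densely each dataset is annotated. The following is a minimal sketch, not part of any official toolkit: the numbers are copied from the table, the names `STATS`, `total_pairs`, and `pairs_per_video` are hypothetical, and it assumes the video count covers all splits combined.

```python
# Split sizes copied from the statistics table; None marks the missing val split.
# Assumption: "videos" is the total number of videos across all splits.
STATS = {
    "ActivityNet Captions": {"videos": 14926, "train": 37421, "val": 17505, "test": 17031},
    "TACoS": {"videos": 127, "train": 10146, "val": 4589, "test": 4083},
    "DiDeMo": {"videos": 10642, "train": 33005, "val": 4180, "test": 4021},
    "Charades-STA": {"videos": 6670, "train": 12404, "val": None, "test": 3720},
}


def total_pairs(row):
    """Sum the video-language pairs over all available splits."""
    return sum(row[split] or 0 for split in ("train", "val", "test"))


def pairs_per_video(row):
    """Average number of video-language pairs per video."""
    return total_pairs(row) / row["videos"]


if __name__ == "__main__":
    for name, row in STATS.items():
        print(f"{name}: {total_pairs(row)} pairs, {pairs_per_video(row):.1f} per video")
```

Under that assumption, TACoS stands out with far more queries per video than the other datasets, which matches its small number of long cooking videos.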

Of these datasets, the top three are normally the most widely used. Features are usually pre-extracted as follows:

Visual: 1) 3D ConvNets, e.g. C3D, I3D; 2) 2D ConvNets, e.g. VGG

Text: 1) pretrained word embeddings, e.g. GloVe; 2) pretrained language models, e.g. BERT

New: for MAD, both visual and text features are extracted with CLIP.

Extracted features can be downloaded from:

https://github.com/microsoft/VideoX/tree/master/MS-2D-TAN

Process

Performance Comparisons

3 Resources
