ZhenZHAO / awesome-video-moment-retrieval

Paper list on Video Moment Retrieval (VMR), or Natural Language Video Localization (NLVL), or Temporal Sentence Grounding in Videos (TSGV)

Awesome-video-moment-retrieval

A personal paper list on Video Moment Retrieval (VMR), also referred to as Natural Language Video Localization (NLVL), Temporal Sentence Grounding in Videos (TSGV), or Natural Language Query (NLQ).

  • Keywords: moment retrieval, temporal grounding, video/language/moment grounding/localization, sentence grounding, etc.

1 Papers List

Summarized by,

2 Quick references

Survey

Datasets

| Dataset | Video Source | Domain |
| --- | --- | --- |
| TACoS | Kitchen | Cooking |
| Charades-STA | Homes | Indoor Activity |
| ActivityNet Captions | Youtube | Open |
| DiDeMo | Flickr | Open |
| MAD, CVPR22 | Movie | Open |

Referring to this paper, more info,

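| Dataset | # Videos | # VL pairs (train) | # VL pairs (val) | # VL pairs (test) | Vocab Size |
| --- | --- | --- | --- | --- | --- |
| ActivityNet Captions | 14926 | 37421 | 17505 | 17031 | 15406 |
| TACoS | 127 | 10146 | 4589 | 4083 | 2255 |
| DiDeMo | 10642 | 33005 | 4180 | 4021 | 7523 |
| Charades-STA | 6670 | 12404 | - | 3720 | 1289 |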
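As a quick way to read the table, the sketch below sums the video-language pairs per dataset and divides by the number of videos. The numbers are copied from the table; reading the columns as videos, train/val/test pairs, and vocabulary size is an assumption, and the names `DATASETS`, `total_pairs`, and `pairs_per_video` are made up for this illustration.

```python
# Split sizes copied from the table above. "-" (no val split) is stored as None.
# Assumption: columns are #videos, then train/val/test video-language pairs.
DATASETS = {
    "ActivityNet Captions": {"videos": 14926, "train": 37421, "val": 17505, "test": 17031},
    "TACoS": {"videos": 127, "train": 10146, "val": 4589, "test": 4083},
    "DiDeMo": {"videos": 10642, "train": 33005, "val": 4180, "test": 4021},
    "Charades-STA": {"videos": 6670, "train": 12404, "val": None, "test": 3720},
}


def total_pairs(stats):
    """Sum the video-language pairs over all available splits."""
    return sum(stats[split] or 0 for split in ("train", "val", "test"))


def pairs_per_video(stats):
    """Average number of video-language pairs per video."""
    return total_pairs(stats) / stats["videos"]


if __name__ == "__main__":
    for name, stats in DATASETS.items():
        print(f"{name}: {total_pairs(stats)} pairs, "
              f"{pairs_per_video(stats):.1f} per video")
```

Under this reading, TACoS has by far the fewest videos but many more annotated pairs per video than the other datasets.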

Normally, the top three are the most widely used. Features are then extracted as follows:

Visual: 1) by 3D ConvNets, e.g. C3D, I3D; 2) by 2D ConvNets, e.g. VGG

Text: 1) pretrained word embeddings, e.g. GloVe; 2) pretrained language models, e.g. BERT

New: for MAD, both visual and text features are extracted by CLIP.

Extracted features can be downloaded from

Process

[Figure: vmr-pipeline]

Performance Comparisons

[Figure: vmr-pipeline]
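This list does not state the metric behind the comparisons. Results in this area are commonly reported as "R@n, IoU=m": the fraction of queries for which at least one of the top-n predicted moments overlaps the ground-truth moment with temporal IoU of at least m. A minimal sketch under that assumption follows; the function names and the `(start, end)` tuple format are illustrative and not taken from any particular codebase.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds, gts, n=1, iou_threshold=0.5):
    """Fraction of queries with a top-n prediction whose IoU >= iou_threshold.

    ranked_preds: one list of (start, end) predictions per query, best first.
    gts: one ground-truth (start, end) segment per query.
    """
    if not gts:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:n]):
            hits += 1
    return hits / len(gts)


if __name__ == "__main__":
    preds = [[(0.0, 10.0)], [(0.0, 2.0)]]
    gts = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))
```

In the usage example the first query is matched exactly and the second prediction does not overlap its ground truth, so one of two queries counts as a hit.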

3 Resources
