This repo gives the official implementation of 'InternVideo: General Video Foundation Models via Generative and Discriminative Learning' by Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Hongjie Zhang, Yali Wang, Limin Wang, and Yu Qiao.
- Jan 1, 2023: The code & model of spatio-temporal action localization are released.
- Dec 27, 2022: The code & model of partial pretraining (VideoMAE) and downstream applications (video-text retrieval, temporal action localization, open-set action recognition, and Ego4D-related tasks) are released.
- Dec 6, 2022: The technical report of InternVideo is released.
- Sep 2, 2022: Press releases (official | 163 news | qq news).
- Video foundation model pretraining.
- Video masked modeling.
- Video-language contrastive learning.
- Supervised training of ViT (from video masked modeling) and UniformerV2 (from multimodal learning).
- Model interaction.
- Downstream tasks.
- Action recognition.
- Temporal action localization.
- Spatio-temporal action localization.
- Video-text retrieval.
- Video question answering.
- Visual-language navigation.
- Open-set action recognition.
- Zero-shot action recognition.
- Zero-shot Multiple Choice.
- Ego4D related tasks.
- Pretrained foundation model weights.
- Demos for training usages and evaluations.
If this work is helpful for your research, please consider citing InternVideo.
```bibtex
@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}
```