
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (https://arxiv.org/abs/2212.03191)

Home Page: https://opengvlab.shlab.org.cn/

InternVideo

This repo gives the official implementation of 'InternVideo: General Video Foundation Models via Generative and Discriminative Learning', by Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Hongjie Zhang, Yali Wang, Limin Wang, and Yu Qiao.

Updates

  • Mar 8, 2023: All pretrained foundation model weights are released. Access them from here.
  • Feb 19, 2023: Some pretrained foundation model weights (-L) are released.
  • Feb 5, 2023: The code & model of multimodal learning are released.
  • Jan 18, 2023: The code of vision-language navigation is released.
  • Jan 16, 2023: The code of video question answering, zero-shot action recognition, and zero-shot multiple choice is released.
  • Jan 1, 2023: The code & model of spatio-temporal action localization are released.
  • Dec 27, 2022: The code & model of partial pretraining (VideoMAE) and downstream applications (video-text retrieval, temporal action localization, open-set action recognition, and ego4d related tasks) are released.
  • Dec 6, 2022: The technical report of InternVideo is released.
  • Sep 2, 2022: Press releases (official | 163 news | qq news).

Introduction

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets across a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively.
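
The learnable coordination described above can be pictured with a small sketch. The snippet below is a hypothetical illustration only, not code from this repository: the module name, feature dimensions, and the scalar gate are assumptions made to show how features from a masked-video-modeling (generative) encoder and a video-language contrastive (discriminative) encoder could be combined with a learnable weight.

# A minimal sketch, assuming PyTorch; names, dimensions, and the sigmoid gate
# are illustrative assumptions, not the repository's actual implementation.
import torch
import torch.nn as nn

class CrossRepresentationFusion(nn.Module):
    """Combine masked-video-modeling (generative) features and video-language
    contrastive (discriminative) features with a single learnable gate."""

    def __init__(self, dim_mvm: int, dim_vlc: int, dim_out: int):
        super().__init__()
        self.proj_mvm = nn.Linear(dim_mvm, dim_out)  # project generative features
        self.proj_vlc = nn.Linear(dim_vlc, dim_out)  # project discriminative features
        # Gate initialized to 0 so sigmoid(0) = 0.5 gives both streams equal weight.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, feat_mvm: torch.Tensor, feat_vlc: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.gate)
        return a * self.proj_mvm(feat_mvm) + (1 - a) * self.proj_vlc(feat_vlc)

# Dummy usage: a batch of 2 clip-level feature vectors from each encoder.
fusion = CrossRepresentationFusion(dim_mvm=1024, dim_vlc=768, dim_out=1024)
fused = fusion(torch.randn(2, 1024), torch.randn(2, 768))
print(fused.shape)  # torch.Size([2, 1024])

The actual coordination in InternVideo is richer than a single scalar gate; see the paper for the real design.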

Code & model

Performance

Model Zoo

To access the pretrained foundation model weights and the task-specific ones, please fill out the form (or scan the QR code below), and you will then receive the download link.

[QR code for the model weight survey]
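
After downloading, a checkpoint can be quickly inspected with a few lines of PyTorch. This is a minimal sketch assuming the released weights are standard PyTorch checkpoint files; the filename and key layout are assumptions, so follow the task-specific instructions in this repo for the actual loading code.

# Hypothetical inspection of a downloaded checkpoint (filename is an assumption).
import torch

ckpt = torch.load("internvideo_pretrained.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some releases nest the weights under a key
print(f"{len(state_dict)} parameter tensors loaded")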

Citation

If this work is helpful for your research, please consider citing InternVideo.

@article{wang2022internvideo,
  title={InternVideo: General Video Foundation Models via Generative and Discriminative Learning},
  author={Wang, Yi and Li, Kunchang and Li, Yizhuo and He, Yinan and Huang, Bingkun and Zhao, Zhiyu and Zhang, Hongjie and Xu, Jilan and Liu, Yi and Wang, Zun and Xing, Sen and Chen, Guo and Pan, Junting and Yu, Jiashuo and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2212.03191},
  year={2022}
}


License

This project is released under the Apache License 2.0.

