Group Contextualization for Video Recognition
This is an official implementation of the paper "Group Contextualization for Video Recognition", which has been accepted by CVPR 2022. Paper link
Updates
March 11, 2022
- Released this V1 version (the version used in the paper) to the public. The complete code and models will be released soon.
Content
Prerequisites
The code is built with the following libraries:
- PyTorch >= 1.7, torchvision
- tensorboardx
For video data pre-processing, you may need ffmpeg.
Data Preparation
For GC-TSN, GC-GST and GC-TSM, videos first need to be extracted into frames for all datasets (Kinetics-400, Something-Something V1 and V2, Diving48 and EGTEA Gaze+), following the TSN repo. For GC-TDN, data processing instead follows the backbone TDN work: the short edge of each video is resized to 320px, and the mp4 files are decoded directly during training/evaluation.
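Below is a minimal sketch of TSN-style frame extraction with ffmpeg, wrapped in Python. The output naming (`img_00001.jpg`, one folder per video), the 256px short side and the paths are assumptions for illustration; check the TSN/TSM data tools for the exact layout each codebase expects.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path, out_dir, short_side=256):
    """Decode one video into JPEG frames with ffmpeg (assumed to be on PATH)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Resize so the shorter edge equals `short_side` while keeping the aspect ratio.
    vf = f"scale='if(gt(iw,ih),-2,{short_side})':'if(gt(iw,ih),{short_side},-2)'"
    cmd = [
        "ffmpeg", "-i", str(video_path),
        "-vf", vf, "-q:v", "2",              # high-quality JPEG output
        str(out_dir / "img_%05d.jpg"),
    ]
    subprocess.run(cmd, check=True)

# Example with placeholder paths:
# extract_frames("videos/abseiling/xyz.mp4", "frames/abseiling/xyz")
```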
Code
The GC-TSN, GC-TSM, GC-GST and GC-TDN code is based on the TSN, TSM, GST and TDN codebases, respectively.
Pretrained Models
Here we provide some of the pretrained models. In the tables below, "Frame * view * clip" denotes the number of sampled frames per clip, the number of spatial crops (views) and the number of temporal clips used at inference; "(8+16)" rows are ensembles of the corresponding 8-frame and 16-frame models and therefore have no separate checkpoint.
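A minimal sketch of loading one of the released checkpoints is given below. The filename, the `state_dict` wrapping and the `module.` prefix handling are assumptions; the torchvision ResNet-50 is only a stand-in for the actual GC model constructor in the released code.

```python
import torch
from torchvision.models import resnet50

# Placeholder backbone: stands in for the real GC model (e.g. GC-TSM ResNet-50)
# built by the released code; swap in the actual constructor to use the weights.
model = resnet50(num_classes=400)

ckpt = torch.load("gc_tsm_r50_8f_kinetics.pth", map_location="cpu")  # hypothetical filename
state_dict = ckpt.get("state_dict", ckpt)        # some checkpoints wrap the weights
state_dict = {k.replace("module.", "", 1): v     # strip a DataParallel prefix if present
              for k, v in state_dict.items()}
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"{len(missing)} missing / {len(unexpected)} unexpected keys")
model.eval()
```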
Kinetics-400
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-TSN ResNet50 | 8 * 1 * 10 | 75.2% | 92.1% | link |
GC-TSM ResNet50 | 8 * 1 * 10 | 75.4% | 91.9% | link |
GC-TSM ResNet50 | 16 * 1 * 10 | 76.7% | 92.9% | link |
GC-TSM ResNet50 | 16 * 3 * 10 | 77.1% | 92.9% | link |
GC-TDN ResNet50 | 8 * 3 * 10 | 77.3% | 93.2% | link |
GC-TDN ResNet50 | 16 * 3 * 10 | 78.8% | 93.8% | link |
GC-TDN ResNet50 | (8+16) * 3 * 10 | 79.6% | 94.1% | |
Something-Something
The Something-Something V1 & V2 datasets are highly temporal-related. Here, we report performance with 224×224 input resolution.
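For reference, a minimal sketch of a typical single-crop 224×224 test-time preprocessing pipeline is shown below; the resize/crop protocol and normalization statistics are common defaults, not necessarily those of the released evaluation code.

```python
import torch
from torchvision import transforms

# Typical ImageNet statistics; the released code may use different values.
MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

frame_transform = transforms.Compose([
    transforms.Resize(256),        # resize the short side
    transforms.CenterCrop(224),    # single 224x224 center view
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

def preprocess_clip(pil_frames):
    """Stack T sampled PIL frames into a (T, C, H, W) tensor for one clip."""
    return torch.stack([frame_transform(f) for f in pil_frames], dim=0)
```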
Something-Something-V1
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 2 | 48.8% | 78.5% | link |
GC-GST ResNet50 | 16 * 1 * 2 | 50.4% | 79.4% | link |
GC-GST ResNet50 | (8+16) * 1 * 2 | 52.5% | 81.3% | |
GC-TSN ResNet50 | 8 * 1 * 2 | 49.7% | 78.2% | link |
GC-TSN ResNet50 | 16 * 1 * 2 | 51.3% | 80.0% | link |
GC-TSN ResNet50 | (8+16) * 1 * 2 | 53.7% | 81.8% | |
GC-TSM ResNet50 | 8 * 1 * 2 | 51.1% | 79.4% | link |
GC-TSM ResNet50 | 16 * 1 * 2 | 53.1% | 81.2% | link |
GC-TSM ResNet50 | (8+16) * 1 * 2 | 55.0% | 82.6% | |
GC-TSM ResNet50 | (8+16) * 3 * 2 | 55.3% | 82.7% | |
GC-TDN ResNet50 | 8 * 1 * 1 | 53.7% | 82.2% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 55.0% | 82.3% | link |
GC-TDN ResNet50 | (8+16) * 1 * 1 | 56.4% | 84.0% | |
Something-Something-V2
Model | Frame * view * clip | Top-1 Acc. | Top-5 Acc. | Checkpoint |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 2 | 61.9% | 87.8% | link |
GC-GST ResNet50 | 16 * 1 * 2 | 63.3% | 88.5% | link |
GC-GST ResNet50 | (8+16) * 1 * 2 | 65.0% | 89.5% | |
GC-TSN ResNet50 | 8 * 1 * 2 | 62.4% | 87.9% | link |
GC-TSN ResNet50 | 16 * 1 * 2 | 64.8% | 89.4% | link |
GC-TSN ResNet50 | (8+16) * 1 * 2 | 66.3% | 90.3% | |
GC-TSM ResNet50 | 8 * 1 * 2 | 63.0% | 88.4% | link |
GC-TSM ResNet50 | 16 * 1 * 2 | 64.9% | 89.7% | link |
GC-TSM ResNet50 | (8+16) * 1 * 2 | 66.7% | 90.6% | |
GC-TSM ResNet50 | (8+16) * 3 * 2 | 67.5% | 90.9% | |
GC-TDN ResNet50 | 8 * 1 * 1 | 64.9% | 89.7% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 65.9% | 90.0% | link |
GC-TDN ResNet50 | (8+16) * 1 * 1 | 67.8% | 91.2% | |
Diving48
Model | Frame * view * clip | Top-1 Acc. | Checkpoint |
---|---|---|---|
GC-GST ResNet50 | 16 * 1 * 1 | 82.5% | link |
GC-TSN ResNet50 | 16 * 1 * 1 | 86.8% | link |
GC-TSM ResNet50 | 16 * 1 * 1 | 87.2% | link |
GC-TDN ResNet50 | 16 * 1 * 1 | 87.6% | link |
EGTEA Gaze+
Model | Frame * view * clip | Split1 | Split2 | Split3 |
---|---|---|---|---|
GC-GST ResNet50 | 8 * 1 * 1 | 65.5% | 61.6% | 60.6% |
GC-TSN ResNet50 | 8 * 1 * 1 | 66.4% | 64.6% | 61.4% |
GC-TSM ResNet50 | 8 * 1 * 1 | 66.5% | 66.1% | 62.6% |
GC-TDN ResNet50 | 8 * 1 * 1 | 65.0% | 61.8% | 61.0% |
Train
We provide several examples to train the GC models (e.g., GC-TSM) with this repo; the corresponding scripts will be released together with the complete code.
Test
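As a companion to the "Frame * view * clip" protocol used in the tables above, here is a minimal sketch of multi-clip/multi-view evaluation: scores from every sampled clip and spatial crop of a video are averaged before taking the top-1 prediction. The model interface and batch layout are assumptions; the released test scripts define the actual sampling and input format.

```python
import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Multi-clip / multi-view evaluation.

    Assumes each batch is (video, label) where `video` has shape
    (B, n_views * n_clips, T, C, H, W) and the model accepts (N, T, C, H, W).
    Scores are averaged over the view/clip axis before computing top-1 accuracy.
    """
    model.eval()
    correct = total = 0
    for video, label in loader:
        b, v, t, c, h, w = video.shape
        logits = model(video.view(b * v, t, c, h, w).to(device))   # (B*V, num_classes)
        scores = logits.softmax(dim=1).view(b, v, -1).mean(dim=1)  # average views/clips
        correct += (scores.argmax(dim=1).cpu() == label).sum().item()
        total += b
    return correct / total
```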
Contributors
The GC code is jointly written and owned by Dr. Yanbin Hao and Dr. Hao Zhang.
Citing
@inproceedings{gc2022,
  title={Group Contextualization for Video Recognition},
  author={Hao, Yanbin and Zhang, Hao and Ngo, Chong-Wah and He, Xiangnan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}
Acknowledgement
Thanks to the following GitHub projects: