NNNNAI/video-long-term-feature-banks

Long-Term Feature Banks for Detailed Video Understanding

Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick
In CVPR 2019. [Paper]

This is a Caffe2 based implementation for our CVPR 2019 paper on Long-Term Feature Banks (LFB). LFB provides supportive information extracted over the entire span of a video, to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D CNNs with an LFB yields state-of-the-art results on AVA, EPIC-Kitchens, and Charades.

Data Preparation and Installation

Please see DATASET.md, INSTALL.md for instructions.

Training and Inference

Please see GETTING_STARTED.md for details.

Results

The following documents a collection of models trained with this repository. Links to the trained models and output log files are provided. The performance is evaluated on the validation set.

Note that all models here are not the original models used in paper, but reproduced by this code base. The reproduced performance reported here is very close to (or slightly better than) what's reported in paper, but not exactly the same due to the stochastic nature of training.

AVA

config	backbone	method	mAP	model id	model
ava_r50_baseline	R50-I3D-NL	3D CNN	22.2	102760666	`model`
ava_r50_lfb_avg	R50-I3D-NL	LFB-Avg	23.3	103505104	`model`, `lfb model`
ava_r50_lfb_max	R50-I3D-NL	LFB-Max	23.9	103505159	`model`, `lfb model`
ava_r50_lfb_nl	R50-I3D-NL	LFB-NL-2L	25.8	102824705	`model`, `lfb model`
ava_r50_lfb_nl_3l	R50-I3D-NL	LFB-NL-3L	25.9	106403526	`model`, `lfb model`
ava_r101_baseline	R101-I3D-NL	3D CNN	23.2	102760714	`model`
ava_r101_lfb_nl_3l	R101-I3D-NL	LFB-NL-3L	26.9 (multi-crop: 27.7)	105206523	`model`, `lfb model`

EPIC Kitchens Verb

config	backbone	method	top1	top5	model id	model
epic_verb_r50_baseline	R50-I3D-NL	3D CNN	50.7	81.1	103704809	`model`
epic_verb_r50_lfb_avg	R50-I3D-NL	LFB-Avg	52.9	82.5	103777391	`model`, `lfb model`
epic_verb_r50_lfb_max	R50-I3D-NL	LFB-Max	53.3	81.0	103777432	`model`, `lfb model`
epic_verb_r50_lfb_nl	R50-I3D-NL	LFB-NL	52.3	81.8	103777046	`model`, `lfb model`

EPIC Kitchens Noun

config	backbone	method	top1	top5	model id	model
epic_noun_r50_baseline	R50-I3D-NL	3D CNN	26.2	51.0	104421642	`model`
epic_noun_r50_lfb_avg	R50-I3D-NL	LFB-Avg	29.1	56.3	103875866	`model`
epic_noun_r50_lfb_max	R50-I3D-NL	LFB-Max	32.0	56.5	103875899	`model`
epic_noun_r50_lfb_nl	R50-I3D-NL	LFB-NL	29.5	55.4	103706990	`model`

EPIC Kitchens Action

config	backbone	method	top1	top5
epic_verb_r50_baseline & epic_noun_r50_baseline	R50-I3D-NL	3D CNN	19.4	38.1
epic_verb_r50_lfb_avg & epic_noun_r50_lfb_avg	R50-I3D-NL	LFB-Avg	21.2	41.3
epic_verb_r50_lfb_max & epic_noun_r50_lfb_max	R50-I3D-NL	LFB-Max	22.9	41.2
epic_verb_r50_lfb_nl & epic_noun_r50_lfb_nl	R50-I3D-NL	LFB-NL	21.8	40.5

Note: To make action predictions, we combine a verb model and a noun model, as opposed to training a separate action model. Performance in this table is computed using the verb/noun models from the tables above. Please see GETTING_STARTED.md for instructions on how to do this.

Charades

config	backbone	method	mAP	model id	model
charades_r50_baseline	R50-I3D-NL	3D CNN	38.3	102766107	`model`
charades_r50_lfb_avg	R50-I3D-NL	LFB-Avg	38.4	102999065	`model`, `lfb model`
charades_r50_lfb_max	R50-I3D-NL	LFB-Max	38.6	102999121	`model`, `lfb model`
charades_r50_lfb_nl	R50-I3D-NL	LFB-NL	40.3	100866795	`model`, `lfb model`
charades_r101_baseline	R101-I3D-NL	3D CNN	40.4	103560426	`model`
charades_r101_lfb_avg	R101-I3D-NL	LFB-Avg	40.8	103676713	`model`, `lfb model`
charades_r101_lfb_max	R101-I3D-NL	LFB-Max	41.0	103676788	`model`, `lfb model`
charades_r101_lfb_nl	R101-I3D-NL	LFB-NL	42.5	103641815	`model`, `lfb model`

License

Video-long-term-feature-banks is Apache 2.0 licensed, as found in the LICENSE file.

Citation

@inproceedings{lfb2019,
  Author    = {Chao-Yuan Wu and Christoph Feichtenhofer and Haoqi Fan
               and Kaiming He and Philipp Kr\"{a}henb\"{u}hl and
               Ross Girshick},
  Title     = {{Long-Term Feature Banks for Detailed Video Understanding}},
  Booktitle = {{CVPR}},
  Year      = {2019}}

NNNNAI / video-long-term-feature-banks