Multi-Modal Multi-Action Video Recognition

We release the code of the "Multi-Modal Multi-Action Video Recognition" framework proposed in our ICCV 2021 paper and its extended work.

The main idea of the framework is to explore multi-action relations by exploiting the multi-modal information in videos. We adopt a GCN (Graph Convolutional Network) and a Transformer network as the relation learners, and also implement a multi-modal joint learning strategy for multi-action video recognition. The core implementation of the framework is in lib/models/m3a_helper.py, lib/models/m3a_relation_learner.py, and lib/models/m3a_net_builder.py.
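For orientation, below is a minimal PyTorch sketch of the general pattern described above: a small GCN over learnable action-label embeddings serves as the relation learner, and clip features from multiple modalities are fused and scored against the relation-aware action embeddings to produce multi-action logits. All class names, dimensions, and the adjacency handling here are illustrative assumptions, not the repository's actual code; see lib/models/m3a_relation_learner.py and lib/models/m3a_net_builder.py for the real implementation.

```python
# Illustrative sketch only; names and shapes are hypothetical and do not
# mirror the repository's implementation.
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """One graph-convolution layer: H' = relu(A @ H @ W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: (num_actions, in_dim), adj: (num_actions, num_actions)
        return torch.relu(adj @ self.linear(x))


class MultiModalMultiActionHead(nn.Module):
    """Fuses per-modality clip features and scores them against
    relation-aware action embeddings produced by a small GCN."""

    def __init__(self, num_actions, label_dim, feat_dims, hidden_dim):
        super().__init__()
        # Learnable action-label embeddings (nodes of the relation graph).
        self.label_emb = nn.Parameter(torch.randn(num_actions, label_dim))
        self.gcn1 = GraphConv(label_dim, hidden_dim)
        self.gcn2 = GraphConv(hidden_dim, hidden_dim)
        # One projection per modality (e.g. RGB, audio) into a shared space.
        self.proj = nn.ModuleList(nn.Linear(d, hidden_dim) for d in feat_dims)

    def forward(self, modality_feats, adj):
        # modality_feats: list of (batch, feat_dim) clip-level features,
        # one tensor per modality, fused here by simple averaging.
        fused = torch.stack(
            [p(f) for p, f in zip(self.proj, modality_feats)], dim=0
        ).mean(dim=0)                                  # (batch, hidden_dim)
        actions = self.gcn2(self.gcn1(self.label_emb, adj), adj)
        return fused @ actions.t()                     # (batch, num_actions) logits


if __name__ == "__main__":
    num_actions = 10
    head = MultiModalMultiActionHead(num_actions, 64, feat_dims=[2048, 512], hidden_dim=256)
    adj = torch.eye(num_actions)   # placeholder adjacency; a real graph would encode label relations
    rgb = torch.randn(4, 2048)     # e.g. features from a video backbone
    audio = torch.randn(4, 512)    # e.g. features from an audio backbone
    logits = head([rgb, audio], adj)
    print(logits.shape)            # torch.Size([4, 10])
```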

Preparation

Please follow the instructions in PREPARATION.md for installation and data preparation.

Get Started

Please refer to GETTING_STARTED.md to run the framework.

Model Zoo

The models and results are provided in MODEL_ZOO.md.

License

This project is released under the Apache License 2.0. Please see the LICENSE file for more details.

Acknowledgement

We really appreciate the contributors of the following codebases.

Citation

@inproceedings{shi2021multi,
  title={Multi-Modal Multi-Action Video Recognition},
  author={Shi, Zhensheng and Liang, Ju and Li, Qianqian and Zheng, Haiyong and Gu, Zhaorui and Dong, Junyu and Zheng, Bing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13678--13687},
  year={2021}
}
