This repository contains code for a video action recognition model trained on the UCF-101 dataset.
We split each video into chunks of 16 frames with a stride of 8, so consecutive chunks overlap by 8 frames. We train the model on all chunks of a video; at test time we average the predictions over all chunks of a video to get a single prediction for that video. This is a form of temporal data augmentation for video classification and helps the model generalize better.
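The chunking and test-time averaging described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `chunk_starts`, `split_into_chunks` and `average_predictions` are hypothetical, and dropping trailing frames that do not fill a whole chunk is an assumption.

```python
def chunk_starts(num_frames, chunk_len=16, stride=8):
    """Start indices of all full chunks in a video with num_frames frames.

    Assumption: frames at the end that do not fill a whole chunk are dropped,
    and videos shorter than chunk_len produce no chunks.
    """
    if num_frames < chunk_len:
        return []
    return list(range(0, num_frames - chunk_len + 1, stride))


def split_into_chunks(frames, chunk_len=16, stride=8):
    """Split a sequence of frames into overlapping chunks."""
    starts = chunk_starts(len(frames), chunk_len, stride)
    return [frames[s:s + chunk_len] for s in starts]


def average_predictions(chunk_scores):
    """Average per-chunk class scores into one score vector for the video.

    chunk_scores: non-empty list of equal-length lists, one per chunk.
    """
    n = len(chunk_scores)
    return [sum(per_class) / n for per_class in zip(*chunk_scores)]
```

With these defaults, a 40-frame video yields chunks starting at frames 0, 8, 16 and 24. The predicted class for the video would be the index of the largest averaged score.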
We've implemented a ConvNet + LSTM model for action recognition like here
We use the ResNet-18 model from torchvision, pretrained on ImageNet, as a feature extractor. The last fully connected (classification) layer is removed, and we use the 512 features per frame taken from the activations of the last ResNet layer. The features for the 16 frames of a chunk are then fed to a many-to-one Batch-Norm LSTM, which performs the final classification.
- Python 3+
- CPU or NVIDIA GPU with CUDA and cuDNN
- PyTorch (with torchvision)