Video Action Recognition using ConvNet + LSTM

This repository contains the code for a video action recognition model trained on the UCF-101 dataset.

Preprocessing

We split each video into chunks of 16 frames with a stride of 8, so consecutive chunks overlap by 8 frames. The model is trained on all chunks of a video; at test time, the predictions from all chunks of a video are averaged to give a single prediction for that video. This acts as a form of temporal data augmentation for video classification and helps the model generalize better.
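The chunking and test-time averaging described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `chunk_starts` and `average_predictions` are hypothetical, and dropping trailing frames that do not fill a whole chunk is an assumption.

```python
def chunk_starts(num_frames, chunk_len=16, stride=8):
    """Return the start index of every full chunk in a video.

    Assumption: trailing frames that do not fill a whole chunk are dropped.
    """
    return list(range(0, num_frames - chunk_len + 1, stride))


def average_predictions(chunk_scores):
    """Average per-chunk class scores into one score vector per video.

    chunk_scores: non-empty list of per-chunk score lists, all the same length.
    """
    n = len(chunk_scores)
    return [sum(column) / n for column in zip(*chunk_scores)]


if __name__ == "__main__":
    # A 40-frame video yields chunks starting at frames 0, 8, 16 and 24.
    print(chunk_starts(40))
    # Two chunks voting for different classes average to an even split.
    print(average_predictions([[1.0, 0.0], [0.0, 1.0]]))
```

A video shorter than 16 frames produces no chunks under this sketch, so such videos would need padding or some other handling.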

Architecture

We've implemented a ConvNet + LSTM model for action recognition.

We use the ResNet-18 model from torchvision, pretrained on ImageNet, as a feature extractor. The last fully connected classification layer is removed, and we use the 512 features per frame taken from the activations of the last remaining ResNet layer. The features for the 16 frames of a chunk are then fed to a many-to-one Batch-Norm LSTM, which performs the final classification.
