Sign Language Prediction with Two-Stream CNNs

Video action recognition with two-stream CNN's using MS-ASL dataset.

Overview

The model consists of two CNNs, one using a single RGB frame and another one using stacked grayscale optical flow images generated from the video. These models are fused before the last fully-connected layer. (Early-fusion)

Data

Detailed description of the dataset can be found in the official paper.

Hyper-Parameters

Learning rate: 0.001
Number of epochs: 32
Batch size: 64
Loss function: Cross entropy loss
Optimizer: Adam

Network Architecture

Two-stream CNN architecture is the one proposed in the paper by Karen Simonyan and Andrew Zisserman. Only difference is the usage of early fusion instead of late fusion.

About

Sign language recognition using MS-ASL dataset.

Languages

Language:Jupyter Notebook 100.0%