This repository contains the code of my submission to the ChaLearn 2021 Looking At People: Signer Independent Isolated Sign Language Recognition challenge.
| Track | Development Score | Test Score |
|-------|-------------------|------------|
| RGB   | 0.9083            | 0.9292     |
| RGB-D | 0.9167            | 0.9332     |
This PyTorch code allows you to reproduce my results by training the model yourself, or by using the provided pre-trained model weights in the Releases section.
If you wish to train the models yourself, I recommend that you download the OpenPose keypoints and pose flow files from the Releases section, and use those along with the original MP4 files. This saves you the trouble of extracting them yourself using OpenPose.
The process for reproducing my results is explained below.
- Set up a virtual environment with the requirements
- Prepare the dataset folder
- Download and extract the keypoints and pose flow
- Alternatively, extract the keypoints yourself using OpenPose
- Train the model
- Perform inference to obtain the predictions
If you use the pre-trained model checkpoints, you can skip the training step.
This code base has the following dependencies:
- Python 3.8.5
- PyTorch 1.7.1
- Torchvision 0.8.2 with PyAV 8.0.2
- PyTorch Lightning 1.1.1
- OpenCV-Python 4.3.0.36
Older and newer versions of these dependencies may work as well but are not tested.
I recommend you create a virtual environment and install the dependencies using:
python3 -m venv .env
source .env/bin/activate
pip install --upgrade pip # Make sure you have the latest version of pip
pip install -r src/requirements.txt
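After installation, you can optionally check that the key dependencies are importable and report their versions, for example:

```
python -c "import torch, torchvision, pytorch_lightning, cv2, av; print(torch.__version__, pytorch_lightning.__version__, cv2.__version__)"
```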
For GPUs with limited memory capacity, you can reduce the batch size by a factor `n` and use the `--accumulate-grad-batches n` option during training. We use `--batch_size 4 --accumulate-grad-batches 8` to emulate a batch size of 32.

Using more than 1 GPU is currently not supported! If you run out of VRAM, use the options above.
If you wish to reproduce our results, you will need to prepare some folders for the data. First, create the following directory hierarchy (the absolute location of `project` does not matter, and neither does its name):
project/
project/data
project/data/mp4
project/data/kp
project/data/kpflow2
This can be done using the command
mkdir -p project/data/{mp4,kp,kpflow2}
Copy the CSV files from the `data/` directory in this repository to the `project/` directory. These are the prediction templates and the label files. See also `data/README.md` for more information.

Create a `train`, `val` and `test` directory under `project/data/mp4`:
mkdir -p project/data/mp4/{train,val,test}
Place the corresponding MP4 files there. They can be found on the competition website.
These directories will also be created under `kp` and `kpflow2` when the corresponding feature extraction code is executed. You do not need to manually create them.
Finally, run the `count_frames.py` script with the path to the `mp4` directory as its argument, e.g.,
python count_frames.py --input_dir project/data/mp4
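For reference, frame counting essentially boils down to the following minimal OpenCV sketch. This is only an illustration of the idea; use the repository's `count_frames.py`, since its output is what the training code expects.

```
import glob
import cv2

# Illustration only: count the frames of every RGB video with OpenCV.
for path in sorted(glob.glob('project/data/mp4/*/*_color.mp4')):
    cap = cv2.VideoCapture(path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    print(path, num_frames)
```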
Your dataset is now prepared; you can either extract the keypoint files yourself, or use the ones provided by us.
We provide keypoint and pose flow files in the Releases section to allow for easy reproduction of the results. Note that these are large archives, so I have split them into multiple parts. You can extract them using
cat kp.tar.bz2.* | tar -jxv
cat kpflow2.tar.bz2.* | tar -jxv
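After extraction, a quick sanity check is to compare the number of extracted items against the number of MP4 files. A minimal sketch, assuming the archives mirror the `train`/`val`/`test` layout described above:

```
import glob

# Illustration only: report how many entries each data directory contains
# after extracting the archives, as a rough sanity check.
for subdir in ('mp4', 'kp', 'kpflow2'):
    for split in ('train', 'val', 'test'):
        entries = glob.glob(f'project/data/{subdir}/{split}/*')
        print(subdir, split, len(entries))
```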
Then you can proceed with the dataset preparations below without needing to run OpenPose and pose flow extraction yourself. That is, you can skip the next section and go directly to training the models.
You can also reproduce our results from scratch, starting from the MP4 files. In this case, you will need to extract the keypoints and pose flow yourself.

First, you need to install the OpenPose demo and download the BODY-135 model. You can then extract the keypoints with the OpenPose demo using the following script (modify `'0,1'` to match the available GPUs on your machine).
import glob
import os

# Find all RGB videos in the train, val and test splits.
all_files = glob.glob('project/data/mp4/*/*_color.mp4')
CALL_STRING = 'CUDA_VISIBLE_DEVICES={} ./openpose.bin --render_pose 0 --number_people_max 1 --display 0 --video {} --write_json {} --model_pose BODY_135'
for sample in all_files:
    # Derive the keypoint output directory from the video path, e.g.,
    # project/data/mp4/train/x_color.mp4 -> project/data/kp/train/x
    # (strip the suffix first so that replacing 'mp4' only affects the directory).
    out_dir = sample.replace('_color.mp4', '').replace('mp4', 'kp')
    os.makedirs(out_dir, exist_ok=True)
    c = CALL_STRING.format('0,1', sample, out_dir)
    os.system(c)
Keypoints will be available as directories of JSON files in a `kp` directory on the same level as the `mp4` directory.
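Each OpenPose JSON file corresponds to a single frame. If you want to inspect the keypoints, a file can be read roughly as follows; this is a minimal sketch that assumes the standard OpenPose output format, where keypoints are stored as flat (x, y, confidence) triples.

```
import json
import numpy as np

# Illustration only: read one OpenPose JSON file and reshape the detected
# keypoints into an (N, 3) array of (x, y, confidence) rows.
def load_keypoints(json_path):
    with open(json_path) as f:
        data = json.load(f)
    if not data['people']:  # no person detected in this frame
        return None
    kp = np.array(data['people'][0]['pose_keypoints_2d'], dtype=np.float32)
    return kp.reshape(-1, 3)
```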
The model also requires pose flow, which can be computed from these keypoint files.
cd src
python extract_poseflow.py --input_dir project/data/kp
Pose flow will be available as `.npy` files in a `kpflow2` directory on the same level as the `mp4` and `kp` directories.
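Conceptually, pose flow describes how the keypoints move between consecutive frames. The sketch below only illustrates that idea; it is not the actual `extract_poseflow.py` implementation, which also deals with normalization, low-confidence detections and the on-disk format.

```
import numpy as np

# Illustration only: given keypoints of shape (T, N, 3) for T frames and
# N joints (x, y, confidence), compute frame-to-frame displacements.
def naive_pose_flow(keypoints):
    xy = keypoints[:, :, :2]   # keep only the x and y coordinates
    return xy[1:] - xy[:-1]    # displacement between frames t and t+1, shape (T - 1, N, 2)
```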
For training, you should create a log directory, to which the experiment details as well as Tensorboard event files will be written. We will assume that this log directory exists at `$LOG_DIR`.

In our case we use 4 workers for the dataset loading, but you can set this according to your CPU's capacity. We will assume that this is set as `$NUM_WORKERS`.

You should download the data yourself and provide its path to our scripts, specifically the path to the `.mp4` files. We will assume that this data directory exists at `$DATA_DIR`.
So, for the setup described above:
$ echo $DATA_DIR
project/data/mp4
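For example, you could set these variables as follows (adjust the values to your own setup):

```
export LOG_DIR=/path/to/logs
export NUM_WORKERS=4
export DATA_DIR=project/data/mp4
```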
For training on RGB data, use this command.
python -m train --log_dir $LOG_DIR --model VTN_HCPF --dataset handcrop_poseflow --num_workers $NUM_WORKERS \
--data_dir $DATA_DIR --sequence_length 16 --temporal_stride 2 --learning_rate 1e-4 \
--gradient_clip_val=1 --gpus 1 --cnn rn34 --num_layers 4 --num_heads 8 \
--batch_size 4 --accumulate-grad-batches 8
For training on RGB-D data, use this command.
python -m train --log_dir $LOG_DIR --model VTN_HCPF_D --dataset handcrop_poseflow_rgbd --num_workers $NUM_WORKERS \
--data_dir $DATA_DIR --sequence_length 16 --temporal_stride 2 --learning_rate 1e-4 \
--gradient_clip_val=1 --gpus 1 --cnn rn34 --num_layers 4 --num_heads 8 \
--batch_size 4 --accumulate-grad-batches 8
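Training progress can be monitored with TensorBoard, since event files are written to the log directory:

```
tensorboard --logdir $LOG_DIR
```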
The `predict.py` script can be used to perform inference. This requires an additional prediction template file provided by the challenge organizers. We will assume this file exists at `$PREDICTION_TEMPLATE`.

After training, you should have a checkpoint file at `$CHECKPOINT_PATH`. You can predict using
python -m predict --log_dir $LOG_DIR --model $MODEL --dataset $DATASET --num_workers $NUM_WORKERS \
--data_dir $DATA_DIR --sequence_length 16 --temporal_stride 2 --learning_rate 1e-4 \
--gpus 1 --cnn rn34 --num_layers 4 --num_heads 8 --max_epochs $NUM_EPOCHS \
--checkpoint=$CHECKPOINT_PATH --submission_template $PREDICTION_TEMPLATE --out predictions.csv \
--batch_size 4
for the `$MODEL` and `$DATASET` of your choice. Alternatively, you can use one of the pre-trained models provided in the Releases section (the command remains the same).
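For example, to predict with the RGB model trained above, you could set the following (the model and dataset names match the RGB training command; use `VTN_HCPF_D` and `handcrop_poseflow_rgbd` for the RGB-D model, and adjust the paths to your setup):

```
export MODEL=VTN_HCPF
export DATASET=handcrop_poseflow
export CHECKPOINT_PATH=/path/to/checkpoint.ckpt
export PREDICTION_TEMPLATE=/path/to/prediction_template.csv
```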
This repository also provides prediction files for the models on both the validation and the test set. These can be found under the `predictions` directory.
This code is available under the MIT licence (see LICENCE). Part of the code base is based on the Intel OpenVINO toolkit (see LICENCE_OPENVINO).
If you found this code useful, please consider citing this paper:
@InProceedings{De_Coster_2021_CVPR,
author = {De Coster, Mathieu and Van Herreweghe, Mieke and Dambre, Joni},
title = {Isolated Sign Recognition From RGB Video Using Pose Flow and Self-Attention},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2021},
pages = {3441-3450}
}