WHO777 / kaggle_islr


6th place solution to the Google - Isolated Sign Language Recognition Competition

Authors: Theo Viel, Christof Henkel

Introduction - Adapted from Kaggle

Our solution is a two-model ensemble of MLP-encoder frame-transformer models. We pushed our transformer models close to their limit and implemented a lot of tricks to climb up to 6th place.

Preprocessing & Model

Preprocessing

  • Remove frames without fingers
  • Stride the sequence (keep 1 frame every n) so that its length is <= max_len. We used max_len=25 and max_len=80 in the final ensemble
  • The whole sequence is normalized to zero mean and unit std, with an extra centering applied before the landmark-type-specific MLPs. NaN values are set to 0
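
A minimal sketch of these steps (not the repository code; the tensor shapes, the landmark index layout, and the function name are assumptions):

```python
import torch

# Assumed landmark layout of the competition data (543 MediaPipe points):
# face 0-467, left hand 468-488, pose 489-521, right hand 522-542
HAND_IDX = list(range(468, 489)) + list(range(522, 543))

def preprocess(landmarks: torch.Tensor, max_len: int = 25) -> torch.Tensor:
    # landmarks: (n_frames, 543, 3), with NaNs for missing points
    # 1. Drop frames where both hands are entirely missing
    hands = landmarks[:, HAND_IDX]
    keep = ~hands.isnan().reshape(len(hands), -1).all(dim=-1)
    landmarks = landmarks[keep]

    # 2. Stride the sequence (keep 1 frame every n) so that len <= max_len
    if len(landmarks) > max_len:
        stride = -(-len(landmarks) // max_len)  # ceil division
        landmarks = landmarks[::stride]

    # 3. Normalize the whole sequence to zero mean / unit std (NaN-aware),
    #    then replace the remaining NaNs with 0
    vals = landmarks[~landmarks.isnan()]
    landmarks = (landmarks - vals.mean()) / (vals.std() + 1e-6)
    return torch.nan_to_num(landmarks, nan=0.0)
```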

Embedding

  • 2x 1D convolutions (k=5) to smooth the positions
  • Embed the landmark id and type (e.g. lips, right hand, ...), embed_dim=dense_dim=16
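
A hedged sketch of how such an embedding stage could look; beyond k=5 and embed_dim=16 from the bullets above, the depthwise-convolution choice and all names are assumptions:

```python
import torch
import torch.nn as nn

class LandmarkEmbedding(nn.Module):
    def __init__(self, n_landmarks=543, n_types=5, embed_dim=16):
        super().__init__()
        # Two 1D convolutions (k=5) smoothing each coordinate over time
        self.smooth = nn.Sequential(
            nn.Conv1d(3, 3, kernel_size=5, padding=2, groups=3),
            nn.Conv1d(3, 3, kernel_size=5, padding=2, groups=3),
        )
        self.id_embed = nn.Embedding(n_landmarks, embed_dim)
        self.type_embed = nn.Embedding(n_types, embed_dim)  # lips, hand, ...

    def forward(self, pos, ids, types):
        # pos: (bs, n_frames, n_landmarks, 3); ids / types: (n_landmarks,)
        bs, t, n, _ = pos.shape
        x = pos.permute(0, 2, 3, 1).reshape(bs * n, 3, t)
        x = self.smooth(x).reshape(bs, n, 3, t).permute(0, 3, 1, 2)
        ide = self.id_embed(ids)[None, None].expand(bs, t, -1, -1)
        te = self.type_embed(types)[None, None].expand(bs, t, -1, -1)
        return torch.cat([x, ide, te], dim=-1)
```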

Feature extractor

  • One MLP combining all the features, and 4 for specific landmark types (2x hands, face, lips)
  • Max aggregation over the two hand branches takes into account that signers use one hand
  • dim=192, dropout=0.25
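
A sketch of the idea, assuming flattened per-frame features for each branch; the input dims and all names are illustrative:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, d_all, d_hand, d_face, d_lips, dim=192, p=0.25):
        super().__init__()
        def mlp(d_in):
            return nn.Sequential(nn.Linear(d_in, dim), nn.ReLU(), nn.Dropout(p))
        self.all_mlp = mlp(d_all)  # one MLP combining all the features
        self.left_mlp, self.right_mlp = mlp(d_hand), mlp(d_hand)
        self.face_mlp, self.lips_mlp = mlp(d_face), mlp(d_lips)

    def forward(self, x_all, x_left, x_right, x_face, x_lips):
        # Max over the two hand branches: the active hand dominates,
        # which handles signers using either hand
        hands = torch.maximum(self.left_mlp(x_left), self.right_mlp(x_right))
        return torch.cat([self.all_mlp(x_all), hands,
                          self.face_mlp(x_face), self.lips_mlp(x_lips)], -1)
```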

Transformer

  • Deberta was better than Bert, but we had to rewrite the attention layer for it to be efficient
  • To reduce the number of parameters, we use a smaller first transformer layer and modify the output layer to upscale/downscale the features. This was key to enabling the 2-model blend
  • d_in=512, d=256 for max_len=25, d=64 for max_len=80, num_heads=16, dropout=0.05 for the first layer, 0.1 for the other 2
  • Unfortunately, we did not have any luck with using a pre-trained version of Deberta, for example by importing some of the pretrained weights
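
The layer-resizing trick could look as follows. This sketch uses plain nn.TransformerEncoderLayer in place of the custom Deberta layers, approximates the modified output layer with an explicit projection between layers, and uses illustrative dims:

```python
import torch.nn as nn

class ResizingTransformer(nn.Module):
    def __init__(self, d_in=512, dims=(256, 512, 512), num_heads=16):
        super().__init__()
        projs, layers, prev = [], [], d_in
        for d in dims:
            # Rescale features whenever consecutive layers differ in width
            projs.append(nn.Linear(prev, d) if prev != d else nn.Identity())
            layers.append(nn.TransformerEncoderLayer(
                d_model=d, nhead=num_heads, batch_first=True))
            prev = d
        self.projs, self.layers = nn.ModuleList(projs), nn.ModuleList(layers)

    def forward(self, x):
        # x: (bs, seq_len, d_in)
        for proj, layer in zip(self.projs, self.layers):
            x = layer(proj(x))
        return x
```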

Training strategy

Augmentations:

  • Horizontal Flip (p=0.5)
  • Rotate around (0, 0, 0) by an angle between -60 and 60° (p=0.5)
  • Resizing by a factor in (0.7, 1.3) (p=0.5). We also allow for distortion (p=0.5)
  • Crop 20% of the start or end (p=0.5)
  • Interpolate to fill missing values (p=0.5)
  • Manifold Mixup [1] (scheduled, p=0.5 * epoch / (0.9 * n_epochs)): randomly apply mixup to the features before one of the transformer layers
  • The following two augmentations are used only during the first half of training, since this improved convergence:
    • Fill the values of the missing hand with those of the existing one (p=0.25)
    • Face CutMix: replace the face landmarks with those of another signer doing the same sign (p=0.25)
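
A sketch of three of the geometric augmentations (flip, rotation, temporal crop), assuming landmarks of shape (n_frames, n_landmarks, 3) centered at the origin:

```python
import math, random
import torch

def augment(landmarks: torch.Tensor) -> torch.Tensor:
    # Horizontal flip (p=0.5); the real pipeline also has to swap the
    # left / right landmark indices for the flip to stay consistent
    if random.random() < 0.5:
        landmarks[..., 0] = -landmarks[..., 0]
    # Rotate around (0, 0, 0) by an angle in [-60°, 60°] (p=0.5);
    # only the in-plane (x, y) rotation is shown for brevity
    if random.random() < 0.5:
        a = math.radians(random.uniform(-60, 60))
        rot = torch.tensor([[math.cos(a), -math.sin(a)],
                            [math.sin(a),  math.cos(a)]])
        landmarks[..., :2] = landmarks[..., :2] @ rot.T
    # Crop 20% of the start or end (p=0.5)
    if random.random() < 0.5:
        n_crop = int(0.2 * len(landmarks))
        if random.random() < 0.5:
            landmarks = landmarks[n_crop:]
        else:
            landmarks = landmarks[: len(landmarks) - n_crop]
    return landmarks
```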

Training

  • 100 epochs, lr=3e-4, 25% warmup, linear schedule
  • Cross entropy with smoothing (eps=0.3)
  • weight_decay=0.4, batch_size=32 (x8 GPUs)
  • OUSM [2], i.e. exclude the top k (k=3) samples with the highest loss from the loss computation (see the sketch after this list)
  • Mean teacher [3] & knowledge distillation: we train 3 models at the same time and use the distilled one for inference
  • Model soup [4] over the checkpoints of the last 10 epochs
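
Sketches of the OUSM loss and the uniform model soup, assuming standard PyTorch; the function names are ours, not the repository's:

```python
import torch
import torch.nn.functional as F

def ousm_smoothed_ce(logits, targets, k=3, eps=0.3):
    # Label-smoothed cross entropy with OUSM: drop the k samples with
    # the highest loss from the batch (presumed noisy labels)
    losses = F.cross_entropy(logits, targets, label_smoothing=eps,
                             reduction="none")
    if k > 0 and len(losses) > k:
        losses, _ = losses.topk(len(losses) - k, largest=False)
    return losses.mean()

def model_soup(state_dicts):
    # Uniform soup: average the weights of the last checkpoints
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
            for k in state_dicts[0]}
```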

How to use the repository

Prerequisites

  • Clone the repository

  • Requirements:

    • pip install -r requirements.txt
    • Standard deep learning packages; see requirements.txt for the full list
  • Download the data :

    • Put the competition data from Kaggle in the input folder

Run the pipeline

Data preparation

Prepare the data using Preparation_torch.ipynb. For our best submissions, we used max_len=25 (folder name is torch_12) and max_len=80 (folder name is torch_18).

Training

Train models using train_ddp.sh, and specify the config in the src/main.py file. We used 8x V100 GPUs, but training can be done on lower-end hardware.

You can keep the parameters in the Config class as-is. To reproduce our best submissions, use the following parameters:

  • max_len = 25 -> dense_dim = 512, transfo_dim = 1536, distill_transfo_dim=768
  • max_len = 80 -> dense_dim = 256, transfo_dim = 1024, distill_transfo_dim=576
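
For illustration, the max_len=25 overrides could look like this in the Config class (a hypothetical snippet; only the four attributes listed above come from this README):

```python
# Hypothetical override of the Config class in src/main.py for the
# max_len=25 model; attribute names are taken from the list above
class Config:
    max_len = 25
    dense_dim = 512
    transfo_dim = 1536
    distill_transfo_dim = 768
    # ... keep the remaining parameters as-is
```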

We provide model configs, weights and predictions in the logs folder.

Evaluation & inference

Evaluate your models using Validation.ipynb. The second part of the notebook runs the inference on the validation data, and is used to evaluate distilled models with (or without) model soup. Conversion of the models to tflite is done using Inference_Ens.ipynb.

Code structure

If you wish to dive into the code, the repository naming should be straightforward. Each function is documented. The structure is the following:

src
├── data
│   ├── dataset.py              # Dataset class
│   ├── loader.py               # Dataloader
│   ├── preparation.py          # Data preparation
│   └── transforms.py           # Augmentations
├── inference           
│   ├── main.py                 # Main file
│   └── predict.py              # Predict function
├── model_zoo 
│   └── models.py               # Model
├── tflite                        
│   ├── deberta.py              # Custom deberta for efficient inference
│   ├── models.py               # Model for tflite conversion
│   └── prepro.py               # Preprocessing in Tensorflow
├── training      
│   ├── losses.py               # Loss functions
│   ├── main.py                 # k-fold main
│   ├── optim.py                # Model optimization
│   └── train.py                # Training functions
├── utils
│   ├── logger.py               # Logging utils
│   ├── metrics.py              # Metrics for the competition
│   ├── plot.py                 # Plotting utils
│   └── torch.py                # Torch utils
├── main.py                     # Training script
└── params.py                   # Main parameters

About

License: MIT

