ISCAP - Identifying Speaker Characteristics through Audio Profiling - HEIGHT ESTIMATION

Introduction

Installation:

Setting up environment

Install Kaldi

git clone -b 5.4 https://github.com/kaldi-asr/kaldi.git kaldi
cd kaldi/tools/ && make && cd ../src && ./configure && make depend && make

Install EspNet

git clone -b v.0.9.7 https://github.com/espnet/espnet.git
cd espnet/tools/        # change to tools folder
ln -s {kaldi_root}      # Create link to Kaldi, e.g. ln -s /home/theanhtran/kaldi/

Set up Conda environment

./setup_anaconda.sh anaconda espnet 3.7.9   # Create an Anaconda environment named espnet with Python 3.7.9
make TH_VERSION=1.8.0 CUDA_VERSION=10.2     # Install PyTorch 1.8.0 built for CUDA 10.2
. ./activate_python.sh; python3 check_install.py  # Check the installation
conda install torchvision==0.9.0 torchaudio==0.8.0 -c pytorch

Install Pytorch Lightning

conda install pytorch-lightning -c conda-forge

Download the project

Clone the project from GitHub into your workspace

git clone https://github.com/TonnyTran/ISCAP_Height_Estimation.git
cd ISCAP_Height_Estimation
ln -s {kaldi_root}/egs/wsj/s5/utils     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/utils
ln -s {kaldi_root}/egs/wsj/s5/steps     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/steps

Point to your espnet

Open the ISCAP_Height_Estimation/path.sh file and set MAIN_ROOT to your espnet directory, e.g. MAIN_ROOT=/home/theanhtran/espnet

How to run Height Estimation systems

Data preparation step

bash prepare_TIMIT_data.sh

This step downloads the TIMIT dataset as a .zip file, extracts it, and generates features in Kaldi format.

Run the program

bash run_height_estimation.sh program     # $program in {1, 2, 3, 4} indicates which program you want to run

program=1 => Model 1: LSTM + Cross_Attention + MSE_Loss | FBank Features | Height Estimation
program=2 => Model 2: LSTM + Cross_Attention + Center & MSE_Loss | FBank Features | Height Estimation
program=3 => Model 3: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
program=4 => Model 4: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)

Test the trained model

bash test_height_model.sh program     # $program in {1, 2, 3, 4}

Select which trained model to test by passing its program number (1, 2, 3, or 4).

Other instructions:

You may change hyper-parameters such as batch_size, max_epochs, early_stopping_patience, learning_rate, num_layers, and loss_criterion in the run.py file of any model.
Please note that if you are not using a GPU, you must set the gpu hyper-parameter in the trainer function (in the run.py files) to 0.

Models & Results:

This section summarizes all the models for height estimation on the TIMIT dataset.
All models use the following features:

Filter Bank: 80 FBank + 3 Pitch + 1 Binary_Gender (Features_Dimension: 83)

We also apply three normalization and augmentation steps to the data:

CMVN: Cepstral mean and variance normalization of the FBank features.
Speed Perturbation: Triples the training data by adding 0.9x and 1.1x speed-perturbed copies.
Spectral Augmentation: SpecAugment randomly masks 15%-25% of the input features for better generalization and robustness.
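As an illustration of the CMVN step, here is a minimal, framework-free sketch. It assumes the statistics are computed per utterance over a list of frames; the function name apply_cmvn and the eps value are hypothetical and not taken from the project, which relies on Kaldi for this step.

```python
import math


def apply_cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of frames, each a list of floats of equal length.
    Assumption: statistics are computed over this single utterance.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / num_frames
        for d in range(dim)
    ]
    # eps guards against division by zero for constant dimensions.
    stds = [math.sqrt(v + eps) for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


# Toy two-frame, two-dimensional "utterance".
normalized = apply_cmvn([[1.0, 10.0], [3.0, 30.0]])
print(normalized)
```

After normalization, every dimension has (approximately) zero mean and unit variance regardless of its original scale.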

Results:

S. No.	Model	Loss	Height MAE All (cm)	Height MAE Male (cm)	Height MAE Female (cm)
1.	LSTM + Cross_att	Mean Squared Error (MSE)	5.38	5.46	5.22
2.	LSTM + Cross_att	MSE + Center Loss	5.25	5.26	5.23
3.	LSTM + Cross_att	MSE + Triplet Loss	5.23	5.08	5.31
4.	LSTM + Cross_att	MSE (Age + Height)	5.36	5.40	5.26
Shareef (2020)	Comb3 (Fstats + formant + harmonic features (amplitude + frequency locations))			5.2	4.8
Singh (2016)	Random Forest			5.0	5.0
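The All, Male, and Female columns are mean absolute errors over the whole test set and over each gender subset. A minimal sketch of that computation follows; the function names and the toy values are assumptions for illustration, not the project's evaluation code or its results.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mae_by_group(targets, predictions, groups):
    """Return the overall MAE plus one MAE per distinct group label."""
    results = {"all": mean_absolute_error(targets, predictions)}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        results[group] = mean_absolute_error(
            [targets[i] for i in idx], [predictions[i] for i in idx]
        )
    return results


# Toy values only, not outputs of the models in the table.
heights_cm = [180.0, 165.0, 175.0, 160.0]
predicted_cm = [176.0, 168.0, 177.0, 161.0]
genders = ["male", "female", "male", "female"]

scores = mae_by_group(heights_cm, predicted_cm, genders)
print(scores)  # {'all': 2.5, 'female': 2.0, 'male': 3.0}
```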

Model_1:

Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | Height Estimation
Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense layer architecture, and is trained with a Mean Squared Error (MSE) loss and the Adam optimizer. Training stops early after 10 epochs without improvement in validation loss (patience of 10). MSE and MAE on the test set are used to measure height-estimation performance. The batch size is 32.
Model Architecture:
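The early-stopping rule in the Model_1 description (a patience of 10 epochs on validation loss) can be sketched without any framework as below. The class name and the toy loss values are hypothetical, and the example uses a patience of 3 only to keep the trace short; the project may well use a library callback for this instead.

```python
class EarlyStopping:
    """Stop training after `patience` epochs without a new best validation loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation losses; patience=3 keeps the example short.
stopper = EarlyStopping(patience=3)
val_losses = [30.0, 28.5, 28.7, 28.6, 28.9, 27.0]
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)  # 4 28.5
```

Training stops at epoch 4 (zero-based) because epochs 2, 3, and 4 all failed to beat the best loss of 28.5, so the final value of 27.0 is never seen.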

Model_2:

Model: LSTM + Cross_Attention + Center & MSE_Loss | FBank Features | Height Estimation
Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense layer architecture, and is trained with a Mean Squared Error (MSE) loss combined with a Center Loss, which is applied to the embeddings obtained right after the cross_attention layer. Center Loss is given one-third of the weight in the total loss and MSE two-thirds. Adam is used as the optimizer. For Center Loss, the height labels are quantized into classes of 5 cm (i.e. 140-145 cm is class_0, 145-150 cm is class_1, and so on, for a total of 13 classes). Training stops early after 10 epochs without improvement in validation loss (patience of 10). MSE and MAE on the test set are used to measure height-estimation performance. The batch size is 32.
Model Architecture:
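The label quantization and loss weighting in the Model_2 description can be sketched as follows. This is a simplified illustration under stated assumptions: heights outside 140-205 cm are clipped to the nearest class, the centers are fixed rather than learned, the center loss is taken as the mean squared distance to the class center, and all function names are hypothetical rather than the project's own.

```python
def height_to_class(height_cm, min_cm=140.0, bin_cm=5.0, num_classes=13):
    """Map a height in cm to a 5 cm class: 140-145 -> 0, 145-150 -> 1, ..."""
    index = int((height_cm - min_cm) // bin_cm)
    # Assumption: out-of-range heights are clipped to the first or last class.
    return max(0, min(num_classes - 1, index))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_loss(embeddings, class_ids, centers):
    """Mean squared distance between each embedding and its class center."""
    total = sum(
        squared_distance(e, centers[c]) for e, c in zip(embeddings, class_ids)
    )
    return total / len(embeddings)


def combined_loss(mse, aux, mse_weight=2.0 / 3.0, aux_weight=1.0 / 3.0):
    """Weighted total: two-thirds MSE, one-third auxiliary (center) loss."""
    return mse_weight * mse + aux_weight * aux


# Toy example with 2-dimensional embeddings and fixed centers.
classes = [height_to_class(h) for h in (142.0, 178.0)]
centers = {0: [0.0, 0.0], 7: [1.0, 1.0]}
aux = center_loss([[1.0, 0.0], [1.0, 1.0]], classes, centers)
print(classes, aux, combined_loss(3.0, aux))
```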

Model_3:

Model: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense layer architecture, and is trained with a Mean Squared Error (MSE) loss combined with a Triplet Loss, which is applied to the embeddings obtained right after the cross_attention layer. Triplet Loss is given one-third of the weight in the total loss and MSE two-thirds. Adam is used as the optimizer. For Triplet Loss, the height labels are quantized into classes of 5 cm (i.e. 140-145 cm is class_0, 145-150 cm is class_1, and so on, for a total of 13 classes). Training stops early after 10 epochs without improvement in validation loss (patience of 10). MSE and MAE on the test set are used to measure height-estimation performance. The batch size is 32.
Model Architecture:
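For the Triplet Loss in the Model_3 description, a minimal sketch of the standard margin-based form and the one-third / two-thirds weighting is given below. The margin of 1.0, the Euclidean distance, and the function names are assumptions; the description does not state which margin or triplet-selection strategy the project uses.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The positive shares the anchor's height class; the negative does not.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-dimensional embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # same height class as the anchor
far_negative = [3.0, 4.0]   # already farther than the margin: loss is 0
near_negative = [0.0, 1.5]  # too close to the anchor: loss is positive

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)

# Weighted total as described: two-thirds MSE, one-third triplet loss.
mse = 9.0  # toy regression loss
total = (2.0 / 3.0) * mse + (1.0 / 3.0) * hard
print(easy, hard, total)
```

The loss is zero once the negative is at least one margin farther from the anchor than the positive, so only triplets that violate the margin contribute to the gradient.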

Model_4:

Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
Model Description: The model uses FBank features for joint age and height estimation, with a standard LSTM + Cross_Attention + Dense layer architecture, and is trained with a Mean Squared Error (MSE) loss and the Adam optimizer, with height_loss given twice the weight of age_loss. Training stops early after 10 epochs without improvement in validation loss (patience of 10). MSE and MAE on the test set are used to measure height-estimation performance. The batch size is 32.
Model Architecture:
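The multi-task weighting in the Model_4 description (height_loss weighted twice as much as age_loss) can be sketched as below. The unnormalized 2:1 weights and the function names are assumptions; the description gives only the ratio, so the project may scale the two terms differently.

```python
def mse(targets, predictions):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def multitask_loss(height_true, height_pred, age_true, age_pred,
                   height_weight=2.0, age_weight=1.0):
    """Weighted sum of the height and age MSE terms (2:1 by default)."""
    height_loss = mse(height_true, height_pred)
    age_loss = mse(age_true, age_pred)
    return height_weight * height_loss + age_weight * age_loss


# Toy batch of two speakers.
total = multitask_loss(
    height_true=[170.0, 180.0], height_pred=[172.0, 176.0],  # MSE = 10.0
    age_true=[30.0, 40.0], age_pred=[33.0, 39.0],            # MSE = 5.0
)
print(total)  # 25.0
```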
