tarun360 / ISCAP_Height_Estimation

ISCAP - Identifying Speaker Characteristics through Audio Profiling - HEIGHT ESTIMATION

Introduction

Installation:

Setting up environment

  1. Install Kaldi
git clone -b 5.4 https://github.com/kaldi-asr/kaldi.git kaldi
cd kaldi/tools/; make; cd ../src; ./configure; make
  2. Install ESPnet
git clone -b v.0.9.7 https://github.com/espnet/espnet.git
cd espnet/tools/        # change to the tools folder
ln -s {kaldi_root}      # create a link to Kaldi, e.g. ln -s /home/theanhtran/kaldi/
  3. Set up the Conda environment
./setup_anaconda.sh anaconda espnet 3.7.9   # create an Anaconda environment named espnet with Python 3.7.9
make TH_VERSION=1.8.0 CUDA_VERSION=10.2     # install PyTorch 1.8.0 with CUDA 10.2 support
. ./activate_python.sh; python3 check_install.py  # check the installation
conda install torchvision==0.9.0 torchaudio==0.8.0 -c pytorch
  4. Install PyTorch Lightning
conda install pytorch-lightning -c conda-forge

Download the project

  1. Clone the project from GitHub into your workspace
git clone https://github.com/TonnyTran/ISCAP_Height_Estimation.git
cd ISCAP_Height_Estimation
ln -s {kaldi_root}/egs/wsj/s5/utils     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/utils
ln -s {kaldi_root}/egs/wsj/s5/steps     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/steps
  2. Point to your ESPnet installation

Open the ISCAP_Height_Estimation/path.sh file and set MAIN_ROOT to your ESPnet directory, e.g. MAIN_ROOT=/home/theanhtran/espnet

How to run Height Estimation systems

  1. Data preparation step
bash prepare_TIMIT_data.sh

This step downloads the TIMIT dataset as a .zip file, extracts it, and generates features in Kaldi format.

  2. Run the program
bash run_height_estimation.sh program     # replace program with 1, 2, 3, or 4 to select the model to train
  • program=1 => Model 1: LSTM + Cross_Attention + MSE_Loss | FBank Features | Height Estimation
  • program=2 => Model 2: LSTM + Cross_Attention + Center & MSE_Loss | FBank Features | Height Estimation
  • program=3 => Model 3: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
  • program=4 => Model 4: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
  3. Test the trained model
bash test_height_model.sh program     # replace program with 1, 2, 3, or 4

The program number selects which trained model is tested.
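If you prefer to launch the scripts from Python, the program number can be validated before the shell script is called. The following is a minimal sketch, not part of the repository: the `PROGRAMS` table and the `build_command` helper are hypothetical names, and the function only builds the command list without executing it.

```python
# Hypothetical helper: maps the documented program numbers to short labels
# and builds the shell command for a given script.
PROGRAMS = {
    1: "Model 1: LSTM + Cross_Attention, height estimation",
    2: "Model 2: LSTM + Cross_Attention + Center Loss, height estimation",
    3: "Model 3: LSTM + Cross_Attention + Triplet Loss, height estimation",
    4: "Model 4: LSTM + Cross_Attention, multitask (age & height)",
}


def build_command(script, program):
    """Return the argument list for `bash <script> <program>`."""
    if program not in PROGRAMS:
        raise ValueError(f"program must be one of {sorted(PROGRAMS)}, got {program!r}")
    return ["bash", script, str(program)]


train_cmd = build_command("run_height_estimation.sh", 2)
test_cmd = build_command("test_height_model.sh", 2)
print(train_cmd)
print(test_cmd)
```

The returned list can be passed to `subprocess.run(train_cmd, check=True)` from the project directory.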

Other instructions:

  • You may change hyper-parameters such as batch_size, max_epochs, early_stopping_patience, learning_rate, num_layers, and loss_criterion in the run.py file of each model.
  • If you are not using a GPU, set the gpu hyper-parameter of the trainer (in the run.py files) to 0.

Models & Results:

This section summarizes all the models for height estimation on the TIMIT dataset.
The models mainly use the following feature extraction:

  • Filter Bank: 80 FBank + 3 Pitch + 1 Binary_Gender (Features_Dimension: 83)

In addition, we apply three normalization and augmentation steps to the data:

  • CMVN: Cepstral mean and variance normalization of the FBank features.
  • Speed Perturbation: Triples the training data by adding 0.9x and 1.1x speed-perturbed copies.
  • Spectral Augmentation: SpecAugment randomly masks 15%-25% of the input for better generalization and robustness.
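To make the CMVN step concrete, here is a minimal pure-Python sketch of mean and variance normalization over a single utterance. It is an illustration only: the repository computes its features with Kaldi, and whether the statistics are gathered per utterance, per speaker, or globally is an assumption here. The function name `apply_cmvn` and the toy frames are hypothetical.

```python
import math


def apply_cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of frames, each a list of floats of equal length.
    Statistics are computed over this one utterance (an assumption).
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # eps avoids division by zero for constant dimensions
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]


# Toy utterance: 3 frames, 2 feature dimensions (real input would be 83-dimensional)
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = apply_cmvn(frames)
print(normalized)
```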



Results:

| S. No. | Model | Loss / Method | Height MAE All (cm) | Height MAE Male (cm) | Height MAE Female (cm) |
|---|---|---|---|---|---|
| 1. | LSTM + Cross_att | Mean Squared Error (MSE) | 5.38 | 5.46 | 5.22 |
| 2. | LSTM + Cross_att | MSE + Center Loss | 5.25 | 5.26 | 5.23 |
| 3. | LSTM + Cross_att | MSE + Triplet Loss | 5.23 | 5.08 | 5.31 |
| 4. | LSTM + Cross_att | MSE Age + Height | 5.36 | 5.40 | 5.26 |
| - | Shareef (2020) | Comb3 (Fstats + formant + harmonic features (amplitude + frequency locations)) | - | 5.2 | 4.8 |
| - | Singh (2016) | Random Forest | - | 5.0 | 5.0 |
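The three MAE columns report the mean absolute error over all test speakers and over each gender separately. A minimal sketch of that computation follows; the function names and the toy heights are made up for illustration and are not taken from the repository or its results.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mae_by_group(targets, predictions, genders):
    """Return MAE over all samples and per gender label."""
    results = {"All": mean_absolute_error(targets, predictions)}
    for g in sorted(set(genders)):
        idx = [i for i, x in enumerate(genders) if x == g]
        results[g] = mean_absolute_error(
            [targets[i] for i in idx], [predictions[i] for i in idx]
        )
    return results


# Toy values in cm (illustrative only)
targets = [180.0, 175.0, 165.0, 160.0]
predictions = [176.0, 177.0, 168.0, 159.0]
genders = ["M", "M", "F", "F"]

scores = mae_by_group(targets, predictions, genders)
print(scores)
```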



Model_1:

  • Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | Height Estimation

  • Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense Layer architecture, and is trained using a Mean Squared Error (MSE) loss and the Adam optimizer. Training stops early after 10 epochs without improvement in Validation Loss (patience of 10). Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size is 32.

  • Model Architecture:
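As an aside on the training setup in the Model_1 description (early stopping on Validation Loss with a patience of 10 epochs), the stopping rule can be sketched as below. This is an illustrative sketch under assumptions, not the repository's code: the function name `early_stopping_epoch` and the toy loss curve are hypothetical, and "improvement" is taken to mean any strict decrease of the validation loss.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Toy curve: best loss at epoch 1, then 10 epochs without improvement
losses = [5.0, 4.0] + [4.5] * 10
stop_epoch = early_stopping_epoch(losses, patience=10)
print(stop_epoch)
```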

Model_2:

  • Model: LSTM + Cross_Attention + Center & MSE_Loss | FBank Features | Height Estimation

  • Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense Layer architecture, and is trained using a Mean Squared Error (MSE) loss combined with a Center Loss that is applied to the embeddings obtained right after the cross_attention layer. Center Loss is given one-third of the weight in the total loss and MSE two-thirds. Adam is used as the optimizer. For the Center Loss, the height labels are quantized into 5 cm classes (i.e. height labels from 140-145cm in class_0, 145-150cm in class_1 and so on, giving a total of 13 classes). Training stops early after 10 epochs without improvement in Validation Loss (patience of 10). Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size is 32 samples.

  • Model Architecture:
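As an aside on the loss in the Model_2 description, the sketch below shows the 5 cm height quantization and a two-thirds / one-third weighting of MSE and Center Loss. It is a simplified illustration under assumptions: the class centers are fixed here (in the usual center-loss formulation they are learned), boundary heights such as 145 cm are assigned to the upper class, out-of-range heights are clipped, and all function names and toy values are hypothetical.

```python
def height_to_class(height_cm, low=140.0, bin_width=5.0, num_classes=13):
    """Map a height in cm to a 5 cm class index (140-145 -> 0, 145-150 -> 1, ...)."""
    index = int((height_cm - low) // bin_width)
    return max(0, min(num_classes - 1, index))  # clip out-of-range heights


def center_loss(embeddings, labels, centers):
    """Half the squared distance of each embedding to its class center, averaged."""
    total = 0.0
    for emb, label in zip(embeddings, labels):
        total += 0.5 * sum((e - c) ** 2 for e, c in zip(emb, centers[label]))
    return total / len(embeddings)


def combined_loss(mse_value, center_value, mse_weight=2 / 3, center_weight=1 / 3):
    """Weighted sum: two-thirds MSE, one-third Center Loss."""
    return mse_weight * mse_value + center_weight * center_value


# Toy 2-dimensional embeddings and fixed centers (illustrative only)
heights = [142.0, 147.5]
labels = [height_to_class(h) for h in heights]           # [0, 1]
embeddings = [[1.0, 2.0], [3.0, 4.0]]
centers = {0: [1.0, 1.0], 1: [3.0, 2.0]}

c_loss = center_loss(embeddings, labels, centers)
total = combined_loss(mse_value=9.0, center_value=c_loss)
print(labels, c_loss, total)
```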



Model_3:

  • Model: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation

  • Model Description: The model uses FBank features for height estimation only, with a standard LSTM + Cross_Attention + Dense Layer architecture, and is trained using a Mean Squared Error (MSE) loss combined with a Triplet Loss that is applied to the embeddings obtained right after the cross_attention layer. Triplet Loss is given one-third of the weight in the total loss and MSE two-thirds. Adam is used as the optimizer. For the Triplet Loss, the height labels are quantized into 5 cm classes (i.e. height labels from 140-145cm in class_0, 145-150cm in class_1 and so on, giving a total of 13 classes). Training stops early after 10 epochs without improvement in Validation Loss (patience of 10). Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size is 32 samples.

  • Model Architecture:
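As an aside on the loss in the Model_3 description, the sketch below shows a standard margin-based triplet loss on embeddings and its one-third weighting against MSE. It is an illustration under assumptions: the Euclidean distance, the margin of 1.0, the way triplets are chosen (positive from the same height class as the anchor, negative from a different one), and all names and toy values are hypothetical and may differ from the repository.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the positive closer to the anchor than the negative by at least `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def combined_loss(mse_value, triplet_value, mse_weight=2 / 3, triplet_weight=1 / 3):
    """Weighted sum: two-thirds MSE, one-third Triplet Loss."""
    return mse_weight * mse_value + triplet_weight * triplet_value


anchor = [0.0, 0.0]
same_class = [3.0, 4.0]    # distance 5 from the anchor
other_class = [6.0, 8.0]   # distance 10 from the anchor

easy = triplet_loss(anchor, same_class, other_class)   # already separated by the margin
hard = triplet_loss(anchor, other_class, same_class)   # roles swapped: violates the margin
total = combined_loss(mse_value=9.0, triplet_value=hard)
print(easy, hard, total)
```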



Model_4:

  • Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)

  • Model Description: The model uses FBank features for joint age and height estimation, with a standard LSTM + Cross_Attention + Dense Layer architecture, and is trained using a Mean Squared Error (MSE) loss and the Adam optimizer, with height_loss given twice the weight of age_loss. Training stops early after 10 epochs without improvement in Validation Loss (patience of 10). Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size is 32 samples.

  • Model Architecture:
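As an aside on the loss in the Model_4 description, the sketch below combines a height MSE and an age MSE with the height term weighted twice as much as the age term. It is an illustration under assumptions: the weights 2.0 and 1.0 are one way to realize "twice the weight" (normalized weights of two-thirds and one-third would give the same ratio), targets are used unscaled, and all names and toy values are hypothetical.

```python
def mse(targets, predictions):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def multitask_loss(height_targets, height_preds, age_targets, age_preds,
                   height_weight=2.0, age_weight=1.0):
    """Weighted sum of height and age MSE, height counted twice as much as age."""
    height_loss = mse(height_targets, height_preds)
    age_loss = mse(age_targets, age_preds)
    return height_weight * height_loss + age_weight * age_loss


# Toy batch (illustrative only): heights in cm, ages in years
total = multitask_loss(
    height_targets=[170.0, 180.0], height_preds=[172.0, 177.0],
    age_targets=[30.0, 40.0], age_preds=[33.0, 39.0],
)
print(total)
```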


