cantabile-kwok / VAD-LSTM

Perform Voice Activity Detection with LSTM

VAD Project Task2: VAD Based on BiLSTM and MFCC

cantabile-kwok

Code Environment & Required Packages

Code tested on the following environments:

  • Linux 3.10.0
  • CUDA 10.1
  • Python 3.6.13

Required Python Packages:

  • PyTorch 1.4.0
  • NumPy 1.19.2
  • scikit-learn 0.24.2 (imported as sklearn)
  • spafe 0.1.0
  • librosa 0.8.0
  • matplotlib 3.3.4

Note: spafe is a less common package. Install it with pip install spafe (it requires SciPy and librosa). See https://spafe.readthedocs.io/en/latest/ for its documentation.


Guidelines for Running the Code

Similar to ESPNet recipes, we have explicit stages for running the code.

  • stage 0: Data preparation and feature extraction. After this stage, the features are stored in feats/
  • stage 1: Model training. After this stage, the models are stored as epoch_{}.pth and the ROC curve is stored in fig/
  • stage 2: Model evaluation.
  • stage 3: Model prediction (generates test_label_task2.txt)

For example, to start from raw wave files, use

python main.py --stage 0

Note that this stage only has to be run once.
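The stage mechanism can be sketched as follows. This is a minimal illustration, not the code in main.py: it assumes that passing --stage N runs stage N and every later stage in order, as ESPNet recipes do, and the names STAGES, stages_to_run, and parse_args are hypothetical.

```python
import argparse

# Stage numbers and descriptions taken from the list above.
STAGES = {
    0: "data preparation and feature extraction",
    1: "model training",
    2: "model evaluation",
    3: "model prediction",
}


def stages_to_run(start_stage):
    """Return the stage numbers executed when starting at start_stage.

    Assumption: every stage from start_stage onward is run in order.
    """
    return [s for s in sorted(STAGES) if s >= start_stage]


def parse_args(argv=None):
    """Parse a --stage option (hypothetical stand-in for main.py's parser)."""
    parser = argparse.ArgumentParser(description="Stage runner sketch")
    parser.add_argument("--stage", type=int, default=0, choices=sorted(STAGES))
    return parser.parse_args(argv)


def describe(start_stage):
    """Return one human-readable line per stage that would run."""
    return [f"stage {s}: {STAGES[s]}" for s in stages_to_run(start_stage)]
```

Under this assumption, describe(1) lists training, evaluation, and prediction, which is why stage 0 can be skipped once the features already exist in feats/.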

Other arguments include:

  • --fs: sampling rate
  • --win_len, --win_hop: frame size and frame shift, in seconds
  • --hidden_size, --num_layers: LSTM hyperparameters
  • --lr: learning rate
  • --num_epoch: number of epochs
  • --report_interval: number of batches between reports. E.g. if set to 50, the training loss is printed to the console every 50 batches.
  • --L: adjustment length used when calculating SBA and EBA in the VACC evaluation, following Sibo Tong et al., Evaluating VAD for Automatic Speech Recognition, ICSP2014.
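Because --win_len and --win_hop are given in seconds, they have to be converted to sample counts using --fs before framing. The sketch below shows that arithmetic. The function names are hypothetical, and the values in the usage note (16 kHz, 25 ms window, 10 ms hop) are illustrative assumptions, not the defaults of this repository. The frame count assumes no padding, so the actual feature extractor may produce a slightly different number of frames.

```python
def frame_params(fs, win_len, win_hop):
    """Convert frame size and shift from seconds to samples.

    fs      -- sampling rate in Hz
    win_len -- frame size in seconds
    win_hop -- frame shift in seconds
    """
    frame_size = int(round(win_len * fs))
    frame_shift = int(round(win_hop * fs))
    return frame_size, frame_shift


def num_frames(num_samples, frame_size, frame_shift):
    """Number of full frames that fit in a signal, assuming no padding."""
    if num_samples < frame_size:
        return 0
    return 1 + (num_samples - frame_size) // frame_shift
```

For example, with fs=16000, win_len=0.025, and win_hop=0.010, frame_params returns (400, 160), and a one-second signal of 16000 samples yields 98 full frames, i.e. 98 frame-level VAD decisions.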

Note: The number of epochs is set to 2 by default. After training, the model from the last epoch is used for evaluation and prediction. To use a different epoch, modify line 357.
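Selecting the last epoch can be sketched without touching the training code, given the epoch_{}.pth naming used for saved models. The helper below is a hypothetical illustration, not the logic in main.py: it assumes the checkpoints sit in one directory and compares epoch numbers numerically, so epoch_10.pth ranks above epoch_9.pth.

```python
import re
from pathlib import Path

# Matches file names such as epoch_0.pth or epoch_12.pth.
_CHECKPOINT_PATTERN = re.compile(r"epoch_(\d+)\.pth")


def latest_checkpoint(directory="."):
    """Return the Path of the checkpoint with the highest epoch number.

    Returns None if the directory contains no epoch_{}.pth file.
    """
    best_epoch = -1
    best_path = None
    for path in Path(directory).iterdir():
        match = _CHECKPOINT_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # Not a checkpoint file.
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch = epoch
            best_path = path
    return best_path
```

The returned path could then be passed to torch.load in the evaluation or prediction stage. Comparing the parsed integers rather than sorting the file names avoids the lexicographic ordering problem once training runs for ten or more epochs.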


Project Structure

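vad (root)
├──data
│ ├──dev_label.txt, train_label.txt
│ ├──dev_label_dict.json (dictionary mapping each dev audio file to its labels)
│ ├──train_label_dict.json (same as above, for the training set)
│ └──test_label_task2.txt (prediction results)
├──feats
│ ├──dev
│ ├──test
│ └──train
├──fig (visualization results, including predicted vs. ground-truth labels for some samples and the ROC curve)
├──latex, wavs (folders)
├──__init__.py, evaluate.py, vad_utils.py
├──main.py (main entry point)
├──epoch_0.pth, epoch_1.pth, etc. (saved models)
├──script (can be submitted with sbatch)
└──README.md, README.html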
