End-to-End (E2E) Spoken Language Understanding (SLU) in PyTorch
CDS 2nd-year capstone project: build an E2E speech-to-intent (S2I) model with the application of transfer learning
This repo contains modified PyTorch code adapted from Loren Lugosch. For more information, please refer to his papers "Speech Model Pre-training for End-to-End Spoken Language Understanding" and "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models". The code has been modified for our own purposes; please refer to the final report in this repo for details on the modifications.
If you have any questions about this code or problems getting it to work, please send me an email at <cora.jung@nyu.edu>.
Dependencies
PyTorch, torchaudio, numpy, soundfile, pandas, tqdm, textgrid.py
Pre-Training
First, change the asr_path and/or slu_path in the config file (e.g. experiments/unfreeze_word_pretrain.cfg, or whichever experiment you want to run) to point to where the LibriSpeech data and/or Fluent Speech Commands data are stored on your computer.
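For instance, the relevant entries might look like the following (the paths and the section name are placeholders; check them against your actual .cfg file):

```ini
; hypothetical fragment of experiments/unfreeze_word_pretrain.cfg
[paths]
asr_path = /data/LibriSpeech
slu_path = /data/fluent_speech_commands_dataset
```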
SLU training: To train the model on an SLU dataset, run the following command:
python main_pretrain.py --train --config_path=<path to .cfg>
Now the best model_state.pth should be saved in experiments/unfreeze_word_pretrain/training/model_state.pth
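Once training finishes, the saved checkpoint can be reloaded in plain PyTorch. A minimal sketch, using a tiny stand-in module rather than the repo's actual model class:

```python
import torch
import torch.nn as nn

# Stand-in module; the real model class lives in this repo's code.
model = nn.Linear(4, 2)

# The training script saves the best weights as model_state.pth:
torch.save(model.state_dict(), "model_state.pth")

# Later (e.g. before fine-tuning), the weights can be restored:
state = torch.load("model_state.pth", map_location="cpu")
model.load_state_dict(state)
```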
Fine-Tuning
We use the pre-trained model_state.pth for fine-tuning. After changing the asr_path and/or slu_path in the config file (e.g. experiments/unfreeze_word_finetune.cfg, or whichever experiment you want to run) to point to where the speech data for fine-tuning are stored on your computer, run the following command:
python main_finetune.py --train --restart --config_path=<path to .cfg> --model_path=<path to .pth>
- model_path should point to the directory containing the saved model_state.pth; in this example, experiments/unfreeze_word_pretrain/training/
- config_path should point to the config file that describes the fine-tuning model (in this example, experiments/unfreeze_word_finetune.cfg would do the job; don't forget to change the asr_path and/or slu_path)
ASR pre-training: Note: the experiment folders in this repo already include a pre-trained LibriSpeech model that you can use. LibriSpeech is pretty big (>100 GB uncompressed), so skip this step unless you want to re-run the pre-training with different hyperparameters. If you do, you will first need to download our LibriSpeech alignments here, put them in a folder called "text", and put the LibriSpeech audio in a folder called "audio". To pre-train the model on LibriSpeech, run the following command:
python main_pretrain.py --pretrain --config_path=<path to .cfg>
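With the layout described above, the LibriSpeech data folder would look roughly like this (a sketch; the contents of each subfolder follow the LibriSpeech release and the alignment download):

```
<librispeech root>/
    audio/   <- LibriSpeech audio files
    text/    <- downloaded alignment files
```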
Inference
You can perform inference with a trained SLU model by running inference.py:
python inference.py
The test.wav file included with this repo has a recording of Loren saying "Hey computer, could you turn the lights on in the kitchen please?", so the inferred intent should be {"activate", "lights", "kitchen"}.
Citation
- Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, and Yoshua Bengio, "Speech Model Pre-training for End-to-End Spoken Language Understanding", Interspeech 2019.
- Loren Lugosch, Brett Meyer, Derek Nowrouzezahrai, and Mirco Ravanelli, "Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models", ICASSP 2020.