Speech De-warping

PyTorch implementation of our paper "Unsupervised Pre-training for Data-Efficient Text-to-Speech on Low Resource Languages", ICASSP 2023.

Abstract: Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping
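To illustrate the warping step described in the abstract, below is a minimal sketch of segment-wise random time warping of a mel-spectrogram. This is not the authors' implementation; the equal-width segmentation, the scale range, and the use of linear interpolation are assumptions made purely for illustration.

import torch
import torch.nn.functional as F

def random_time_warp(mel, num_segments=8, min_scale=0.5, max_scale=1.5):
    # Illustrative sketch only (not the paper's exact procedure).
    # mel: (n_mels, T) mel-spectrogram. Split the time axis into equal
    # segments and stretch or squeeze each one by a random factor.
    n_mels, T = mel.shape
    bounds = torch.linspace(0, T, num_segments + 1).long()
    warped = []
    for i in range(num_segments):
        seg = mel[:, bounds[i]:bounds[i + 1]]  # (n_mels, t_i)
        if seg.shape[1] == 0:
            continue
        scale = float(torch.empty(1).uniform_(min_scale, max_scale))
        new_len = max(1, round(seg.shape[1] * scale))
        # F.interpolate expects (N, C, L) input for 1-D linear interpolation
        seg = F.interpolate(seg.unsqueeze(0), size=new_len,
                            mode="linear", align_corners=False)
        warped.append(seg.squeeze(0))
    return torch.cat(warped, dim=1)  # warped mel with a new time length T'

During pre-training, the model receives the warped spectrogram as input and is trained to reconstruct the original one, which is what encourages it to learn the temporal assignment between input and output sequences.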

The code is based on the Tacotron 2 repository.

Installation

We tested our code on Ubuntu 20.04 with CUDA 11.1 and Python 3.7.11, using A6000 GPUs.

conda create -n dewarp python=3.7.11
conda activate dewarp
pip install -r requirements.txt
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
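To confirm that the CUDA-enabled PyTorch build was installed correctly, a quick check along these lines can help:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"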

Dataset

For unsupervised pre-training, we use the speech data of the 'train-clean-100' subset of the LibriTTS dataset. To fine-tune the model with transcribed speech, we use the KSS dataset for Korean and the LJSpeech dataset for English. The filelists of the datasets can be found in ./filelists.

For custom datasets, write each line of the filelist in the following format (see the example after the list).

  • Pre-training: {Audio file path}|{Audio duration in seconds}
  • Fine-tuning: {Audio file path}|{Text}
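For example, a pre-training entry and a fine-tuning entry could look like this (the audio paths and the transcript are hypothetical placeholders):

/data/LibriTTS/train-clean-100/103/1241/103_1241_000000_000001.wav|5.23
/data/mycorpus/wavs/utt_0001.wav|This is an example transcript.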

Training

For each training scheme, refer to the explanations of the hyperparameter options in ./hparams.py and set them accordingly. Example configuration files for each scheme are provided in ./filelists/example_hparams.

# Unsupervised pre-training with speech data (Speech de-warping)
python train.py -o {Output folder to save checkpoints and logs}

# Fine-tuning with transcribed speech data
python train.py -o {Output folder to save checkpoints and logs} -c {Path of pre-trained checkpoint} --warm_start
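A concrete sequence of runs might look like the following (the directory and checkpoint names are hypothetical; actual checkpoint filenames depend on the save settings in ./hparams.py):

# Hypothetical example: pre-train, then warm-start fine-tuning from the saved checkpoint
python train.py -o outputs/pretrain_dewarp
python train.py -o outputs/finetune_kss -c outputs/pretrain_dewarp/checkpoint_100000 --warm_start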

Inference

After fine-tuning, the checkpoint can be used for TTS inference.

python inference.py -c {Path to fine-tuned checkpoint} -o {Output folder to save audio results} -t {Filelist containing the texts to synthesize}
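For instance, a hypothetical invocation (all paths below are placeholders) might be:

python inference.py -c outputs/finetune_kss/checkpoint_100000 -o outputs/tts_samples -t filelists/test_sentences.txt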

Citation

@inproceedings{park2023icassp,
  title={Unsupervised Pre-training for Data-Efficient Text-to-Speech on Low Resource Languages},
  author={Park, Seongyeon and Song, Myungseo and Kim, Bohyung and Oh, Tae-Hyun},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  organization={IEEE}
}
