You can build an environment with Docker or Conda.
If you don't have Docker installed, please follow the links to find installation instructions for Ubuntu, Mac or Windows.
Build the docker image:

```bash
docker build -t emospeech .
```

Run the docker image:

```bash
bash run_docker.sh
```
If you don't have Conda installed, please find the installation instructions for your OS here.
```bash
conda create -n etts python=3.10
conda activate etts
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
```
If you have a different version of CUDA on your machine, you can find the applicable PyTorch installation command here.
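To confirm which build of PyTorch ended up in the environment (and that it can see your GPU), a quick check using standard PyTorch attributes:

```python
import torch

# Report the installed PyTorch build and GPU visibility.
print("torch version:", torch.__version__)        # e.g. ends with +cu117
print("built for CUDA:", torch.version.cuda)      # CUDA toolkit the build targets
print("GPU available:", torch.cuda.is_available())
```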
We used data from 10 English speakers of the ESD dataset. To download all `.wav` and `.txt` files along with the `.TextGrid` files created using MFA:
```bash
bash download_data.sh
```
To train a model, we need precomputed durations, energy, pitch, and eGeMAPS features. From the `src` directory, run:

```bash
python -m src.preprocess.preprocess
```
This is how your data folder should look:
```
.
├── data
│   ├── ssw_esd
│   ├── test_ids.txt
│   ├── val_ids.txt
│   └── preprocessed
│       ├── duration
│       ├── egemap
│       ├── energy
│       ├── mel
│       ├── phones.json
│       ├── pitch
│       ├── stats.json
│       ├── test.txt
│       ├── train.txt
│       ├── trimmed_wav
│       └── val.txt
```
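After preprocessing, you can sanity-check that every expected directory and file was produced. A minimal sketch (directory and file names taken from the layout above; the check itself is just illustrative):

```python
from pathlib import Path

# Illustrative sanity check: confirm preprocessing populated every expected
# feature directory and metadata file from the layout above.
root = Path("data/preprocessed")
feature_dirs = ["duration", "egemap", "energy", "mel", "pitch", "trimmed_wav"]
meta_files = ["phones.json", "stats.json", "train.txt", "val.txt", "test.txt"]

for d in feature_dirs:
    files = list((root / d).iterdir())
    print(f"{d}: {len(files)} files")

for name in meta_files:
    assert (root / name).exists(), f"missing {name} -- rerun preprocessing"
```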
- Configure arguments in `config/config.py`.
- Run `python -m src.scripts.train`.
Testing is implemented on the test subset of the ESD dataset. To synthesize audio and compute neural MOS (NISQA TTS):
- Configure arguments in `config/config.py` under the `Inference` section.
- Run `python -m src.scripts.test`.
You can find the NISQA TTS scores for original, reconstructed, and generated audio in `test.log`.
EmoSpeech is trained on phoneme sequences. Supported phones can be found in `data/preprocessed/phones.json`. This repository was created for academic research and doesn't support automatic grapheme-to-phoneme conversion. However, if you would like to synthesize an arbitrary sentence with emotion conditioning, you can:
1. Generate a phoneme sequence from graphemes with MFA.

   1.1 Follow the installation guide.

   1.2 Download the English g2p model:

   ```bash
   mfa model download g2p english_us_arpa
   ```

   1.3 Generate `phoneme.txt` from `graphemes.txt`:

   ```bash
   mfa g2p graphemes.txt english_us_arpa phoneme.txt
   ```
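   MFA writes one word per line with its pronunciation, tab-separated. A minimal sketch (assuming that default output format) to flatten `phoneme.txt` into the space-separated string that `-sq` expects:

   ```python
   # Hypothetical helper: flatten MFA g2p output (word<TAB>phones per line,
   # assuming the default format) into one phoneme string for -sq.
   phones = []
   with open("phoneme.txt", encoding="utf-8") as f:
       for line in f:
           word, pron = line.rstrip("\n").split("\t", 1)
           phones.extend(pron.split())

   print(" ".join(phones))  # paste this as the -sq value
   ```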
2. Run `python -m src.scripts.inference`, specifying the arguments:
| Argument | Meaning | Possible values | Default value |
|---|---|---|---|
| `-sq` | Phoneme sequence to synthesize | Find in `data/phones.json`. | Not set, required argument. |
| `-emo` | Id of the desired voice emotion | 0: neutral, 1: angry, 2: happy, 3: sad, 4: surprise. | 1 |
| `-sp` | Id of the speaker voice | From 1 to 10, corresponding to 0011 ... 0020 in the original ESD notation. | 5 |
| `-p` | Path where to save the synthesized audio | Any path with a `.wav` extension. | `generation_from_phoneme_sequence.wav` |
For example:

```bash
python -m src.scripts.inference --sq "S P IY2 K ER1 F AY1 V T AO1 K IH0 NG W IH0 TH AE1 NG G R IY0 IH0 M OW0 SH AH0 N"
```
If the resulting file is not synthesized, check `inference.log` for OOV phones.
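You can also check a sequence yourself against the supported inventory before running inference. A minimal sketch, assuming `data/preprocessed/phones.json` holds the phone symbols (as keys of a mapping or as a plain list; `set()` covers both):

```python
import json

# Hypothetical OOV check: flag phones absent from the supported inventory.
with open("data/preprocessed/phones.json", encoding="utf-8") as f:
    inventory = set(json.load(f))  # keys if a mapping, items if a list

sequence = "S P IY2 K ER1 F AY1 V T AO1 K IH0 NG"
oov = [p for p in sequence.split() if p not in inventory]
print("OOV phones:", oov if oov else "none")
```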
- FastSpeech 2 - PyTorch Implementation
- iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
- Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion
- NISQA: Speech Quality and Naturalness Assessment
- Montreal Forced Aligner Models
- Modified VocGAN
- AdaSpeech