This repository provides a guide to fine-tuning a text-to-speech (TTS) model on a dataset whose speech samples carry emotion labels. Fine-tuning a TTS model with emotion labels enables it to generate emotionally expressive speech.
The IEMOCAP dataset, used for fine-tuning, consists of speech samples paired with corresponding emotion labels. Each sample includes a text prompt and the corresponding audio waveform, along with an emotion label indicating the intended emotional expression of the speech.
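As a rough illustration of what each sample looks like, the sketch below loads such a dataset with 🤗 `datasets` and resamples the audio to the 16 kHz rate SpeechT5 expects. The dataset identifier and the column names (`text`, `emotion`, `audio`) are assumptions for illustration; substitute the actual IEMOCAP copy and schema you are working with.

```python
from datasets import load_dataset, Audio

# Hypothetical dataset identifier -- replace with the IEMOCAP copy you have access to.
dataset = load_dataset("your-username/iemocap-tts", split="train")

# SpeechT5 operates on 16 kHz audio, so resample the audio column accordingly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

sample = dataset[0]
print(sample["text"])      # text prompt / transcript
print(sample["emotion"])   # e.g. "angry", "happy", "sad", "neutral"
print(len(sample["audio"]["array"]))  # raw waveform samples at 16 kHz
```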
The base TTS model used for fine-tuning is microsoft/speecht5_tts, a state-of-the-art text-to-speech model.
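A minimal sketch of loading the base checkpoint and preparing one training example is shown below. The `SpeechT5Processor` call with `audio_target` produces the spectrogram labels used for fine-tuning. Prepending the emotion label to the input text is just one simple way to condition on emotion, shown here for illustration; it is not necessarily the exact approach used in the notebook, and the column names are assumed to match the sketch above.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)


def prepare_example(example):
    audio = example["audio"]
    # Illustrative emotion conditioning: prepend the label to the text prompt.
    text = f"[{example['emotion']}] {example['text']}"
    processed = processor(
        text=text,
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    # Strip the batch dimension from the spectrogram labels.
    processed["labels"] = processed["labels"][0]
    return processed


# Map over the dataset loaded earlier (columns assumed: text, emotion, audio).
# processed_dataset = dataset.map(prepare_example, remove_columns=dataset.column_names)
```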
The text_to_speech notebook was trained on Kaggle; to reproduce the results or run it yourself, fork my notebook 👉 notebook
Find the fine-tuned model on the Hugging Face Hub 👉 model