This repository provides a guide to fine-tuning a text-to-speech (TTS) model on a dataset whose speech samples carry emotion labels. Fine-tuning a TTS model with emotion labels enables it to generate emotionally expressive speech.
The IEMOCAP dataset, used for fine-tuning, consists of speech samples paired with corresponding emotion labels. Each sample includes a text prompt and the corresponding audio waveform, along with an emotion label indicating the intended emotional expression of the speech.
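As a rough illustration of what each sample looks like, the sketch below loads such a dataset with 🤗 `datasets` and resamples the audio to the 16 kHz rate SpeechT5 expects. The dataset identifier and the column names (`text`, `emotion`, `audio`) are assumptions for illustration; substitute the actual IEMOCAP copy and schema you are working with.

```python
from datasets import load_dataset, Audio

# Hypothetical dataset identifier -- replace with the IEMOCAP copy you have access to.
dataset = load_dataset("your-username/iemocap-tts", split="train")

# SpeechT5 operates on 16 kHz audio, so resample the audio column accordingly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

sample = dataset[0]
print(sample["text"])      # text prompt / transcript
print(sample["emotion"])   # e.g. "angry", "happy", "sad", "neutral"
print(len(sample["audio"]["array"]))  # raw waveform samples at 16 kHz
```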
The base TTS model used for fine-tuning is microsoft/speecht5_tts, a state-of-the-art text-to-speech model.
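A minimal sketch of loading the base checkpoint and preparing one training example is shown below. The `SpeechT5Processor` call with `audio_target` produces the spectrogram labels used for fine-tuning. Prepending the emotion label to the input text is just one simple way to condition on emotion, shown here for illustration; it is not necessarily the exact approach used in the notebook, and the column names are assumed to match the sketch above.

```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

checkpoint = "microsoft/speecht5_tts"
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)


def prepare_example(example):
    audio = example["audio"]
    # Illustrative emotion conditioning: prepend the label to the text prompt.
    text = f"[{example['emotion']}] {example['text']}"
    processed = processor(
        text=text,
        audio_target=audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_attention_mask=False,
    )
    # Strip the batch dimension from the spectrogram labels.
    processed["labels"] = processed["labels"][0]
    return processed


# Map over the dataset loaded earlier (columns assumed: text, emotion, audio).
# processed_dataset = dataset.map(prepare_example, remove_columns=dataset.column_names)
```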
The text_to_speech notebook was trained on Kaggle; to reproduce the results or run it yourself, fork my notebook 👉 notebook
Find the fine-tuned model on the Hugging Face Hub 👉 model