AI-S2-Lab / ECSS

[AAAI'2024] Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling


ECSS

Introduction

This is an implementation of the following paper: "Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling" (accepted by AAAI 2024).

Rui Liu*, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li.

Demo Page

Speech Demo

Dependencies

  • For details about the operating environment dependencies, see FCTalker.
  • You also need to install PyTorch Geometric (used to support the heterogeneous graph neural networks); a minimal install sketch follows this list.
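
A minimal install sketch, assuming a recent PyTorch is already installed (PyTorch Geometric 2.x can install its core package without the compiled extension libraries):

pip install torch_geometric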

Dataset

  • You can download the dataset from DailyTalk.
  • Emotion category and emotion intensity annotations are provided in the ./preprocessed_data/DailyTalk/ folder.

Each line of the annotation file has the format sentence ID|speaker|phoneme sequence|original content|emotion|emotion intensity, for example:

1_1_d30|1|{Y EH1 S AY1 N OW1}|yes, i know.|none|1
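
As a minimal sketch of how such a line splits into its six fields (the function and field names below are illustrative, not taken from the repository's code):

def parse_metadata_line(line):
    # Split one pipe-delimited annotation line into its six fields.
    basename, speaker, phonemes, text, emotion, intensity = line.strip().split("|")
    return {
        "basename": basename,         # sentence ID, e.g. "1_1_d30"
        "speaker": speaker,           # speaker ID
        "phonemes": phonemes,         # phoneme sequence, e.g. "{Y EH1 S AY1 N OW1}"
        "text": text,                 # original content
        "emotion": emotion,           # emotion category, e.g. "none"
        "intensity": int(intensity),  # emotion intensity level
    }

print(parse_metadata_line("1_1_d30|1|{Y EH1 S AY1 N OW1}|yes, i know.|none|1"))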

Preprocessing

Run

python3 prepare_align.py --dataset DailyTalk

for some preparations.

For forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here; you have to unzip the files into preprocessed_data/DailyTalk/TextGrid/. Alternatively, you can run the aligner yourself. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).
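
If you do run the aligner yourself, a typical MFA invocation looks like the following; the corpus path, lexicon file, and acoustic model name here are assumptions for illustration, not files shipped with this repository:

mfa align raw_data/DailyTalk lexicon/librispeech-lexicon.txt english_us_arpa preprocessed_data/DailyTalk/TextGrid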

After that, run the preprocessing script by

python3 preprocess.py --dataset DailyTalk

Training

Train your model with

python3 train.py --dataset DailyTalk
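
FastSpeech2-style trainers commonly resume from a saved checkpoint via the same --restore_step flag used at inference below; whether this repository's train.py accepts it is an assumption:

python3 train.py --dataset DailyTalk --restore_step RESTORE_STEP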

Inference

Only batch inference is supported, since generating a turn may require the contextual history of the conversation. Try

python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk

to synthesize all utterances in preprocessed_data/DailyTalk/val_*.txt.

Citing

To cite this repository:

@inproceedings{liu2024emotion,
  title={Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling},
  author={Liu, Rui and Hu, Yifan and Ren, Yi and Yin, Xiang and Li, Haizhou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={17},
  pages={18698--18706},
  year={2024}
}

Author

E-mail: hyfwalker@163.com


License

MIT License

