LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset.

You can check the paper and demo page.

File Details

The data directory contains the files for LibriTTS-P. Each file is described below:

df1_en.csv, df2_en.csv, df3_en.csv
- Speaker prompt data for Annotator 1, Annotator 2, and Annotator 3, respectively.
excluded_spk_list.txt
- In LibriTTS-R, some spk_ids contain voice samples that clearly differ in gender. This text file lists those spk_ids. We recommend excluding them when using our dataset.
unannotated_spk_list.txt
- Audio files listed in "libritts_r_failed_speech_restoration_examples.tar.gz" (see the LibriTTS-R site) were excluded when annotating speaker prompts. As a result, three speakers had no suitable audio files left for annotation. This text file lists their spk_ids. We recommend excluding these speakers when using speaker prompts.
style_prompt_candidates_v230922.csv
- This file includes the style_prompt_key (e.g., M_p-low_s-slow_e-low) and the corresponding style prompt options, separated by semicolons.
- The style_prompt_key comprises four style factors:
  - Gender: M/F
  - Pitch: low/normal/high
  - Speaking speed: slow/normal/fast
  - Loudness: low/normal/high
- For example, "M_p-low_s-slow_e-low" means the following:
```
M: male
p-low: pitch is low
s-slow: speaking speed is slow
e-low: loudness is low
```
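As a rough illustration of the key format described above, the following sketch splits a style_prompt_key into its four factors. The function name `parse_style_prompt_key` and the two lookup tables are hypothetical helpers, not part of the dataset; the prefix meanings (`p`, `s`, `e`) are taken from the example above.

```python
# Hypothetical helper; the prefix-to-name mapping follows the README's example key.
FACTOR_NAMES = {"p": "pitch", "s": "speaking_speed", "e": "energy"}
GENDERS = {"M": "male", "F": "female"}


def parse_style_prompt_key(key: str) -> dict[str, str]:
    """Split a key such as 'M_p-low_s-slow_e-low' into its style factors."""
    gender_code, *factors = key.split("_")
    parsed = {"gender": GENDERS[gender_code]}
    for factor in factors:
        # Each remaining part looks like '<prefix>-<level>', e.g. 'p-low'.
        prefix, level = factor.split("-", 1)
        parsed[FACTOR_NAMES[prefix]] = level
    return parsed


print(parse_style_prompt_key("M_p-low_s-slow_e-low"))
# {'gender': 'male', 'pitch': 'low', 'speaking_speed': 'slow', 'energy': 'low'}
```

An unknown gender code or factor prefix raises a `KeyError`, which makes malformed keys easy to spot.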

metadata_w_style_prompt_tags_v230922.csv

This file contains metadata for each audio file. For instance, by joining this file with style_prompt_candidates_v230922.csv on style_prompt_key, you can look up the style prompt options for each audio file.

The details of the columns in this CSV file are as follows:

| Name | Description |
| --- | --- |
| item_name | Name of the audio file |
| spk_id | Speaker ID |
| gender | Gender of the speaker |
| pitch | Pitch level of the audio |
| speaking_speed | Speaking speed level |
| energy | Energy level of the audio |
| content_prompt | Content prompt corresponding to the audio |
| style_prompt_key | Key for `style_prompt_candidates_v230922.csv`, indicating the style prompt associated with the audio |
| raw_f0_mean | Average F0 of the voiced parts of the audio |
| raw_f0_scale | Standard deviation of the F0 |
| raw_lf0_mean | Average of the log-F0 for the voiced parts |
| raw_lf0_scale | Standard deviation of the log-F0 |
| raw_speaking_rate | Number of syllables per second |
| raw_loudness_lufs | Loudness in LUFS (loudness units relative to full scale) |
| raw_loudness_mean | Average of the per-frame loudness values, measuring the average loudness over time |
| raw_loudness_scale | Standard deviation of the per-frame loudness values, indicating how much loudness varies across frames |
| invalid | Flag indicating whether the utterance has been marked as invalid due to missing F0, an invalid speaking rate (e.g., speaking_rate < 0), or other processing errors. `1` means invalid, and `0` means valid. |

(For details on how each item is calculated, please refer to the LibriTTS-P paper.)
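The lookup between the two CSV files could be sketched as follows, using only the Python standard library. Several details here are assumptions rather than facts about the released files: that both CSVs are comma-delimited with a header row, that the prompt options column is named `style_prompt`, and that the speaker list files hold one spk_id per line. The in-memory rows are made-up placeholders, not real dataset entries.

```python
import csv
import io


def read_spk_list(lines):
    """Collect speaker IDs, assuming one ID per line (an assumption)."""
    return {line.strip() for line in lines if line.strip()}


def load_style_prompts(candidates_file, prompts_col="style_prompt"):
    """Map each style_prompt_key to its list of semicolon-separated options.

    The column name in `prompts_col` is hypothetical; adjust it to the real header.
    """
    prompts = {}
    for row in csv.DictReader(candidates_file):
        options = [p.strip() for p in row[prompts_col].split(";")]
        prompts[row["style_prompt_key"]] = [p for p in options if p]
    return prompts


def iter_usable_rows(metadata_file, skip_spk_ids=frozenset()):
    """Yield metadata rows that are valid and not from skipped speakers."""
    for row in csv.DictReader(metadata_file):
        if int(float(row["invalid"])) != 0:  # 1 means invalid, 0 means valid
            continue
        if row["spk_id"] in skip_spk_ids:
            continue
        yield row


# Made-up stand-ins for the real files, for illustration only.
candidates_csv = io.StringIO(
    "style_prompt_key,style_prompt\n"
    "M_p-low_s-slow_e-low,placeholder prompt A;placeholder prompt B\n"
)
metadata_csv = io.StringIO(
    "item_name,spk_id,style_prompt_key,invalid\n"
    "utt_0001,100,M_p-low_s-slow_e-low,0\n"
    "utt_0002,100,M_p-low_s-slow_e-low,1\n"
    "utt_0003,200,M_p-low_s-slow_e-low,0\n"
)
skip = read_spk_list(["200\n"])  # e.g. the contents of excluded_spk_list.txt

prompts = load_style_prompts(candidates_csv)
for row in iter_usable_rows(metadata_csv, skip):
    print(row["item_name"], prompts[row["style_prompt_key"]])
# utt_0001 ['placeholder prompt A', 'placeholder prompt B']
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, and the skip set would combine the IDs from excluded_spk_list.txt and, when speaker prompts are used, unannotated_spk_list.txt.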

You can use audio from LibriTTS-R.

Citation

@inproceedings{librittsp,
    author={Masaya Kawamura and Ryuichi Yamamoto and Yuma Shirahata and Takuya Hasumi and Kentaro Tachibana},
    title={LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning},
    booktitle={Proc. Interspeech 2024},
    month=sep,
    year=2024
}

License

CC BY 4.0
