Deep-Learning-Based Text-to-Speech (TTS) Papers and Resources

Various text-to-speech (TTS) papers and resources based on deep learning.


Data

[Mel-spectrogram]

  • Speech Technology: A Practical Introduction, Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis (K. Prahallad, CMU, slide, video)
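
Most models in this list predict a log mel-spectrogram as an intermediate target rather than raw audio. A minimal extraction sketch with librosa; the file name and the 1024/256/80 frame settings are common conventions chosen for illustration, not values taken from the slide above:

```python
# Minimal log-mel-spectrogram extraction sketch (assumed parameters).
import librosa
import numpy as np

wav, sr = librosa.load("speech.wav", sr=22050)          # resample to 22.05 kHz
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression
print(log_mel.shape)                                    # (80, num_frames)
```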

Mel-spectrogram Generator

[Autoregressive]

RNN

  • Char2Wav (J. Sotelo et al., Feb. 2017, Montreal, paper)
  • Tacotron (Y. Wang et al., Mar. 2017, Google, arxiv)
  • Tacotron 2 (J. Shen et al., Dec. 2017, Google, arxiv)

CNN

  • Deep Voice 3 (W. Ping et al., Oct. 2017, Baidu, arxiv)
  • Deep Convolutional Text-to-Speech (H. Tachibana et al., Oct. 2017, arxiv)

Transformer

  • Transformer TTS (N. Li et al., Sep. 2018, Microsoft, arxiv)
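
All of the models above generate the mel-spectrogram autoregressively: each frame is conditioned on previously generated frames, and synthesis stops when a stop token fires. A schematic sketch of that loop; `decoder_step` is a hypothetical stand-in for the real attention-plus-decoder step:

```python
# Schematic autoregressive mel decoding: each predicted frame is fed
# back as the next input. `decoder_step` is a hypothetical callable.
import torch

def decode(decoder_step, encoder_out, n_mels=80, max_frames=1000):
    frame = torch.zeros(1, n_mels)          # all-zero "go" frame
    state, outputs = None, []
    for _ in range(max_frames):
        frame, stop_prob, state = decoder_step(frame, encoder_out, state)
        outputs.append(frame)
        if stop_prob.item() > 0.5:          # stop-token threshold
            break
    return torch.stack(outputs, dim=1)      # (1, T, n_mels)
```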

[Non-autoregressive]

CNN

  • ParaNet (K. Peng et al., May 2019, Baidu, arxiv)

Transformer

  • FastSpeech (Y. Ren et al., May 2019, Microsoft, arxiv; its length regulator is sketched below)
  • AlignTTS (Z. Zeng et al., Mar. 2020, Ping An Tech., arxiv)
  • FastSpeech 2 (Y. Ren et al., Jun. 2020, Microsoft, arxiv)
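
What makes these models non-autoregressive is that every mel frame is produced in parallel; FastSpeech does this with a length regulator that expands phoneme-level hidden states by predicted durations. A minimal sketch (tensor shapes are assumptions):

```python
# Minimal FastSpeech-style length regulator sketch.
import torch

def length_regulate(hidden, durations):
    """hidden: (num_phonemes, dim); durations: (num_phonemes,) ints (frames)."""
    # Repeat each phoneme's hidden state for its predicted number of frames.
    return torch.repeat_interleave(hidden, durations, dim=0)  # (num_frames, dim)

h = torch.randn(3, 4)
d = torch.tensor([2, 1, 3])
print(length_regulate(h, d).shape)  # torch.Size([6, 4])
```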

[Graph Neural Networks]

  • GraphTTS (A. Sun et al., Mar. 2020, Ping An Tech., arxiv)

[Attention Improvement]

  • Monotonic Attention (C. Raffel et al., Jun. 2017, Google Brain, arxiv; inference sketched below)
  • Monotonic Chunkwise Attention (C.-C. Chiu et al., Dec. 2017, Google Brain, arxiv)
  • Stepwise Monotonic Attention (M. He et al., Jun. 2019, Microsoft, arxiv)
  • Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis (E. Battenberg et al., Oct. 2019, Google, arxiv)
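
These mechanisms all constrain the attention head to move monotonically through the encoder memory, which is what makes long-form synthesis robust. A schematic of hard monotonic attention at inference time, in the spirit of Raffel et al.; `energy` is a hypothetical scoring function between the decoder state and a memory entry:

```python
# Schematic hard monotonic attention at inference: scan forward from the
# previous position and stop when the selection probability exceeds 0.5.
import torch

def monotonic_attend(energy, decoder_state, memory, prev_pos):
    pos = prev_pos
    while pos < memory.size(0) - 1:
        p_select = torch.sigmoid(energy(decoder_state, memory[pos]))
        if p_select > 0.5:        # stop and attend here
            break
        pos += 1                  # otherwise advance one memory entry
    return memory[pos], pos       # context vector and new position
```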

[Training Algorithm]

  • A New GAN-based Training Algorithm (H. Guo et al., Apr. 2019, Microsoft, arxiv)

[Data-Efficient]

  • Semi-supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (Y.-A. Chung et al., Aug. 2018, MIT & Google, arxiv)
  • Sample Efficient Adaptive Text-to-Speech (Y. Chen et al., Jan. 2019, DeepMind & Google, arxiv)

Neural Vocoder

[Autoregressive Model]

  • WaveNet (A. van den Oord et al., Sep. 2016, DeepMind, arxiv; mu-law companding sketched below)
  • SampleRNN (S. Mehri et al., Dec. 2016, Montreal, arxiv)
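
The WaveNet paper models raw audio as a 256-way categorical distribution over mu-law-companded samples (mu = 255). A minimal encode/decode sketch:

```python
# Mu-law companding as used by WaveNet (mu = 255, 256 classes).
import numpy as np

MU = 255

def mulaw_encode(x):                       # x in [-1, 1] -> ints in [0, 255]
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mulaw_decode(q):                       # inverse transform
    y = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 5)
print(mulaw_decode(mulaw_encode(x)))       # approximately reconstructs x
```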

[Flow-Based Model]

  • Parallel WaveNet (A. van den Oord et al., Nov. 2017, DeepMind, arxiv)
  • ClariNet (W. Ping et al., Jul. 2018, Baidu, arxiv)
  • WaveGlow (R. Prenger et al., Nov. 2018, NVIDIA, arxiv)
  • FloWaveNet (S. Kim et al., Nov. 2018, SNU, arxiv)
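
Parallel WaveNet and ClariNet distill into inverse autoregressive flows, while WaveGlow and FloWaveNet are built from bipartite affine coupling layers; either way, the key property is a cheaply invertible transform. A minimal coupling-layer sketch; `net` stands in for the WaveNet-like conditioning network and is assumed to output twice as many channels as its input:

```python
# Minimal affine coupling layer sketch: half the channels parameterize a
# scale/shift of the other half, so the inverse is exact and cheap.
import torch

def coupling_forward(x, net):
    xa, xb = x.chunk(2, dim=1)              # split channels in half
    log_s, t = net(xa).chunk(2, dim=1)      # predict scale and shift from xa
    yb = xb * torch.exp(log_s) + t          # transform xb only
    return torch.cat([xa, yb], dim=1), log_s.sum()  # log-det for the flow loss

def coupling_inverse(y, net):
    ya, yb = y.chunk(2, dim=1)
    log_s, t = net(ya).chunk(2, dim=1)
    xb = (yb - t) * torch.exp(-log_s)       # exact inverse of the forward pass
    return torch.cat([ya, xb], dim=1)
```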

[Generative Adversarial Network]

  • WaveGAN (C. Donahue et al., Feb. 2018, UCSD, arxiv)
  • GAN-TTS (M. Binkowski et al., Sep. 2019, DeepMind, arxiv)
  • Parallel WaveGAN (R. Yamamoto et al., Oct. 2019, Naver, arxiv; its adversarial loss is sketched below)
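
Parallel WaveGAN trains its non-autoregressive generator with a least-squares adversarial loss (alongside a multi-resolution STFT loss, omitted here). A minimal sketch of the two adversarial terms; `D`, `real`, and `fake` are placeholder discriminator and waveform tensors:

```python
# Least-squares GAN losses for a vocoder (auxiliary STFT loss omitted).
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake):
    d_real, d_fake = D(real), D(fake.detach())   # detach: no generator grads
    return (F.mse_loss(d_real, torch.ones_like(d_real)) +
            F.mse_loss(d_fake, torch.zeros_like(d_fake)))

def generator_loss(D, fake):
    d_fake = D(fake)
    return F.mse_loss(d_fake, torch.ones_like(d_fake))  # push D(fake) -> 1
```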

Style Modeling

[Style Token]

  • Uncovering Latent Style Factors for Expressive Speech Synthesis (Y. Wang et al., Nov. 2017, Google, arxiv)
  • GST Tacotron (Y. Wang et al., Mar. 2018, Google, arxiv; token attention sketched below)
  • TP-GST Tacotron (D. Stanton et al., Aug. 2018, Google, arxiv)
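
In GST Tacotron, a reference-encoder embedding attends over a small bank of learned style-token embeddings, and the weighted sum conditions the decoder. A single-head simplification (the paper uses ten tokens with multi-head attention; the sizes here are assumptions):

```python
# Minimal single-head Global Style Token (GST) sketch.
import torch
import torch.nn as nn

class GST(nn.Module):
    def __init__(self, num_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):              # (batch, ref_dim)
        q = self.query(ref_embedding)              # (batch, token_dim)
        scores = q @ torch.tanh(self.tokens).t()   # (batch, num_tokens)
        weights = scores.softmax(dim=-1)           # attention over tokens
        return weights @ torch.tanh(self.tokens)   # style embedding
```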

[Generative Adversarial Network]

  • TTS-GAN (S. Ma et al., Apr. 2019, Microsoft, paper)
  • Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning (Y. Zhang et al., Jul. 2019, Google, arxiv)
  • Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization (W.-N. Hsu et al., Sep. 2019, Google, paper)

[Mutual Information]

  • Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis (T.-Y. Hu et al., Mar. 2020, CMU & Apple, arxiv)

Dataset

[English]

ASR

  • LibriSpeech dataset (paper, download) (V. Panayotov et al., 2015)
    • 2,484 speakers in total, 1,000+ hours
  • VoxCeleb1 dataset (paper, download) (A. Nagrani et al., 2017)
    • 151,516 utterances, 1,251 speakers, 352 hours
  • VoxCeleb2 dataset (paper, download) (J. S. Chung et al., 2018)
    • 1,128,246 utterances, 6,112 speakers, 2,442 hours

TTS

  • LJSpeech dataset (download) (Keith Ito and Linda Johnson, 2017; loading sketch below)
    • single female speaker, 13,100 samples, approximately 24 hours
  • CSTR VCTK Corpus (download) (C. Veaux et al.)
    • 109 English speakers with various accents, about 400 utterances per speaker
  • Blizzard dataset (download)
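
LJSpeech ships as a wavs/ directory plus a pipe-delimited metadata.csv with three fields per line: clip id, raw transcript, and normalized transcript. A minimal loading sketch (paths are assumptions):

```python
# Pair each LJSpeech clip with its normalized transcript.
import csv
from pathlib import Path

root = Path("LJSpeech-1.1")
with open(root / "metadata.csv", encoding="utf-8") as f:
    for clip_id, raw, normalized in csv.reader(f, delimiter="|",
                                               quoting=csv.QUOTE_NONE):
        wav_path = root / "wavs" / f"{clip_id}.wav"
        # ... feed (wav_path, normalized) to the training pipeline
```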

[Korean]

ASR

  • Korean and English speech data for acoustic-model training (한국어 및 영어 음향모델 훈련용 음성 데이터) (download) (ETRI)
    • (Korean speech) 50 speakers x 100 utterances per speaker (5,000 samples in total)
    • (English speech pronounced by Korean speakers) 50 speakers x 100 utterances per speaker (5,000 samples in total)
  • Children's speech data for voice-interface development (음성인터페이스 개발을 위한 어린이 음성 데이터) (download) (ETRI)
    • 50 speakers x 100 utterances per speaker x 3 environments (16,200 samples in total)
    • speakers: elementary-school students (1st to 4th grade)
    • recorded with an iPhone 5, a Samsung Galaxy S4, and microphones
  • ClovaCall dataset (paper, download) (Naver Corp.)
    • 140,000+ utterances, 211+ hours of noisy and clean speech
  • KSponSpeech (download) (ETRI)
    • 2,000 speakers, 1,000+ hours, various topics (daily life, shopping, hobbies, weather, etc.)
  • Korean Read Speech Corpus (download) (National Institute of Korean Language)
    • 8 speakers, 120+ hours
  • Zeroth-Korean (download) (Lucas Jo and Wonkyum Lee)
    • 115 speakers, 52.8 hours, 22,720 utterances in total
  • Pansori-TEDxKR Corpus (paper, download) (Y. Choi and B. Lee)
    • 3 hours, 41 speakers

TTS

  • Korean Single Speaker (KSS) Speech Dataset (download) (K. Park)
    • single female speaker, 12,853 samples, 12+ hours
  • Emotional speech-synthesis dataset (감정 음성합성 데이터셋) (download) (Acryl Inc.)
    • single female speaker, 7 emotions (neutral, sad, fear, happy, angry, disgust, surprise), 22,000 samples in total (about 3,000 per emotion)
  • EmotionTTS-Open-DB dataset (download) (KAIST and Selvas AI)
    • single-speaker, multi-speaker, and multi-speaker multi-emotion subsets
  • KAIST Audiobook Dataset (카이스트 오디오북 데이터셋) (download) (KAIST)
    • 58,559 utterances, 72+ hours, 13 speakers
    • various reading materials (news, novels, etc.)
