audio-synthesis fastapi generative-model python3 speech-enhancement transfer-learning tts voice-cloning zero-shot-learning

Voice Cloning Model with Zero-Shot Attention-Based TTS

This API is powered by YourTTS, a zero-shot multi-speaker TTS model for generative audio synthesis.

The paper that proposed the YourTTS model is the central building block of this API. YourTTS takes a multilingual approach to zero-shot multi-speaker TTS: it builds on the earlier VITS model and can be trained on multilingual audio data.
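As a rough illustration of zero-shot cloning with YourTTS, the sketch below uses the Coqui `TTS` Python package and its released `your_tts` checkpoint. This is an assumption about tooling, not a description of how this repository loads the model. The function name `clone_voice`, the default output path, and the language codes listed are illustrative choices.

```python
from pathlib import Path

# Assumption: language codes accepted by the released Coqui "your_tts" checkpoint.
SUPPORTED_LANGUAGES = ("en", "fr-fr", "pt-br")
# Assumption: model identifier as published in the Coqui TTS model zoo.
MODEL_NAME = "tts_models/multilingual/multi-dataset/your_tts"


def clone_voice(text: str, speaker_wav: str, language: str = "en",
                out_path: str = "cloned.wav") -> str:
    """Synthesize `text` in the voice heard in `speaker_wav` (hypothetical helper)."""
    # Validate inputs before loading the (large) model.
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    if not Path(speaker_wav).is_file():
        raise FileNotFoundError(f"reference audio not found: {speaker_wav}")

    # Imported lazily so the checks above run even without the TTS package installed.
    from TTS.api import TTS

    tts = TTS(model_name=MODEL_NAME, progress_bar=False)
    # The reference clip conditions the model on an unseen speaker (zero-shot).
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return out_path
```

A call such as `clone_voice("Hello there.", "reference.wav")` would write the synthesized speech to `cloned.wav`, where `reference.wav` is a placeholder for a short recording of the target speaker.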

Reference implementations used to study TTS concepts can be found here.

The models researched are open source, as provided by Coqui:

Model	URL
Speaker Encoder	link
Exp 1. YourTTS-EN(VCTK)	link
Exp 1. YourTTS-EN(VCTK) + SCL	link
Exp 2. YourTTS-EN(VCTK)-PT	link
Exp 2. YourTTS-EN(VCTK)-PT + SCL	link
Exp 3. YourTTS-EN(VCTK)-PT-FR	link
Exp 3. YourTTS-EN(VCTK)-PT-FR SCL	link
Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL	link

TTS Retraining Data

The audio samples used for the MOS (mean opinion score) evaluation are available here. The MOS results for those audio samples are also available here.

Default TTS Audio Sources:

LibriTTS (test clean): 1188, 1995, 260, 1284, 2300, 237, 908, 1580, 121 and 1089

VCTK: p261, p225, p294, p347, p238, p234, p248, p335, p245, p326 and p302

MLS Portuguese: 12710, 5677, 12249, 12287, 9351, 11995, 7925, 3050, 4367 and 1306
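For scripted evaluation, the default sources listed above can be kept in a small lookup table. This is a minimal sketch: the names `DEFAULT_SOURCES` and `dataset_for` are hypothetical, and the IDs are assumed to be the speaker identifiers used by each dataset.

```python
from typing import Optional

# IDs copied from the lists above, stored as strings because VCTK IDs are not numeric.
DEFAULT_SOURCES = {
    "LibriTTS (test clean)": ["1188", "1995", "260", "1284", "2300",
                              "237", "908", "1580", "121", "1089"],
    "VCTK": ["p261", "p225", "p294", "p347", "p238", "p234",
             "p248", "p335", "p245", "p326", "p302"],
    "MLS Portuguese": ["12710", "5677", "12249", "12287", "9351",
                       "11995", "7925", "3050", "4367", "1306"],
}


def dataset_for(source_id: str) -> Optional[str]:
    """Return the dataset that lists `source_id`, or None if it is not a default source."""
    for dataset, ids in DEFAULT_SOURCES.items():
        if source_id in ids:
            return dataset
    return None
```

For example, `dataset_for("p225")` looks up which dataset a given ID belongs to before the matching audio is loaded.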

Citation


@ARTICLE{2021arXiv211202418C,
  author = {{Casanova}, Edresson and {Weber}, Julian and {Shulby}, Christopher and {Junior}, Arnaldo Candido and {G{\"o}lge}, Eren and {Antonelli Ponti}, Moacir},
  title = "{YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone}",
  journal = {arXiv e-prints},
  keywords = {Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing},
  year = 2021,
  month = dec,
  eid = {arXiv:2112.02418},
  pages = {arXiv:2112.02418},
  archivePrefix = {arXiv},
  eprint = {2112.02418},
  primaryClass = {cs.SD},
  adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211202418C},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

About

A generative voice cloning model that uses TTS synthesis with state-of-the-art zero-shot multi-speaker functionality. A web API built on the YourTTS model to clone voices and generate realistic audio waveforms.


Apache License 2.0
