EricKong1985 / VALL-E-X

An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in https://plachtaa.github.io

VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning 🔊

An open source implementation of Microsoft's VALL-E X zero-shot TTS model.
Trained model will be released after this repository is fully ready.

⭐️ Welcome to VALL-E X! ⭐️

VALL-E X is an amazing multilingual text-to-speech (TTS) model inspired by Microsoft's groundbreaking research. While Microsoft initially proposed the concept in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are excited to share our trained VALL-E X model with the community, allowing everyone to experience the power of personalized speech synthesis and voice cloning! 🎧

More details about the model are presented in model card.

💻 Installation

Install with pip

https://github.com/Plachtaa/VALL-E-X.git
cd VALL-E-X
pip install --no-error-on-external -r requirements.txt

❗❗❗ Special Notes ❗❗❗

Japanese g2p tool pyopenjtalk may fail to build during installation, you may ignore it if you don't require Japanese TTS functionality. We are currently searching for more stable substitution.

🎧 Demos

Not ready to set up the environment on your local machine just yet? No problem! We've got you covered with our online demos. You can try out VALL-E X directly on Hugging Face or Google Colab, experiencing the model's capabilities hassle-free!

📢 Features

VALL-E X comes packed with cutting-edge functionalities:

Multilingual TTS: Speak in three languages - English, Chinese, and Japanese - with natural and expressive speech synthesis.
Zero-shot Voice Cloning: Enroll a short 3~10 seconds recording of an unseen speaker, and watch VALL-E X create personalized, high-quality speech that sounds just like them!

see example

prompt.webm

output.webm

Speech Emotion Control: Experience the power of emotions! VALL-E X can synthesize speech with the same emotion as the acoustic prompt provided, adding an extra layer of expressiveness to your audio.

see example

sleepy-prompt.mp4

sleepy-output.mp4

Zero-shot Cross-Lingual Speech Synthesis: Take monolingual speakers on a linguistic journey! VALL-E X can produce personalized speech in another language without compromising on fluency or accent. Below is a Japanese speaker talk in Chinese & English. 🇯🇵 🗣

see example

jp-prompt.webm

en-output.webm

zh-output.webm

Accent Control: Get creative with accents! VALL-E X allows you to experiment with different accents, like speaking Chinese with an English accent or vice versa. 🇨🇳 💬

see example

en-prompt.webm

zh-accent-output.webm

en-accent-output.webm

Acoustic Environment Maintenance: No need for perfectly clean audio prompts! VALL-E X adapts to the acoustic environment of the input, making speech generation feel natural and immersive.

see example

noise-prompt.webm

noise-output.webm

Explore our demo page for a lot more examples!

🐍 Usage in Python

🪑 Basics

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)

hamburger.webm

🌎 Foreign Language

This VALL-E X implementation also supports Chinese and Japanese. All three languages have equally awesome performance!

text_prompt = """
    チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)

vallex_japanese.webm

Note: VALL-E X controls accent perfectly even when synthesizing code-switch text. However, you need to manually denote language of respective sentences (since our g2p tool is rule-base)

text_prompt = """
    [EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
    [ZH]这是历史的开始。 如果您想听更多，请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')

vallex_codeswitch.webm

📼 Voice Presets

VALL-E X provides tens of speaker voices which you can directly used for inference! Browse all voices in the code

VALL-E X tries to match the tone, pitch, emotion and prosody of a given preset. The model also attempts to preserve music, ambient noise, etc.

text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")

smoky.webm

🎙Voice Cloning

VALL-E X supports voice cloning! You can make a voice prompt with any person, character or even your own voice, and use it like other voice presets.
To make a voice prompt, you need to provide a speech of 3~10 seconds long, as well as the transcript of the speech. You can also leave the transcript blank to let the Whisper model to generate the transcript.

VALL-E X tries to match the tone, pitch, emotion and prosody of a given prompt. The model also attempts to preserve music, ambient noise, etc.

from utils.prompt_making import make_prompt

### Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
                transcript="Just, what was that? Paimon thought we were gonna get eaten.")

### Alternatively, use whisper
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")

Now let's try out the prompt we've just made!

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")

# save audio to disk
write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)

# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)

paimon_prompt.webm

paimon_cloned.webm

🎢User Interface

Not comfortable with codes? No problem! We've also created a user-friendly graphical interface for VALL-E X. It allows you to interact with the model effortlessly, making voice cloning and multilingual speech synthesis a breeze.
You can launch the UI by the following command:

python launch-ui.py

🙌 Contribute

We welcome contributions from the community to make VALL-E X even better! If you have ideas, bug fixes, or want to add more languages, emotions, or accents to the model, feel free to submit a pull request. Together, we can take VALL-E X to new heights!

⭐️ Show Your Support

If you find VALL-E X interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.

📜 License

VALL-E X is licensed under the MIT License.

Let your imagination run wild with VALL-E X, and enjoy the fantastic world of multilingual text-to-speech synthesis and voice cloning! 🌈 🎶

Have questions or need assistance? Feel free to open an issue or join our community (not set yet lol)

Happy voice cloning! 🎤

About

An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in https://plachtaa.github.io

MIT License

Languages

Language:Python 100.0%