DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-to-Speech

DiFlow-TTS is a novel zero-shot text-to-speech system that leverages purely discrete flow matching with factorized speech token modeling.

Install the required dependencies using Conda:

conda env create -f environment.yaml
conda activate diflow

Download the pretrained FACodec model from HuggingFace, and place the checkpoint files in the following structure:

root/
└── models/
    └── facodec/
        └── checkpoints/
            ├── ns3_facodec_encoder.bin
            └── ns3_facodec_decoder.bin

Similarly, place the DiFlow-TTS checkpoint under ckpts/:

root/
└── ckpts/
    └── diflow-tts.ckpt
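
Before running synthesis, it can help to confirm that the files sit where the layouts above expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it assumes the repository root is the directory you pass in.

```python
from pathlib import Path

# Relative paths copied from the directory layouts above.
EXPECTED_FILES = (
    "models/facodec/checkpoints/ns3_facodec_encoder.bin",
    "models/facodec/checkpoints/ns3_facodec_decoder.bin",
    "ckpts/diflow-tts.ckpt",
)


def missing_checkpoints(root="."):
    """Return the expected checkpoint paths that are not regular files under root.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that their contents are valid checkpoints.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All checkpoint files found.")
```

Run it from the repository root; an empty result means all three files are in place.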

To synthesize a sample with DiFlow-TTS, follow these steps:

1. Open the script: scripts/synth_one_sample.sh
2. Edit the following lines:
   - Line 3: Set the path to the DiFlow-TTS checkpoint.
   - Line 4: Set your input text.
   - Line 5: Set the path to your reference speech prompt.
3. Run the script with:

CUDA_VISIBLE_DEVICES=0 bash scripts/synth_one_sample.sh

Make sure the model checkpoint and audio prompt are correctly formatted and accessible at the specified paths.
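
One way to check the reference prompt ahead of time is to open it with Python's standard `wave` module. This is a sketch under an assumption: the README does not state which audio format or sample rate the prompt must use, so the code below assumes a WAV file and only reports its properties. The function name `describe_wav_prompt` is hypothetical.

```python
import wave
from pathlib import Path


def describe_wav_prompt(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV prompt.

    Assumption: the reference prompt is a WAV file. The README does not
    specify the required format, so this only reports what the file contains.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Reference prompt not found: {path}")
    # wave.open raises wave.Error if the stdlib cannot parse the file as WAV.
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration
```

Calling `describe_wav_prompt` with the path you set on line 5 of the script either returns the basic properties of the prompt or raises an error that points to a missing or unreadable file.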

Coming soon
