entn-at / DiFlow-TTS

DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-to-Speech


[Project Page]


🚩 Submitted to AAAI 2026

🔥 News

  • [Coming soon] Release evaluation code.
  • [Coming soon] Release training instructions.
  • [2025.08] Release inference code.
  • [2025.08] Repository created.

🗣️ Overview

DiFlow-TTS is a novel zero-shot text-to-speech system that leverages purely discrete flow matching with factorized speech token modeling.




🛠️ Dependencies & Installation

1. Set Up the Environment

Install the required dependencies using Conda:

conda env create -f environment.yaml
conda activate diflow

2. Download Models

  • Download the pretrained FACodec model from HuggingFace, and place the checkpoint files in the following structure:
root/
└── models/
    └── facodec/
        └── checkpoints/
            ├── ns3_facodec_encoder.bin
            └── ns3_facodec_decoder.bin

  • Download the DiFlow-TTS model checkpoint from Link, and place it as follows:
root/
└── ckpts/
    └── diflow-tts.ckpt
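
Before running inference, it can help to confirm that the files ended up where the two directory trees above expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoints` is hypothetical, and it assumes `root/` is the repository root from which you run the script.

```python
from pathlib import Path

# Relative paths copied from the directory trees above.
EXPECTED_FILES = [
    "models/facodec/checkpoints/ns3_facodec_encoder.bin",
    "models/facodec/checkpoints/ns3_facodec_decoder.bin",
    "ckpts/diflow-tts.ckpt",
]


def missing_checkpoints(root="."):
    """Return the expected checkpoint paths that are not present under root.

    Hypothetical helper for illustration; it only checks that each file
    exists, not that its contents are valid.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All checkpoint files found.")
```

Run it from the repository root; an empty result means all three files are in place.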

🚀 2. Quick Inference

To synthesize a sample with DiFlow-TTS, follow these steps:

  1. Open the script: scripts/synth_one_sample.sh

  2. Edit the following lines:

    • Line 3: Set the path to the DiFlow-TTS checkpoint.
    • Line 4: Set your input text.
    • Line 5: Set the path to your reference speech prompt.
  3. Run the script with:

CUDA_VISIBLE_DEVICES=0 bash scripts/synth_one_sample.sh

Make sure the model checkpoint and audio prompt are correctly formatted and accessible at the specified paths.
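
One way to check the reference prompt before launching the script is to open it and print its basic properties. This is a sketch under a stated assumption: it treats the prompt as a PCM WAV file, which this README does not specify, and the helper name `describe_wav_prompt` is hypothetical. It uses only Python's standard `wave` module and reports what it finds; it does not know which sample rate or channel count the model expects.

```python
import wave
from pathlib import Path


def describe_wav_prompt(path):
    """Return basic properties of a PCM WAV file.

    Hypothetical helper for illustration. Raises FileNotFoundError if the
    file is missing and wave.Error if it is not a readable PCM WAV file.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Reference prompt not found: {path}")
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

Calling `describe_wav_prompt` on the path you set in Line 5 of the script fails early if the path is wrong, and otherwise returns the channel count, sample rate, and duration for you to compare against whatever the model requires.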

🏋️‍♂️ 3. Training

Coming soon
