xiangmy21 / Amphion-modified

This repository is a lightly modified version of [Amphion](https://github.com/open-mmlab/Amphion). It mainly fixes issues in the original repository's NaturalSpeech2 data-preprocessing code and gets the NaturalSpeech2 training pipeline running on the LibriTTS dataset.

Modified Version To Train NaturalSpeech2

Once the dependencies are installed, the fixed code needs only the dataset's TextGrid files to start training.

The sections below briefly walk through environment setup, data preparation, and training:

Installing Dependencies

Follow the original repository's instructions and install the environment listed in env.sh. If your server cannot reach the internet, here are workarounds for some installation problems:

  • For packages installed via pip from git+github URLs, visit the GitHub page and install the package directly.
  • For the NS2 codec model weights, find the download URL in the repository source, locate the cache_file path, and place the downloaded file under .cache.
  • For nltk, download the data locally and upload it to the server. On my machine it lives in C:\Users\{Username}\AppData\Roaming\nltk_data.
  • For monotonic align, follow the same steps as in VALL-E's run.sh.
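
If nltk still tries to download resources at runtime, you can point it at the uploaded directory instead. A minimal sketch; the `~/nltk_data` location below is an assumed upload path, not something fixed by the repository:

```python
import os

# nltk resolves data locations from the NLTK_DATA environment variable,
# so pointing it at the uploaded copy avoids any network download.
# "~/nltk_data" is an assumed upload location; adjust to where you put it.
os.environ["NLTK_DATA"] = os.path.expanduser("~/nltk_data")
```

Set this before the first `import nltk` so the lookup path is already in place.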

Data Processing

Since I do not have MFA locally, my TextGrid files were downloaded from github.com/kan-bayashi/LibriTTSLabel.
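
As a rough illustration of what these TextGrid files provide, here is a minimal regex-based sketch that extracts (phone, duration) pairs. The sample content is made up, and a real pipeline should use a proper TextGrid parsing library rather than a regex:

```python
import re

# Made-up fragment in the "long" TextGrid layout used by MFA-style alignments.
SAMPLE = '''
        intervals [1]:
            xmin = 0.0
            xmax = 0.12
            text = "AH0"
        intervals [2]:
            xmin = 0.12
            xmax = 0.30
            text = "B"
'''

def phone_durations(textgrid_text):
    # Each interval is an (xmin, xmax, text) triple; duration = xmax - xmin.
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
    )
    return [(text, float(xmax) - float(xmin))
            for xmin, xmax, text in pattern.findall(textgrid_text)]
```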

Steps:

  • Run python bins/tts/preprocess.py --config={your path}/amphion/egs/tts/NaturalSpeech2/exp_config.json to generate train.json (the file is actually produced around line 64 of libritts.py, so this flow could be streamlined).
  • Because train.json is generated from the whole dataset while the TextGrid files cover only part of it, train.json needs some cleanup: in egs/tts/NaturalSpeech2/trainjson_filter.py, comment out the duration-related parts, keep only the TextGrid part, and run it to obtain a new train-clean.json.
  • In egs/tts/NaturalSpeech2/exp_config.json, change train_file and textgrid_dir to your own paths.
  • Run sh egs/tts/NaturalSpeech2/run_preprocess.sh to generate duration, pitch, phone, and other features.
  • Run python egs/tts/NaturalSpeech2/generate_code.py to generate the codec codes; remember to update the dataset paths in the script.
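
The cleanup step can be sketched roughly as below. This is not the actual trainjson_filter.py; the "Uid" field name and the "<uid>.TextGrid" file layout are assumptions that may differ in your checkout:

```python
import json
import os

def filter_by_textgrid(train_json_path, textgrid_dir, out_path):
    """Keep only metadata entries whose TextGrid file actually exists."""
    with open(train_json_path) as f:
        entries = json.load(f)
    kept = [
        e for e in entries
        if os.path.exists(os.path.join(textgrid_dir, e["Uid"] + ".TextGrid"))
    ]
    with open(out_path, "w") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)
    return len(entries), len(kept)
```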

Training

Run sh egs/tts/NaturalSpeech2/run_train.sh. To resume training from a checkpoint, add the arguments --resume --checkpoint_path "[checkpointpath]".

In exp_config.json you can adjust:

  • max_epoch (maximum number of training epochs)
  • save_checkpoint_stride (how many steps between saved checkpoints; each checkpoint is about 4.7 GB)
  • batch_size (batch size; lower it to reduce GPU memory usage)
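
For reference, the relevant fragment of exp_config.json might look like this; the nesting and the values are illustrative only and may differ in your checkout:

```json
{
  "train": {
    "max_epoch": 100,
    "save_checkpoint_stride": [1000],
    "batch_size": 8
  }
}
```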

The trained model is saved under ckpt/tts/ns2_libritts.

Inference

bash egs/tts/NaturalSpeech2/run_inference.sh --text "[The text you want to generate]"

The output is written to the output folder. The checkpoint used for inference can be changed in the sh file.


Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit


Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,

  • TTS: Text to Speech (⛳ supported)
  • SVS: Singing Voice Synthesis (👨‍💻 developing)
  • VC: Voice Conversion (👨‍💻 developing)
  • SVC: Singing Voice Conversion (⛳ supported)
  • TTA: Text to Audio (⛳ supported)
  • TTM: Text to Music (👨‍💻 developing)
  • more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks.

Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are generated by our models. Just enjoy it!

Amphion-Demo-EN.mp4

🚀 News

  • 2024/02/22: SingVisio, the first Amphion visualization tool, released. arXiv openxlab Video readme
  • 2023/12/18: Amphion v0.1 release. arXiv hf youtube readme
  • 2023/11/28: Amphion alpha release. readme

⭐ Key Features

TTS: Text to Speech

  • Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    • Vall-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper. arXiv code
  • Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses Bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM. Additionally, it supports single-step inference based on the Consistency Model.
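
As an illustration of what one DDPM reverse (ancestral-sampling) step computes, the update from x_t to x_{t-1} given a predicted noise eps can be sketched as below. This is a generic textbook sketch, not Amphion's actual implementation, and the simple variance choice is an assumption:

```python
import numpy as np

def ddpm_reverse_step(x_t, eps, alphas, t, rng):
    """One DDPM ancestral-sampling step: sample x_{t-1} from x_t and eps."""
    alpha_t = alphas[t]
    alpha_bar_t = np.prod(alphas[: t + 1])  # cumulative product up to step t
    # Posterior mean: remove the predicted noise, then rescale.
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # the final step is deterministic
    sigma_t = np.sqrt(1 - alpha_t)  # a simple (assumed) variance choice
    return mean + sigma_t * rng.standard_normal(x_t.shape)
```

DDIM and PNDM replace the stochastic term with deterministic multi-step updates, which is what enables fewer sampling steps.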

TTA: Text to Audio

  • Amphion supports the TTA with a latent diffusion model. It is designed like AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. arXiv code

Vocoder

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include:

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, Resemblyzer, WeSpeaker, WavLM and more.
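
The speaker-similarity metric reduces to a cosine similarity between two speaker embeddings. A minimal sketch; in practice the embeddings would come from a model such as RawNet3, Resemblyzer, WeSpeaker, or WavLM, whereas here they are plain vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```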

Datasets

Amphion unifies the data preprocessing of open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).

Visualization

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports SingVisio, a visualization tool of the diffusion model for singing voice conversion. arXiv openxlab Video

📀 Installation

Amphion can be installed through either Setup Installer or Docker Image.

Setup Installer

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh

Docker Image

  1. Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.

  2. Run the following commands:

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion

Mounting the dataset with the -v argument is necessary when using Docker. Please refer to Mount dataset in Docker container and Docker Docs for more details.

🐍 Usage in Python

We detail the instructions for different tasks in the corresponding recipes.

👨‍💻 Contributing

We appreciate all contributions to improve Amphion. Please refer to CONTRIBUTING.md for the contributing guideline.

🙏 Acknowledgement

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

@article{zhang2023amphion,
      title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit}, 
      author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
      journal={arXiv},
      year={2024},
      volume={abs/2312.09911}
}
