Stardust-minus / NaturalSpeech2

NaturalSpeech 2 (WIP)

  • This code is an unofficial implementation of NaturalSpeech 2.
  • The algorithm is based on the following paper:
Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., ... & Bian, J. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.

Modifications from Paper

  • The structure is derived from NaturalSpeech 2, but I made several modifications.
  • About CE-RVQ
    • The CE-RVQ implementation in the current repository is incomplete.
      • I had doubts about the loss calculation formula mentioned in the paper, so the previous implementation has been commented out.
      • I think the current implementation aligns with the purpose of CE-RVQ, but deviates from the paper.
      • It has not yet been verified how much this loss contributes to model training.
      • I would greatly appreciate any advice or suggestions you may have regarding this matter.
    • The CE-RVQ loss is selectively applied to a random subset of RVQ layers at each step.
      • Since CE-RVQ consumes a significant amount of memory, I applied sampling to reduce memory usage.
      • If you want to apply it to all RVQ layers, please adjust the hyperparameter hp.Diffusion.CERVQ.Num_Sample.
      • Based on a suggestion from @Autonomof, I added an option to increase the weight of the initial layers when sampling the CE-RVQ layers. If hp.Diffusion.CERVQ.Use_Weighted_Sample is set to true, the weights are taken into account (see the sampling sketch after this list).
  • The audio codec has been changed to Meta's Encodec 24kHz (see the latent-extraction sketch after this list).
    • This is done to reduce the time spent training a separate audio codec.
    • The model uses 16kHz audio, but no audio resampling is applied.
    • Encodec's latent dimension is 128, which is smaller than the 256 used in the paper. This may be one cause of performance degradation.
    • To stay closer to the paper, it might be better to use Google's SoundStream instead of Encodec, but I could not apply SoundStream to this repository because no official PyTorch source code or pretrained model is provided.
      • There is an unverified implementation of SoundStream in Codec.py; please refer to it if needed.
      • Although this repository does not use it, there are also C++ and TFLite versions of Lyra, which may make it possible to apply SoundStream through them.
  • Information on the segment length σ of the speech prompt during training was not found in the paper, so it was set arbitrarily.
    • The σ = 3, 5, and 10 seconds used in the paper's evaluation are too long to apply to both the variance predictor and the diffusion during training.
    • To ensure stable pattern usage, σ is set to half the length of the shortest pattern used in each training run.
  • The target duration is obtained through an alignment learning framework (ALF), rather than being imported externally.
    • Using external modules such as Montreal Forced Aligner (MFA) may have benefits in terms of training speed or stability, but I prioritized simplifying the training process.
  • Padding tokens are inserted between tokens, like 'A <P> B <P> C ...'.
    • I could not verify whether this makes a difference in performance.
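
The following is a minimal sketch of the CE-RVQ layer sampling described above. It assumes the hyperparameters hp.Diffusion.CERVQ.Num_Sample and hp.Diffusion.CERVQ.Use_Weighted_Sample mentioned earlier; the 1/(index + 1) weighting is only an illustrative assumption, not necessarily the weighting used in this repository.

import torch

def sample_cervq_layers(num_layers: int, num_sample: int, use_weighted_sample: bool) -> torch.Tensor:
    # Pick the subset of RVQ layer indices to which the CE-RVQ loss is applied at this step.
    if use_weighted_sample:
        # Favor the initial layers; this particular weighting is an assumption for illustration.
        weights = 1.0 / (torch.arange(num_layers, dtype=torch.float) + 1.0)
    else:
        weights = torch.ones(num_layers)
    return torch.multinomial(weights, num_samples=num_sample, replacement=False)

# Example: sample 4 of 8 RVQ layers, biased toward the earlier layers.
layer_indices = sample_cervq_layers(num_layers=8, num_sample=4, use_weighted_sample=True)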
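
And this is a minimal sketch of extracting the 128-dimensional continuous latents from Meta's Encodec 24kHz encoder, i.e. the latents the other modules of this repository work with; the dummy waveform and the printed shape are illustrative only.

import torch
from encodec import EncodecModel

model = EncodecModel.encodec_model_24khz()  # pretrained Encodec 24kHz model
model.eval()

wav = torch.randn(1, 1, 24000)  # [batch, channels, samples]; dummy 1-second mono waveform
with torch.no_grad():
    latents = model.encoder(wav)  # [batch, 128, frames]; 128-dimensional continuous latents
print(latents.shape)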

Supported dataset

  • LJSpeech
  • VCTK
  • LibriTTS

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.

  • Sound

    • Setting basic sound parameters.
  • Tokens

    • The number of tokens.
    • After generating patterns, you can see which tokens are included in the dataset at Token_Path.
  • Audio_Codec

    • Setting the audio codec.
    • This repository uses Encodec, so only the size of the latents output by Encodec's encoder is set here, for reference by other modules.
  • Train

    • Setting the training parameters.
  • Inference_Batch_Size

    • Setting the batch size used during inference.
  • Inference_Path

    • Setting the inference path
  • Checkpoint_Path

    • Setting the checkpoint path
  • Log_Path

    • Setting the tensorboard log path
  • Use_Mixed_Precision

    • Setting whether to use mixed precision.
  • Use_Multi_GPU

    • Setting whether to use multi-GPU training.
    • Due to an nvcc problem, only Linux supports this option.
    • If this is true, the Device parameter must also list multiple GPUs, like 0,1,2,3.
    • You also have to change the training command; please check multi_gpu.sh.
  • Device

    • Setting which GPU devices are used in a multi-GPU environment.
    • Or, to use CPU only, set '-1'. (Not recommended for training.)
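
As a quick sanity check, you can load Hyper_Parameters.yaml and confirm that the paths above exist before training. This is only a hedged helper sketch using the key names listed in this section, not code from the repository; adjust the keys if your config differs.

import os
import yaml

with open('Hyper_Parameters.yaml', 'r', encoding='utf-8') as f:
    hp = yaml.safe_load(f)

# Key names follow the section above.
print('Mixed precision:', hp['Use_Mixed_Precision'])
print('Multi GPU:', hp['Use_Multi_GPU'], '| Device:', hp['Device'])
for key in ['Inference_Path', 'Checkpoint_Path', 'Log_Path']:
    path = hp[key]
    print(f'{key}: {path} (exists: {os.path.exists(path)})')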

Generate pattern

python Pattern_Generate.py [parameters]

Parameters

  • -lj
    • The path of LJSpeech dataset
  • -vctk
    • The path of VCTK dataset
  • -libri
    • The path of LibriTTS dataset
  • -hp
    • The path of the hyperparameter file.
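
For example, to generate patterns from LJSpeech only (the dataset path below is just a placeholder):

python Pattern_Generate.py -lj /path/to/LJSpeech -hp Hyper_Parameters.yaml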

About phonemizer

  • To generate phoneme strings, this repository uses the phonemizer library.
  • Please refer here to install phonemizer and its backend.
  • On Windows, you need additional settings to use phonemizer.
    • Please refer here
    • In a conda environment, the following commands are useful.
      conda env config vars set PHONEMIZER_ESPEAK_PATH='C:\Program Files\eSpeak NG'
      conda env config vars set PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'
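
After installing the backend, a quick way to verify that phonemizer can find eSpeak NG is the snippet below (a minimal check; the language and backend options follow phonemizer's documented API):

from phonemizer import phonemize

# Should print an IPA-like string such as 'həloʊ wɜːld' if eSpeak NG is correctly set up.
print(phonemize('hello world', language='en-us', backend='espeak', strip=True))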

Run

Command

Single GPU

python Train.py -hp <path> -s <int>
  • -hp <path>

    • The hyper parameter file path.
    • This is required.
  • -s <int>

    • The resume step parameter.
    • Default is 0.
    • If the value is 0, the model tries to find the latest checkpoint.
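
For example, to train with auto-resume from the latest checkpoint, or to resume explicitly from a specific step (the step value is only an example):

python Train.py -hp Hyper_Parameters.yaml
python Train.py -hp Hyper_Parameters.yaml -s 100000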

Multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322

TODO

  • Verification

License

MIT License

