seastar105 / pflow-encodec

Implementation of a TTS model based on the NVIDIA P-Flow TTS paper


Config setting

yiwei0730 opened this issue

```yaml
min_duration: 3.5     # minimum duration of files, this value MUST be bigger than 3.0
text2latent_rate: 1.5 # 50Hz:75Hz
seed: 998244353
sample_freq: 5000
```

and the note that trainer.use_distributed_sampler should be set to False.
I would like to ask how these parameters are used in the data config and experiment config, and what their effects are.

  1. Why must the minimum duration be set above 3 seconds?
  2. What does 50Hz:75Hz mean?
  3. Is the seed related to the sample parameter inside?
  4. What does sample_freq mean?
  5. When I use DDP, why does use_distributed_sampler need to be False? When I trained other models I always used the DistributedBucketSampler.
  1. A 3-second prompt is used during training, as in the P-Flow paper, so audio samples used for training should be longer than 3 seconds. I set it to 3.5 for convenience, to ensure all samples produce a proper loss. The loss is calculated only on the non-prompt region; you can check the P-Flow paper for more details.

  2. text2latent_rate adjusts for the mismatched frame rates between text_duration and the Encodec latent. text_duration is computed at a 50 Hz frame rate, since it is based on XLS-R, while the Encodec latent runs at 75 Hz, so the text embedding has to be upscaled to Encodec's frame rate (see the sketch after this list).

  3. seed is just the random seed used during training, so it affects things like model initialization.

  4. During the training loop, samples generated by the model are logged to TensorBoard every sample_freq steps.

  5. This is a PyTorch Lightning-specific issue. AFAIK, if use_distributed_sampler is True, Lightning will use its own distributed batch sampler, so this value should be False in order to use a custom DistributedSampler.
    Related issue: Lightning-AI/pytorch-lightning#5145
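
To make answer 2 concrete, here is a minimal sketch of stretching a text-side sequence from 50 Hz to 75 Hz by the text2latent_rate of 1.5. The function name and the use of nearest-neighbor interpolation are my assumptions for illustration, not the repo's actual implementation (the repo may expand durations instead of interpolating embeddings).

```python
import torch
import torch.nn.functional as F

def upscale_text_embedding(text_emb: torch.Tensor, ratio: float = 1.5) -> torch.Tensor:
    """Stretch a (batch, time, dim) text-side sequence from 50 Hz to 75 Hz.

    Hypothetical helper: the point is only that the text-side sequence must be
    lengthened by `ratio` so it lines up with Encodec's 75 Hz latent frames.
    """
    # F.interpolate expects (batch, dim, time) for 1D interpolation
    x = text_emb.transpose(1, 2)
    x = F.interpolate(x, scale_factor=ratio, mode="nearest")
    return x.transpose(1, 2)

emb = torch.randn(1, 100, 256)             # 100 frames at 50 Hz (2 seconds)
print(upscale_text_embedding(emb).shape)   # torch.Size([1, 150, 256]) -> 150 frames at 75 Hz
```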

I met an error when I tested the training code:

```
path = self.paths[idx]
IndexError: list index out of range
```

I added

```python
if idx > self.max_length:
    print(idx, self.max_length)
```

and it printed the idx and the max_length of the df as 1000 and 50. I don't know why max_length is just 50 while idx jumps to 1000.

@yiwei0730 I need the full traceback.
Also, what is max_length? There seems to be no max_length in this repo.

I added it in the TextLatentDataset:

```python
self.max_length = len(df.index)
```

Maybe I found the problem: if the dataset has fewer than 1000 samples, it breaks (since I use only 64 samples for validation).
If I just set the val data path to the same path as the train data (200K samples), then it works.
But I still don't know why, hahaha.

Another question: can I increase batch_durations? I saw that with a duration of 100 my GPU only fills up to about 6000 MB, but I have 48 GB.
What would be a good value for this parameter?

Oh, I got it.

sample_idx: [0, 1000, 2000, 3000, 4000, 5000]

sample_idx in the model config is used to log generated audio during the training loop.

```python
# sample with gt duration
for idx, sample_idx in enumerate(self.sample_idx):
    # pull one validation example and cut a random prompt from its latent
    text_token, duration, latent = self.trainer.datamodule.val_ds[sample_idx]
    start_idx = torch.randint(0, latent.shape[-2] - self.prompt_length, (1,))
    prompt = latent[:, start_idx : start_idx + self.prompt_length]
    sampled = self.net.generate(
        text_token.to(self.device),
        prompt.to(self.device),
        duration.to(self.device),
        upscale_ratio=self.text2latent_ratio,
    )
    write_to_tb(sampled, f"sampled/gt_dur_{idx}.wav")
```

So you can adjust this value to valid indices like [0, 1, 2, 3, 4]. Index 1000 appeared because of this setup. I'm going to add this info to the README too.
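
For reference, a simple guard against the IndexError above is to drop configured sample indices that fall outside the validation set. This is only a suggestion sketched against the snippet above, not code from the repo.

```python
# Hypothetical guard: skip configured sample indices that do not exist in the
# validation set, so a small val split (e.g. 64 files) no longer raises IndexError.
val_len = len(self.trainer.datamodule.val_ds)
valid_sample_idx = [i for i in self.sample_idx if i < val_len]

for idx, sample_idx in enumerate(valid_sample_idx):
    text_token, duration, latent = self.trainer.datamodule.val_ds[sample_idx]
    # ... rest of the sampling loop unchanged
```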

You can increase batch_durations to about 200~300; memory consumption increases roughly linearly as you increase batch_durations.

I used 100 with 4 gradient accumulation steps, so the effective batch size I used was 400. I think a bigger batch would give better results.

By the way, I remember you are working on Mandarin and English. Is there any good (24K sample rate, multi-speaker) public dataset for training Mandarin TTS? Which dataset are you using now?

Oh, thank you. I will maybe use 4 GPUs to train with batch_durations of 300; I hope it will give good results!!
For Mandarin datasets, I think AISHELL and aidatatang are good, and there may be more on OpenSLR.
But I am training on the KingASR (100K) dataset (which I think is not public, by the way) and LibriTTS (100K, which is public). If the results are good enough, I will test with more datasets.
I saw you used lang id in a new commit; does it perform well?
If you add a language setting, will it limit your generation to a certain language and prevent multi-language generation?

Thanks, I'm going to try AISHELL or aidatatang. FYI, if you are using 4 GPUs with batch_durations of 300, the effective batch_durations will be 1200, since batch_durations is per device. Good luck!
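
Just to restate the arithmetic from this thread (my own summary, not a config option): batch_durations is per device, so the effective value per optimizer step also scales with gradient accumulation and the number of GPUs.

```python
# Effective batch duration per optimizer step (units follow batch_durations).
batch_durations = 300          # per-device setting discussed above
accumulate_grad_batches = 1    # gradient accumulation steps
num_gpus = 4                   # DDP world size

effective = batch_durations * accumulate_grad_batches * num_gpus
print(effective)  # 1200
```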

For the language setting, I have not tested much yet. I tried adding a language embedding and fine-tuning from the multilingual checkpoint, but the speaker embedding was still highly entangled, so the lang id setting is still experimental. Fortunately, the fine-tuned model could still generate code-switched speech.

Did you mean accumulate_grad_batches set to 4, i.e. 4 * (duration = 300) = 1200? I hope 48 GB can handle it :>

"Fortunately, finetuned model could still generate code-switched speech."-> wow, that's surprised, that mean if you setting EN, You can still synthesize Japanese and Korean synthesized languages

I used language id drop while training, so it can run inference without a lang id.

When I compare samples generated with and without the lang id, there is no significant difference. I think the model is not trained well yet; it needs more experiments.
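
A minimal sketch of the language-id drop mentioned above, in the spirit of classifier-free-guidance-style condition dropout; the null token value and drop probability are illustrative assumptions, not the repo's exact implementation.

```python
import torch

NO_LANG_ID = 0        # hypothetical "no language" token id
LANG_DROP_PROB = 0.1  # illustrative drop probability

def maybe_drop_lang_id(lang_id: torch.Tensor, training: bool) -> torch.Tensor:
    """Randomly replace language ids with a null token during training so the
    model also learns to generate without language conditioning."""
    if not training:
        return lang_id
    drop = torch.rand(lang_id.shape, device=lang_id.device) < LANG_DROP_PROB
    return torch.where(drop, torch.full_like(lang_id, NO_LANG_ID), lang_id)
```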

Hello @yiwei0730, nice to meet you here in this repository again. I am also training with datasets in Korean, Chinese, Japanese, and English. I will share the results once they are available.

I am using the same datasets for Korean, Chinese, and Japanese as @seastar105, and for Chinese I am using AISHELL and MAGICDATA. I hope we get good results.

Ah! I forgot to tell you about some installation errors.

  1. I don't know why I can't install deepfilternet. :<
  2. generate.ipynb needs the audiocraft package installed, and after installing it I got this error:
    "ImportError: cannot import name 'set_guard_fail_hook' from 'torch._dynamo.eval_frame' (/root/anaconda3/envs/pflow-encodec/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py)"
  3. About generate.ipynb, can I understand the result like this?
    It seems the raw model output is still a bit poor, and its audio needs to be reconstructed through the decoder to get better sound quality.

@yiwei0730

  1. I'm not sure why the deepfilternet installation failed. DeepFilterNet was tried, as MetaVoice does, but it produced worse audio, so I dropped it; you do not need to install deepfilternet now. I also tried Vocos, but Vocos was not as good as MBD.
  2. Audiocraft requires torch 2.1.0, so I install it from my own forked version; you can check infer-requirements.
  3. The Encodec reconstruction quality is a bit poor, and it is the upper bound for the trained model. So I thought the model had reasonable performance and decided to use an auxiliary decoder to get better results than Encodec's reconstruction (a rough way to hear that upper bound is sketched below this list).
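
A rough way to hear the Encodec "upper bound" described in answer 3 is to round-trip a ground-truth file through Encodec and compare it to the original by ear. The sketch below assumes the standalone encodec package and a hypothetical input file name; the repo itself may wrap Encodec differently.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz Encodec model and pick a target bandwidth.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "reference.wav" is a hypothetical ground-truth file.
wav, sr = torchaudio.load("reference.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))  # list of (codes, scale) frames
    recon = model.decode(encoded_frames)             # (1, channels, time)

torchaudio.save("reference_encodec_recon.wav", recon.squeeze(0), model.sample_rate)
```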

It seems the original question was answered. Feel free to open a new issue, or reopen this one, if you have any related questions.

Yes, thank you. If I have other findings or questions, I will immediately ask and discuss them with you! By the way, I found that different codecs give different results. I tried using the latest FACodec, and the effect seems to be too much reconstruction; there is small noise in multiple raw pflow outputs.