yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models


g_loss is None in second stage training

haoqi opened this issue · comments

Thank you for the code and work.
I'm trying to run the second-stage training and I hit the breakpoint because g_loss is None. Any thoughts on what could cause this?

set_trace()
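For reference, a self-contained version of the kind of non-finite-loss guard that leads to this breakpoint could look like the sketch below; the check_finite helper is hypothetical and only illustrates the check, it is not the repo's code.

```python
import math
import torch

def check_finite(name, loss):
    """Drop into a debugger as soon as a loss is None or non-finite (hypothetical helper)."""
    if loss is None:
        print(f"{name} is None, entering debugger")
        breakpoint()
        return
    value = loss.item() if torch.is_tensor(loss) else float(loss)
    if not math.isfinite(value):
        print(f"{name} is {value}, entering debugger")
        breakpoint()

# usage inside the training loop, e.g.: check_finite("g_loss", g_loss)
```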

When did this happen? Was it before or after diffusion model training started? Before or after SLM adversarial training? I have noticed it happening several times myself, which is why I put a set_trace there, and there are a few possible reasons:

  1. The prosodic style encoder is not initialized correctly (which is unlikely as long as you don’t change the code and stage one checkpoint is valid).
  2. The discriminator kicks in too early (i.e., diff_epoch is too low).
  3. SLM adversarial training makes the model unstable (so you need to make skip update higher and clip scale lower, though this is unlikely because the current setting works for many different datasets).
  4. PL-BERT is not set up correctly (for example, you aren’t training an English model but used PL-BERT trained on English).

It happens within the first epoch of training with train_second.py. I am training on LJSpeech.

What is your config? With the settings in this repo I don't have this issue, so it's probably related to things like the learning rate, batch size, etc.

Also, check whether your first-stage model has reasonable reconstruction quality in TensorBoard. It should be perceptually indistinguishable from the ground truth; otherwise something is wrong with your first stage too.

I kept most of your config, except that I increased the batch size and learning rate since I use 8 GPUs with more memory: I set batch_size: 48 and increased the learning rate by 3x. By reconstruction you mean the audio, right? I checked the audio in the eval tab and it sounds good.
However, I found that gen_loss keeps increasing and d_loss does not decrease once the epoch exceeds TMA_epoch. Is that unexpected?

You should not increase the learning rate by 3x, especially for PL-BERT; I believe this is where the problem is. I suggest keeping the learning rate unchanged even though you have a larger batch size. The highest batch size I have tried is 32, with the same learning rate. The demo samples on styletts2.github.io were generated by a model trained with a batch size of 32 and the exact same learning rate (they are slightly different from the model trained with a batch size of 16, but the quality is pretty much the same).

The following is the learning curve I have for the first-stage model. If this is what you see in your TensorBoard too, it should be fine. The loss increase is mostly caused by the feature matching loss, as the features become harder and harder to match while the discriminator overfits. See Figure 3 of https://dl.acm.org/doi/pdf/10.1145/3573834.3574506; this is normal.
(screenshot: first-stage training loss curves)
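For reference, a feature-matching loss of this kind is typically the L1 distance between the discriminator's intermediate feature maps for real and generated audio, which is why it naturally grows as the discriminator overfits and its features drift. A minimal sketch of the general formulation, not necessarily the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """Sum of L1 distances between per-layer discriminator feature maps.

    real_feats / fake_feats: lists of tensors produced by the discriminator
    for ground-truth and generated waveforms, one entry per layer.
    """
    loss = 0.0
    for r, f in zip(real_feats, fake_feats):
        loss = loss + F.l1_loss(f, r.detach())
    return loss
```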

Thank you for sharing this. I think my stage-1 training loss trajectory looks good based on the comparison. I'm trying what you suggested, and so far no issues have shown up in the first several epochs. I will continue training and keep you posted. Thank you again.

Hi, I found the same issue happening again in the 9th epoch of second-stage training: loss_mel is NaN. I use a batch size of 32 with 8 GPUs, and everything else is the same as your config.

This is so weird. Can you try to lower it to 16 instead? Does it still happen if the batch size is 16?

Any update on batch size 16? Or is it because you used a different learning rate for the first stage model?

In the second stage of training I kept the batch size at 16, and the NaN issue has not appeared again with 8-GPU training.
However, once the epoch reaches 50, which is the joint_epoch set in the config, I run into this error:

Traceback (most recent call last):
  File "StyleTTS2/train_second.py", line 789, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "StyleTTS2/train_second.py", line 497, in main
    loss_gen_lm.backward()
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1, 1]] is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The issue happens during the backward pass for loss_gen_lm. My PyTorch version is 2.1.0.
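One generic way to locate the offending in-place operation is PyTorch's anomaly detection, which records the forward-pass stack trace of the op that later fails in backward. This is standard PyTorch rather than anything specific to this repo, and it slows training noticeably, so it is only worth enabling while debugging:

```python
import torch

# Re-runs the forward pass with extra bookkeeping so the RuntimeError above
# also prints where the offending tensor was produced or modified.
torch.autograd.set_detect_anomaly(True)
```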

This is likely caused by having too many GPUs but too few samples in a batch. Can you change batch_percentage to 1 instead?

Hi, the error still exists after setting batch_percentage to 1.
I have not dug deeply into the code, but I had a quick look at lines 488-495. Is the error related to the issue described in https://stackoverflow.com/questions/69163522/one-of-the-variables-modified-by-an-inplace-operation, i.e. this branch?

if d_loss_slm != 0:

Your errors are so weird; it all works fine for me. Can you use 4 GPUs instead of 8? Or could it be related to the CUDA version?

Or I guess this codebase probably has some bugs with PyTorch, because it has several weird issues: predictor_encoder.train() makes the F0 loss higher, there is high-frequency background noise on old GPUs, it causes NaN with batch size 32, etc. I hope someone can reimplement everything, because there's probably something wrong in my code. The training pipeline was all written by myself rather than modified from an existing codebase (except for a few modules like iSTFTNet, the diffusion models, etc.), so weird glitches are very likely.

Hi, thank you for sharing your thoughts. I don't think this is related to the GPUs: after setting a breakpoint, I found that the error happens when both d_loss_slm and loss_gen_lm are non-zero, and when d_loss_slm is 0 it runs without errors. I guess it is related to calling backward() twice.
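For anyone hitting the same error: the classic way this message appears is when two losses share part of a computation graph and an in-place parameter update happens between the two backward() calls. A tiny standalone reproduction, deliberately unrelated to the StyleTTS2 code and only meant to illustrate the mechanism:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
layer = torch.nn.Linear(3, 3)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

out = layer(x)                      # layer.weight is saved for the backward pass
loss_a = out.pow(2).mean()
loss_b = out.abs().mean()

loss_a.backward(retain_graph=True)  # fine
opt.step()                          # updates layer.weight in place
loss_b.backward()                   # RuntimeError: ... modified by an inplace operation
```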

Does it cause different behavior though?

Hi, I am facing the same problem. I did not change the learning rate; I changed the batch size and max_len. The NaN values appear in the very first step of training.
My config was:

log_dir: "Models/LJSpeech"
first_stage_path: "/home/ubuntu/projects/python/akshat/StyleTTS2/Models/LJSpeech/epoch_1st_00130.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 12
max_len: 100 # maximum number of frames
pretrained_model: ""
second_stage_load_pretrained: False # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters

F0_path: "Utils/JDC/bst.t7"
ASR_config: "Utils/ASR/config.yml"
ASR_path: "Utils/ASR/epoch_00080.pth"
PLBERT_dir: 'Utils/PLBERT/'

data_params:
  train_data: "Data/train_list.txt"
  val_data: "Data/val_list.txt"
  root_path: "LJSpeech-1.1/wavs"
  OOD_data: "Data/OOD_texts.txt"
  min_length: 50 # sample until texts with this size are obtained for OOD texts

preprocess_params:
  sr: 24000
  spect_params:
    n_fft: 2048
    win_length: 1200
    hop_length: 300

model_params:
  multispeaker: false

  dim_in: 64
  hidden_dim: 512
  max_conv_dim: 512
  n_layer: 3
  n_mels: 80

  n_token: 178 # number of phoneme tokens
  max_dur: 50 # maximum duration of a single phoneme
  style_dim: 128 # style vector size

  dropout: 0.2

  # config for decoder
  decoder:
    type: 'istftnet' # either hifigan or istftnet
    resblock_kernel_sizes: [3,7,11]
    upsample_rates: [10, 6]
    upsample_initial_channel: 512
    resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
    upsample_kernel_sizes: [20, 12]
    gen_istft_n_fft: 20
    gen_istft_hop_size: 5

  # speech language model config
  slm:
    model: 'microsoft/wavlm-base-plus'
    sr: 16000 # sampling rate of SLM
    hidden: 768 # hidden size of SLM
    nlayers: 13 # number of layers of SLM
    initial_channel: 64 # initial channels of SLM discriminator head

  # style diffusion model config
  diffusion:
    embedding_mask_proba: 0.1
    # transformer config
    transformer:
      num_layers: 3
      num_heads: 8
      head_features: 64
      multiplier: 2

    # diffusion distribution config
    dist:
      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
      mean: -3.0
      std: 1.0

loss_params:
  lambda_mel: 5. # mel reconstruction loss
  lambda_gen: 1. # generator loss
  lambda_slm: 1. # slm feature matching loss

  lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
  lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
  TMA_epoch: 48 # TMA starting epoch (1st stage)

  lambda_F0: 1. # F0 reconstruction loss (2nd stage)
  lambda_norm: 1. # norm reconstruction loss (2nd stage)
  lambda_dur: 1. # duration loss (2nd stage)
  lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
  lambda_sty: 1. # style reconstruction loss (2nd stage)
  lambda_diff: 1. # score matching loss (2nd stage)

  diff_epoch: 20 # style diffusion starting epoch (2nd stage)
  joint_epoch: 50 # joint training starting epoch (2nd stage)

optimizer_params:
  lr: 0.0001 # general learning rate
  bert_lr: 0.00001 # learning rate for PLBERT
  ft_lr: 0.00001 # learning rate for acoustic modules

slmadv_params:
  min_len: 400 # minimum length of samples
  max_len: 500 # maximum length of samples
  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
  iter: 10 # update the discriminator every this iterations of generator update
  thresh: 5 # gradient norm above which the gradient is scaled
  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
  sig: 1.5 # sigma for differentiable duration modeling

(screenshot of the training output attached)
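Since the NaNs appear in the very first step, one quick sanity check is to verify that the first-stage checkpoint being loaded contains only finite values. A generic sketch, using the first_stage_path from the config above; the exact checkpoint layout may differ, so it simply walks whatever nested dict it finds:

```python
import torch

def iter_tensors(obj, prefix=""):
    """Yield (name, tensor) pairs from a possibly nested checkpoint dict."""
    if torch.is_tensor(obj):
        yield prefix, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_tensors(value, f"{prefix}.{key}" if prefix else str(key))

ckpt = torch.load("/home/ubuntu/projects/python/akshat/StyleTTS2/Models/LJSpeech/epoch_1st_00130.pth",
                  map_location="cpu")
bad = [name for name, t in iter_tensors(ckpt)
       if t.is_floating_point() and not torch.isfinite(t).all()]
print("non-finite tensors:", bad if bad else "none")
```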

Also, when I tried to train the second stage from the checkpoint on Hugging Face, it worked fine. One thing I noticed is that the checkpoint trained from scratch is about 1.7 GB, while the one on Hugging Face is about 700 MB. Am I doing something wrong with stage-1 training, or are you not saving the discriminators in the Hugging Face checkpoint?
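On the size question, an easy way to see what accounts for the difference is to compare how many bytes each top-level entry of the two checkpoint files holds, since a training checkpoint may carry optimizer state and discriminator weights that an inference-only release does not. A generic sketch; the filename below is hypothetical and the code works on any nested dict of tensors:

```python
import torch

def tensor_bytes(obj):
    """Recursively sum the storage size of every tensor in a nested dict."""
    if torch.is_tensor(obj):
        return obj.numel() * obj.element_size()
    if isinstance(obj, dict):
        return sum(tensor_bytes(v) for v in obj.values())
    return 0

ckpt = torch.load("epoch_2nd_00049.pth", map_location="cpu")  # hypothetical filename
for key, value in ckpt.items():
    print(f"{key}: {tensor_bytes(value) / 1e6:.1f} MB")
```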

@yl4579 Could you please share your loss charts for the diffusion and duration losses? My model's diffusion loss doesn't seem to be decreasing, and I'm curious what a successful run's diffusion loss looks like.