DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.

Train a Finnish checkpoint from scratch

Annie-Zhou1997 opened this issue

If I want to train a Finnish checkpoint from scratch using Toucan, the dataset I am using is css10fi, which has about 10.5 hours of data. Approximately how many steps should I train to achieve good results? I have already trained up to 280k steps, but the quality is still bad and can't match that of the Finnish produced by the pre-trained checkpoint. I look forward to your reply!

280k steps is already a lot; usually, training from scratch on this amount of data should be done after 100k steps. The data is also of decent quality, so I don't think that's the issue either. Have you changed any settings, like the batch size or the learning rate?

Thank you very much for your prompt reply! I haven't changed any settings, but I'm a bit confused about the training setup and not sure whether I made the changes correctly. First, I modified the build_path_to_transcript_dict_css10fi function to point to my actual dataset path:

def build_path_to_transcript_dict_css10fi():
    path_to_transcript = dict()
    language = "finnish"
    with open("/scratch/s5480698/fi/transcript.txt", encoding="utf8") as f:
        transcriptions = f.read()
    trans_lines = transcriptions.split("\n")
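    # each transcript line is expected to be pipe-separated, roughly
    # <relative wav path>|<original text>|<normalized text>|<duration>,
    # so field 0 is the audio file and field 2 the normalized transcript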
    for line in trans_lines:
        if line.strip() != "":
            path_to_transcript[f"/scratch/s5480698/fi/{line.split('|')[0]}"] = \
                line.split("|")[2]
    return limit_to_n(path_to_transcript)

Then I copied the finetune_example, only changing the dataset part:

finnish_datasets = list()
finnish_datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_css10fi(),
                                                corpus_dir=os.path.join(PREPROCESSING_DIR, "CSS10fi"),
                                                lang="fi", fine_tune_aligner=False, ctc_selection=False))

all_train_sets.append(ConcatDataset(finnish_datasets))

model = ToucanTTS()
if use_wandb:
    wandb.init(
        name=f"{__name__.split('.')[-1]}_{time.strftime('%Y%m%d-%H%M%S')}" if wandb_resume_id is None else None,
        id=wandb_resume_id, resume="must" if wandb_resume_id is not None else None)
print("Training model")
train_loop(net=model,
           datasets=all_train_sets,
           device=device,
           save_directory=save_dir,
           batch_size=12,  # YOU MIGHT GET OUT OF MEMORY ISSUES ON SMALL GPUs, IF SO, DECREASE THIS.
           eval_lang="fi",  # THE LANGUAGE YOUR PROGRESS PLOTS WILL BE MADE IN
           warmup_steps=500,
           lr=1e-5,  # if you have enough data (over ~1000 datapoints) you can increase this up to 1e-3 and it will still be stable, but learn quicker.
           # DOWNLOAD THESE INITIALIZATION MODELS FROM THE RELEASE PAGE OF THE GITHUB OR RUN THE DOWNLOADER SCRIPT TO GET THEM AUTOMATICALLY
           # path_to_checkpoint=os.path.join(MODELS_DIR, "ToucanTTS_Meta", "best.pt") if resume_checkpoint is None else resume_checkpoint,
           path_to_embed_model=os.path.join(MODELS_DIR, "Embedding", "embedding_function.pt"),
           fine_tune=False,  # if resume_checkpoint is None and not resume else finetune,
           resume=resume,
           steps=300000,
           use_wandb=use_wandb)
if use_wandb:
    wandb.finish()

Since I wanted to train from scratch, I commented out the line that loads the meta pre-trained checkpoint. Then I imported the pipeline with

from TrainingInterfaces.TrainingPipelines.ToucanTTS_Finnish import run as finnish

and used this command to train:

python3 run_training_pipeline.py finnish --gpu_id 0
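In case it helps, this is roughly how I registered the new pipeline in run_training_pipeline.py (a sketch from memory; the dict name and the exact call may differ in the actual file):

# run_training_pipeline.py (abbreviated sketch)
from TrainingInterfaces.TrainingPipelines.ToucanTTS_Finnish import run as finnish

pipeline_dict = {
    # ... the pipelines that ship with the repo ...
    "finnish": finnish,
}

# the pipeline name given on the command line is looked up in this dict and
# its run function is then called with the parsed arguments (gpu_id, resume flags, etc.)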
I'm not sure exactly where the problem is. Thank you very much!

You did everything correctly, that's all good :)

The problem is most likely that some of the settings in the finetune_example are set up specifically for finetuning.

Try a higher learning rate and more warmup steps when training from scratch. For the lr, something between 0.001 and 0.0005 usually works well, and for the warmup steps I would go with a few thousand, maybe 4000.
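In the train_loop call from your script, that would mean changing just these two arguments (rough sketch, everything else stays as you have it):

train_loop(net=model,
           datasets=all_train_sets,
           device=device,
           save_directory=save_dir,
           batch_size=12,
           eval_lang="fi",
           warmup_steps=4000,  # a few thousand warmup steps when training from scratch
           lr=1e-3,  # anything between 1e-3 and 5e-4 should work well from scratch
           path_to_embed_model=os.path.join(MODELS_DIR, "Embedding", "embedding_function.pt"),
           fine_tune=False,
           resume=resume,
           steps=300000,
           use_wandb=use_wandb)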

I'm working on a new version that will be released in a few weeks; it might be worth trying again once that's done.

Thank you very much for your reply!
This afternoon I tried training from scratch on the LJSpeech English dataset, and by 60k steps the results were already quite good. I followed your suggestion in the comments and changed the learning rate to lr=1e-3, keeping the default warm-up steps and total training steps in the train_loop. I think there might be some issues with my Finnish dataset, as I noticed many errors in the transcription files. I enabled the CTC selection (see the sketch below) and manually corrected the dataset, hoping it will be successful this time.
I look forward to your new version and wish you success in your work!
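The CTC selection I mentioned is just the flag in the corpus preparation call, i.e. (sketch, only this one argument changed compared to my earlier snippet):

finnish_datasets.append(prepare_fastspeech_corpus(transcript_dict=build_path_to_transcript_dict_css10fi(),
                                                  corpus_dir=os.path.join(PREPROCESSING_DIR, "CSS10fi"),
                                                  lang="fi",
                                                  fine_tune_aligner=False,
                                                  ctc_selection=True))  # filter out samples where the aligner's CTC score suggests a transcript/audio mismatch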

Thanks! I'll close this issue for now and assume the problem was the mislabellings in the data. If you find that's not the case, feel free to re-open the issue.