chavinlo / musicgen_trainer

simple trainer for musicgen/audiocraft

full code for usage after finetuning?

karen-pal opened this issue

Hello, I was able to fine-tune a model following your instructions. Thank you for your repo. I'm currently stuck trying to use the output model.

This is what I've got so far:

from audiocraft.models import MusicGen
import torch

# Using small model, better results would be obtained with `medium` or `large`.
model = MusicGen.get_pretrained('small')

model.lm.load_state_dict(torch.load('models/lm_final.pt'))

When I run this, I get something that looks like a success message:

<All keys matched successfully>

But when I try to generate:

from audiocraft.utils.notebook import display_audio

output = model.generate(
    descriptions=[
        PROMPT_IN_TRAIN_DATASET_1,
        PROMPT_IN_TRAIN_DATASET_2,
    ],
    progress=True
)
display_audio(output, sample_rate=32000)

I don't get anything similar to the training dataset. Can anyone help me out? Am I doing something wrong? I have the feeling I'm not loading my local fine-tuned model correctly.

Thanks

commented

Currently only overfit works.
I was only able to test it on overfitting twice, at 3 am, before I pushed most of the code here, so I assumed it was going to work on larger datasets.
Although I think the main reason is data:
image

commented

Note that this was with 2h of audio

commented

Hey there, was your dataset well prepared? The sounds must be correctly cut into 35-second parts, as .wav files in mono, 16-bit depth, at 32,000 Hz. I was able to fine-tune from the small model with my own music and got relatively good results (hyperparameters of my training: lr 0.0000001, 10 epochs, use_wandb 1).
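
The format requirements in the comment above (35-second parts, mono, 16-bit, 32,000 Hz) can be checked before training. What follows is a minimal sketch using only Python's standard `wave` module, not code from this repo. The function names `check_wav_format` and `split_wav` are hypothetical, and dropping the trailing remainder shorter than 35 seconds is an assumption rather than something the thread specifies. The sketch does not resample or downmix; files that fail the check would need converting with another tool first.

```python
import wave
from pathlib import Path

TARGET_RATE = 32000       # Hz, as stated in the comment above
TARGET_CHANNELS = 1       # mono
TARGET_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit
CHUNK_SECONDS = 35


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != TARGET_SAMPLE_WIDTH:
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != TARGET_RATE:
            problems.append(f"expected {TARGET_RATE} Hz, got {wav.getframerate()} Hz")
    return problems


def split_wav(path, out_dir, chunk_seconds=CHUNK_SECONDS):
    """Write consecutive full-length chunks of `path` into `out_dir`.

    Assumption: a trailing remainder shorter than `chunk_seconds` is dropped.
    Returns the list of written chunk paths.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = chunk_seconds * src.getframerate()
        n_chunks = src.getnframes() // frames_per_chunk
        for i in range(n_chunks):
            frames = src.readframes(frames_per_chunk)
            chunk_path = out_dir / f"{path.stem}_{i:04d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                dst.setparams(params)   # frame count in the header is fixed on close
                dst.writeframes(frames)
            written.append(chunk_path)
    return written
```

A typical use would be to call `check_wav_format` on every source file, convert any file that reports problems, and then run `split_wav` on the rest.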

commented

Mind linking the wandb logs?

commented

I'm sorry, I wrote the argument but I did not even use it for this training. I'll share logs for the next one.

commented

Recently I tried to train musicgen on a TTS task and failed. I found that if I used a small dataset, val_loss would bottom out at a value that was still high. If I used a larger dataset, about 500h, val_loss and train_loss would drop very slowly.

I tried again with a small but more compact dataset (many examples with the same annotation) and was able to train it successfully by overfitting. I trained it for 30 epochs, with all other hyperparameters at their defaults. Of course, it resulted in a complete collapse of the model, though: afterwards it was unable to generate anything for any other prompt.

commented

@jidanhuang @karen-pal I think the solution could be to just use the original shape (1, 4, 1500) rather than the probs (1, 4, 1500, 2048).
I could try later, but I don't have time right now.
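
One possible reading of this suggestion (an interpretation, not something confirmed in the thread) is that the cross-entropy target can be the integer codes of shape (1, 4, 1500) directly, instead of a one-hot probability tensor of shape (1, 4, 1500, 2048). The plain-Python sketch below only illustrates that the two forms give the same loss value. The shapes are scaled down (4 codebooks, 6 steps, 8 entries) and the data is random, and none of the names come from the trainer code.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ce_from_index(logits, target_index):
    """Cross-entropy when the target is a single class index."""
    return -log_softmax(logits)[target_index]


def ce_from_probs(logits, target_probs):
    """Cross-entropy when the target is a full probability vector."""
    log_p = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_p))


# Scaled-down stand-ins for codes of shape (4, 1500) with 2048 entries each.
K, T, CARD = 4, 6, 8
rng = random.Random(0)
codes = [[rng.randrange(CARD) for _ in range(T)] for _ in range(K)]
logits = [[[rng.gauss(0.0, 1.0) for _ in range(CARD)] for _ in range(T)]
          for _ in range(K)]

# One-hot expansion of the codes: shape (K, T, CARD).
one_hot = [[[1.0 if c == codes[k][t] else 0.0 for c in range(CARD)]
            for t in range(T)] for k in range(K)]

# Mean loss over all codebooks and time steps, computed both ways.
loss_index = sum(ce_from_index(logits[k][t], codes[k][t])
                 for k in range(K) for t in range(T)) / (K * T)
loss_probs = sum(ce_from_probs(logits[k][t], one_hot[k][t])
                 for k in range(K) for t in range(T)) / (K * T)

print(abs(loss_index - loss_probs) < 1e-9)
```

If this reading is right, the index form avoids materializing the extra 2048-wide dimension for the targets. Whether that change fixes the training problem discussed here is not established by the thread.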

commented

Sorry, but I don't understand which part of the code should be changed to use the original shape.

commented

How does the model behave with overfitting? Is there some obvious failure mode? Like, does everything sound similar or does output just significantly degrade in audio quality?
I guess I'm just curious what you mean when you say that overfit "works"... ?

I also overfitted the model... The loss got very low and the model couldn't make anything but the fine-tuning dataset. It stopped being able to make any other sounds: no more lofi beats, no techno, nothing! I trained it on my voice to see if I could add background music and so on, but it became impossible for the model to produce any other type of sound or to modify a sound according to an instruction. You can hear the result here: https://www.instagram.com/reel/CuDvXSetzKX/

commented

Ah, okay, so it loses any capacity for generalization. Makes sense. Just out of curiosity, what learning rate did you use? Also, did you use a learning rate scheduler—if so, which one?

EDIT — Sorry, yes, I see you used the default params.

@karen-pal
Hi, I want to train the model too and would like the same help.
Does the model work when you use "lr 0.0000001"? That lr is really small. Does the model really update its parameters using your dataset, or does it effectively stay in the pretrained state without updating?

Also, two weeks ago you said you "overfitted the model." Did your model overfit with lr 0.0000001?

How many hours of training data did you use?

@jidanhuang
Hi, how did your TTS training on the 500h dataset go? Did the model converge? Can it synthesize clear audio?

@Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

Thank you for your reply!
I will update my training status after a few days.

I used https://github.com/neverix/musicgen_trainer for training. I fine-tuned the pretrained small model on 35 hours of music data, using 7 GPUs with a batch size of 10 per GPU. The training loss keeps declining, to about 3.5, but the validation loss keeps rising, to 6.0+. The generated audio seems good.

I'm wondering whether the validation loss is meaningful, since we only use unconditional training mode. Or does the training still make sense if the validation loss keeps getting larger? @chavinlo
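
If the validation loss is trusted as a signal (which is exactly what the comment above questions), a falling training loss combined with a rising validation loss is usually handled by keeping the checkpoint from the epoch with the lowest validation loss. The sketch below is a generic early-stopping helper, not part of either repo. The function name `early_stop_epoch`, the `patience` parameter, and the example loss values are all hypothetical.

```python
def early_stop_epoch(valid_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    best_epoch: index of the lowest validation loss seen before stopping.
    stop_epoch: first index at which `patience` epochs have passed without
                improvement, or None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Hypothetical per-epoch validation losses, for illustration only.
example_losses = [5.0, 4.6, 4.8, 5.2, 5.9, 6.1]
print(early_stop_epoch(example_losses, patience=3))
```

This only formalizes the "keep the best checkpoint" heuristic. It does not settle whether validation loss computed in unconditional mode is a fair measure for a model that is used with text prompts.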

I'm also going to be trying this fork.

I already trained a first model with only 5 min of transcribed speech and the results are promising... I was able to add spoken text with my voice and also add some musicality. I'll keep you updated with my next results!

This fork says that removing the gradient scaler, increasing the batch size, and training only on conditional samples makes training work. So I just removed the gradient scaler and trained only on conditional samples in this chavinlo/musicgen_trainer project, with a training batch size of 16. However, it sounds like it made no difference to the training results. I can only get the validation loss down to 3.88 with a 500-hour music training dataset. The generated music sounds neither good nor bad. I'm not sure if there's anything wrong with my code.
Maybe I will try the official training code and report my results.

Yeah, I strongly suggest using the official repository, as it now has training code.
I don't plan on updating this repo right now either; I'm too busy with other things.