full code for usage after finetuning?

Question

karen-pal opened this issue a year ago

Hello, I was able to fine-tune a model by following your instructions. Thank you for the repo. I'm currently stuck trying to use the output model.

This is what I've got so far:

from audiocraft.models import MusicGen
import torch

# Using small model, better results would be obtained with `medium` or `large`.
model = MusicGen.get_pretrained('small')

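# Replace the language-model weights with the fine-tuned checkpoint.
# map_location='cpu' avoids device mismatches; load_state_dict copies onto the model's device.
model.lm.load_state_dict(torch.load('models/lm_final.pt', map_location='cpu'))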

When I run this, I get what looks like a success message:

<All keys matched successfully>

But when I try to generate:

from audiocraft.utils.notebook import display_audio

output = model.generate(
    descriptions=[
        PROMPT_IN_TRAIN_DATASET_1,
        PROMPT_IN_TRAIN_DATASET_2,
    ],
    progress=True
)
display_audio(output, sample_rate=32000)

I don't get anything similar to the training dataset. Can anyone help me out? Am I doing something wrong? I have the feeling I'm not loading my local fine-tuned model correctly.

Thanks

chavez · Answer 1 · Sat Jun 17 2023 23:50:27 GMT+0800 (China Standard Time)

Currently only overfit works.
I was only able to test it on overfit twice, at 3 a.m., before I pushed most of the code here, so I assumed it would work on larger datasets.
Although I think the main reason is data:

chavez · Answer 2 · Sat Jun 17 2023 23:50:42 GMT+0800 (China Standard Time)

Note that this was with 2h of audio

Fonk · Answer 3 · Tue Jun 20 2023 16:26:52 GMT+0800 (China Standard Time)

Hey there, was your dataset well prepared? The audio must be cut into 35-second parts, as mono .wav files with 16-bit depth at 32,000 Hz. I was able to fine-tune from the small model with my own music and got relatively good results (training hyperparameters: lr 0.0000001, 10 epochs, use_wandb 1).
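
The format requirements above (mono, 16-bit, 32,000 Hz, 35-second parts) can be checked and applied with a short script. The following is a minimal sketch using only Python's standard `wave` module; the function names `wav_format_problems` and `split_wav` are hypothetical helpers, not part of the trainer, and the sketch does not resample or downmix, so files that fail the check would need converting with a separate tool first.

```python
import wave
from pathlib import Path

TARGET_CHANNELS = 1          # mono
TARGET_SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit
TARGET_SAMPLE_RATE = 32000   # Hz
CHUNK_SECONDS = 35           # part length mentioned in the thread


def wav_format_problems(path):
    """Return a list of format problems for a WAV file (empty if it matches)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != TARGET_SAMPLE_WIDTH:
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != TARGET_SAMPLE_RATE:
            problems.append(
                f"expected {TARGET_SAMPLE_RATE} Hz, got {wav.getframerate()} Hz"
            )
    return problems


def split_wav(path, out_dir, chunk_seconds=CHUNK_SECONDS):
    """Write consecutive full-length chunks of `path` to `out_dir`.

    A trailing remainder shorter than `chunk_seconds` is dropped.
    Returns the list of chunk paths written.
    """
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = chunk_seconds * src.getframerate()
        n_chunks = src.getnframes() // frames_per_chunk
        for index in range(n_chunks):
            frames = src.readframes(frames_per_chunk)
            out_path = out_dir / f"{path.stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as dst:
                dst.setparams(params)    # frame count in the header is fixed on close
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Running `wav_format_problems` over every file before training is a cheap way to rule out data preparation as the cause of poor results.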

chavez · Answer 4 · Tue Jun 20 2023 22:45:15 GMT+0800 (China Standard Time)

> Hey there, was your dataset well prepared? The audio must be cut into 35-second parts, as mono .wav files with 16-bit depth at 32,000 Hz. I was able to fine-tune from the small model with my own music and got relatively good results (training hyperparameters: lr 0.0000001, 10 epochs, use_wandb 1).

Mind linking the wandb logs?

Fonk · Answer 5 · Tue Jun 20 2023 22:57:25 GMT+0800 (China Standard Time)

> > Hey there, was your dataset well prepared? The audio must be cut into 35-second parts, as mono .wav files with 16-bit depth at 32,000 Hz. I was able to fine-tune from the small model with my own music and got relatively good results (training hyperparameters: lr 0.0000001, 10 epochs, use_wandb 1).
>
> Mind linking the wandb logs?

I'm sorry, I wrote the argument down but did not actually use it for this training run. I'll share logs for the next one.

Amilia · Answer 6 · Tue Jun 27 2023 15:38:23 GMT+0800 (China Standard Time)

Recently I tried to train MusicGen on a TTS task and failed. With a small dataset, val_loss bottomed out at a value that was still high. With a larger dataset, about 500 h, val_loss and train_loss dropped very slowly.

Karen Palacio · Answer 7 · Wed Jun 28 2023 07:10:12 GMT+0800 (China Standard Time)

I tried again with a small but more compact dataset (many examples with the same annotation) and was able to train it successfully by overfitting: I trained for 30 epochs with all other hyperparameters at their defaults. Of course, this resulted in a complete collapse of the model; afterwards it was unable to generate anything for any other prompt.

chavez · Answer 8 · Wed Jun 28 2023 11:43:06 GMT+0800 (China Standard Time)

@jidanhuang @karen-pal I think the solution could be to just use the original shape (1, 4, 1500) rather than the probs (1, 4, 1500, 2048).
I could try it later, but I don't have time right now.
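
One possible reading of this suggestion (an assumption, since the thread does not point at the exact code) is that the loss should be computed against the integer codes, with shape (1, 4, 1500), instead of a one-hot or probability tensor of shape (1, 4, 1500, 2048). The framework-free sketch below shows that the two targets give the same cross-entropy when the probabilities are one-hot, so the index form loses nothing. The batch dimension is omitted, all function names are hypothetical, and tiny sizes stand in for 4 codebooks, 1500 steps and a 2048-entry vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def loss_from_indices(logits, codes):
    """Mean cross-entropy with integer targets.

    logits: nested list [K][T][V]; codes: nested list [K][T] of ints in range(V).
    """
    total, count = 0.0, 0
    for book_logits, book_codes in zip(logits, codes):
        for step_logits, code in zip(book_logits, book_codes):
            total -= log_softmax(step_logits)[code]
            count += 1
    return total / count


def loss_from_one_hot(logits, one_hot):
    """Same loss, but with targets given as [K][T][V] probability rows."""
    total, count = 0.0, 0
    for book_logits, book_rows in zip(logits, one_hot):
        for step_logits, row in zip(book_logits, book_rows):
            log_probs = log_softmax(step_logits)
            total -= sum(p * lp for p, lp in zip(row, log_probs))
            count += 1
    return total / count


def to_one_hot(codes, vocab_size):
    """Expand [K][T] integer codes into [K][T][V] one-hot rows."""
    return [
        [[1.0 if v == code else 0.0 for v in range(vocab_size)] for code in book]
        for book in codes
    ]
```

In PyTorch, `torch.nn.functional.cross_entropy` accepts class-index targets directly, provided the class dimension of the logits is moved to position 1.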

Amilia · Answer 9 · Wed Jun 28 2023 16:18:11 GMT+0800 (China Standard Time)

Sorry, but I don't understand which part of the code should be changed to use the original shape.

jbm · Answer 10 · Sat Jul 01 2023 05:56:57 GMT+0800 (China Standard Time)

> Currently only overfit works.
> I was only able to test it on overfit twice, at 3 a.m., before I pushed most of the code here, so I assumed it would work on larger datasets.
> Although I think the main reason is data:

How does the model behave with overfitting? Is there some obvious failure mode? Like, does everything sound similar or does output just significantly degrade in audio quality?
I guess I'm just curious what you mean when you say that overfit "works"... ?

Karen Palacio · Answer 11 · Sat Jul 01 2023 11:08:14 GMT+0800 (China Standard Time)

> > Currently only overfit works.
> > I was only able to test it on overfit twice, at 3 a.m., before I pushed most of the code here, so I assumed it would work on larger datasets.
> > Although I think the main reason is data:
>
> How does the model behave with overfitting? Is there some obvious failure mode? Like, does everything sound similar or does output just significantly degrade in audio quality? I guess I'm just curious what you mean when you say that overfit "works"... ?

I also overfitted the model. The loss got very low and the model couldn't produce anything but the fine-tuning dataset. It stopped being able to make any other sounds: no more lofi beats or techno, nothing! I trained it on my voice to see if I could add background music and so on, but it became impossible for the model to produce any other type of sound or to modify a sound according to an instruction. You can hear the result here: https://www.instagram.com/reel/CuDvXSetzKX/

jbm · Answer 12 · Wed Jul 05 2023 23:19:28 GMT+0800 (China Standard Time)

Ah, okay, so it loses any capacity for generalization. Makes sense. Just out of curiosity, what learning rate did you use? Also, did you use a learning rate scheduler—if so, which one?

EDIT — Sorry, yes, I see you used the default params.

Liujingxiu23 · Answer 13 · Mon Jul 17 2023 12:05:11 GMT+0800 (China Standard Time)

@karen-pal
Hi, I want to train the model too and would like some help.
Does the model work when you use "lr 0.0000001"? That learning rate is really small. Does the model really update its parameters on your dataset, or does it effectively stay in the pretrained state without updating?

Also, two weeks ago you said you "overfitted the model". Did the model overfit with lr 0.0000001?

How many hours of training data did you use?

Liujingxiu23 · Answer 14 · Mon Jul 17 2023 15:45:34 GMT+0800 (China Standard Time)

> Recently I tried to train MusicGen on a TTS task and failed. With a small dataset, val_loss bottomed out at a value that was still high. With a larger dataset, about 500 h, val_loss and train_loss dropped very slowly.

@jidanhuang
Hi, how did your TTS training on the 500 h dataset go? Did the model converge? Can it synthesize clear audio?

chavez · Answer 15 · Mon Jul 17 2023 16:30:28 GMT+0800 (China Standard Time)

@Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

Liujingxiu23 · Answer 16 · Mon Jul 17 2023 16:59:27 GMT+0800 (China Standard Time)

> @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

Thank you for your reply!
I will post an update on my training status in a few days.

Liujingxiu23 · Answer 17 · Mon Jul 24 2023 10:35:12 GMT+0800 (China Standard Time)

I used https://github.com/neverix/musicgen_trainer for training. I fine-tuned the pretrained small model on 35 hours of music data, using 7 GPUs with a batch size of 10 per GPU. The training loss keeps declining, to about 3.5, but the validation loss keeps rising, to 6.0+. The generated audio seems good.

I'm wondering whether the validation loss is meaningful, since we only use the unconditional training mode. Or does the training still make sense if the validation loss keeps getting larger? @chavinlo
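
A falling training loss with a rising validation loss is the usual signature of overfitting, and one common way to act on it is to keep the checkpoint from the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below is a generic illustration of that idea, not code from either trainer; `best_epoch`, `should_stop` and the `patience` parameter are hypothetical names. Whether validation loss tracks perceived audio quality for this model is exactly the open question in this comment, so listening tests remain useful alongside it.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def should_stop(val_losses, patience=3):
    """True once the last `patience` epochs all failed to beat the earlier best.

    val_losses holds one validation loss per completed epoch, oldest first.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

For a made-up history such as `[4.2, 4.0, 4.3, 4.9, 5.5, 6.0]`, `best_epoch` returns `1` and `should_stop` returns `True` with `patience=3`, so the checkpoint saved after the second epoch would be the one to keep.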

Karen Palacio · Answer 18 · Wed Sep 20 2023 02:09:34 GMT+0800 (China Standard Time)

> @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

I'm also going to try this fork.

I already trained a first model with only 5 minutes of transcribed speech, and the results are promising. I was able to add spoken text in my voice and also add some musicality. I'll keep you updated with my next results!

Amilia · Answer 19 · Wed Sep 20 2023 08:51:35 GMT+0800 (China Standard Time)

> > @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer
>
> I'm also going to try this fork.
>
> I already trained a first model with only 5 minutes of transcribed speech, and the results are promising. I was able to add spoken text in my voice and also add some musicality. I'll keep you updated with my next results!

This fork says that removing the gradient scaler, increasing the batch size, and training only on conditional samples makes training work. So I removed the gradient scaler and trained only on conditional samples in this chavinlo/musicgen_trainer project, with a training batch size of 16. However, it sounds like this made no difference to the training result. I can only reduce the validation loss to 3.88 with a 500-hour music training dataset. The generated music sounds neither good nor bad. I'm not sure whether there's anything wrong with my code.
Maybe I will try the official training code and report my results.
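
Two of the changes mentioned here are easy to express in isolation. The sketch below assumes a hypothetical sample layout (a dict with an `"audio"` path and an optional `"description"` string) that is not taken from either repository: `keep_conditional` drops samples without a text description, and `effective_batch_size` shows how per-device batch size, device count and gradient accumulation combine. Removing the gradient scaler is specific to each training loop and is not shown.

```python
def keep_conditional(samples):
    """Keep only samples that carry a non-empty text description.

    Each sample is assumed to be a dict like
    {"audio": "clip_0001.wav", "description": "lofi beat with piano"}.
    """
    return [s for s in samples if (s.get("description") or "").strip()]


def effective_batch_size(per_device, devices=1, accumulation_steps=1):
    """Number of samples contributing to each optimizer step."""
    return per_device * devices * accumulation_steps
```

With the setup reported earlier in the thread (7 GPUs, batch size 10 each), `effective_batch_size(10, devices=7)` gives 70, while a single-device batch size of 16 would need several accumulation steps to reach a comparable value.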

chavez · Answer 20 · Thu Sep 21 2023 02:13:30 GMT+0800 (China Standard Time)

Yeah, I strongly suggest using the official repository, as it now has training code.
I don't plan on updating this repo right now either; I'm too busy with other things.