facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Watermark model slow training (cross-posted from facebookresearch/audioseal)

christianc102 opened this issue · comments

Hi!

(This was cross-posted at facebookresearch/audioseal, but I wanted to also post it here for visibility--thanks!)

Thanks so much for the helpful training code and documentation. Apologies in advance for the naive question--I'm pretty new to machine learning.

I'm trying to train my own watermarking model at 48kHz with my own dataset on an H100 node with 8 GPUs (H100 80GB HBM3) on a remote SLURM cluster, but as I scale the batch size the training speed appears to drop proportionally. There also appears to be an unexpected behavior where I specify dataset.batch_size=k but the submitted config (logged by wandb) shows dataset.batch_size=k/8.

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps / second and GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which yielded dataset.batch_size=16), training speed dropped to around 0.3 steps / second. Based on these results, it seems to me that parallelization isn't working the way it should?

I've tried preprocessing my dataset to one-second clips and removing some of the augmentations (even running an experiment with only noise augmentations) to try to increase GPU utilization, but nothing I've tried has improved the training speed.

Is this to be expected? Roughly how long did the original AudioSeal model take to train, using what amount of compute?

Thank you so much!

Hi! Can you paste your run command here so I can make sure you are running it correctly?

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps / second and GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which yielded dataset.batch_size=16), training speed dropped to around 0.3 steps / second. Based on these results, it seems to me that parallelization isn't working the way it should?

This seems normal to me. The batch_size you pass as an argument is the effective batch size; it is divided internally across all GPUs. If I understand correctly, it is also normal for steps/sec to drop when you increase the batch size, because each step now has more samples to compute.
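
For intuition, here is a small sketch (the variable names are generic, not AudioCraft internals) of how the effective batch size maps to the per-GPU batch size you see in the logged config, and why samples/sec rather than steps/sec is the fairer comparison between your two runs:

```python
# Illustrative arithmetic only, using the numbers from this thread.
world_size = 8                     # number of GPUs on the node
effective_batch_size = 128         # what you pass as dataset.batch_size
per_gpu_batch_size = effective_batch_size // world_size   # -> 16, as logged by wandb

# Comparing runs by steps/sec alone is misleading; compare samples/sec:
steps_per_sec_small = 1.67         # dataset.batch_size=8  (1 sample per GPU)
steps_per_sec_large = 0.30         # dataset.batch_size=128 (16 samples per GPU)
print(steps_per_sec_small * 8)     # ~13 samples/sec
print(steps_per_sec_large * 128)   # ~38 samples/sec, so the larger batch is faster per sample
```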

Have you tried plotting convergence curves for the different batch sizes?

Roughly how long did the original AudioSeal model take to train

Original training took 3-10 days to obtain good results on a 4-GPU machine, but after 20-40 hours you could already see it converging.

@hadyelsahar
Hello, thank you very much for your work. Are there more details about the training? The 400k-hour VoxPopuli dataset is too large for me; I hope to verify the watermarking effect on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in hours. Thank you again.

but there is no effect.

It would help a lot if you could share your evaluation metrics; you can find them in the dora log directory, in ./history.json.
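
If it helps, here is a minimal sketch for inspecting that file (assuming history.json is a JSON list with one entry of metrics per epoch; the exact key names depend on your run):

```python
import json

# Load the metrics written to the dora experiment folder.
with open("./history.json") as f:
    history = json.load(f)

# Print loss/accuracy-like validation metrics per epoch (key names are an assumption here).
for epoch, entry in enumerate(history, start=1):
    valid = entry.get("valid", {})
    print(epoch, {k: v for k, v in valid.items() if "loss" in k or "acc" in k})
```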

The 400k-hour VoxPopuli dataset is too large for me.

Note that in AudioCraft an epoch is just a predefined number of steps, not a pass over the whole training set; we set the default to 2000 steps. So the size of your training data basically doesn't affect the time taken per epoch; it just affects the pool of samples your training draws from.

updates_per_epoch: 2000

We don't use the full 400k hours of VoxPopuli; we select 5k hours, with which you can reach good performance in about 80-100 epochs. We let our runs go to 200-300 epochs.
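
As a rough back-of-the-envelope (the batch size and clip length below are placeholders, not the exact AudioSeal recipe), this is why dataset size does not change how long an epoch takes:

```python
# Placeholder numbers, only to illustrate the fixed-steps notion of an epoch.
updates_per_epoch = 2000            # fixed number of steps per epoch (see config above)
effective_batch_size = 32           # assumption: set this to whatever you train with
clip_seconds = 1.0                  # assumption: length of each training clip

clips_per_epoch = updates_per_epoch * effective_batch_size       # 64,000 clips
hours_sampled_per_epoch = clips_per_epoch * clip_seconds / 3600  # ~17.8 hours of audio

# Epoch duration depends only on these numbers (and your hardware); the dataset
# size (200 h, 5k h, 400k h, ...) only changes how diverse the sampled pool is.
print(clips_per_epoch, round(hours_sampled_per_epoch, 1))
```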

I think the training could be made a bit more efficient indeed, but we have not focused on it that much...

Hello, thank you very much for your work. Are there more details about the training? The 400k-hour VoxPopuli dataset is too large for me; I hope to verify the watermarking effect on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in hours. Thank you again.

@Comedian1926, if you want to study watermark training at a smaller scale, what you can do is focus on some augmentations and remove the compression ones -- for those, we need to transfer to CPU, save in the new format, load, and transfer back to GPU, so they take a lot of time.
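
To see why those augmentations dominate the step time, here is a minimal, self-contained sketch (not AudioCraft's actual augmentation code; the codec encode/decode itself is omitted, so this only times the device round trip that the compression augmentations require):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
wav = torch.randn(16, 1, 16000, device=device)  # a batch of 1-second clips at 16 kHz

def noise_aug(x):
    # Runs entirely on the GPU: cheap.
    return x + 0.01 * torch.randn_like(x)

def codec_roundtrip_aug(x):
    # Stand-in for a compression augmentation: move to CPU, where the real
    # pipeline would encode/decode with an external codec, then move back.
    x_cpu = x.cpu()
    # ... codec encode + decode would happen here ...
    return x_cpu.to(device)

for name, fn in [("noise", noise_aug), ("codec round trip", codec_roundtrip_aug)]:
    start = time.perf_counter()
    for _ in range(50):
        fn(wav)
    if device == "cuda":
        torch.cuda.synchronize()
    print(name, f"{time.perf_counter() - start:.3f}s")
```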

What we observed during training is that the detection (and localization) accuracy increases very fast, in 10 epochs or even less. For the rest of the epochs, all metrics increase at a steady rate (notably the audio quality metrics).
Here is an example of some of the validation metrics (each point here covers 10 epochs, since we computed validation metrics every 10 epochs -- so 20 means 200 epochs).
[attached image: validation metric curves]

@pierrefdz @hadyelsahar Thank you very much for your reply; it is very useful to me. In my previous training, the d_loss mainly stayed between 1.98 and 2, and I feel it did not converge. I am currently restarting the training and will share the logs with you, hoping it succeeds. Thank you again for your work~

@hadyelsahar @pierrefdz
Hello, I've trained another model, but it still doesn't seem to be converging.
My hardware is 2x RTX 3090.
The training data is the VoxPopuli 10k en subset.
I've also experimented with adjusting the learning rate and batch size on a single card, but it didn't yield satisfactory results.
Here are the hyperparameters and logs from the training:
history.json
hyperparams.json
spec_7 (2).pdf
I appreciate any advice you can offer. Thank you in advance.