rishikksh20 / SoundStorm-pytorch

Google's SoundStorm: Efficient Parallel Audio Generation

Problems with SoundStorm

rishikksh20 opened this issue

Have trained the update_v2 branch on:

  • Semantic tokens extracted from HuBERT Large, layer 16, with a 1024-cluster k-means model (50 tok/sec).
  • Acoustic tokens extracted from EnCodec at 24 kHz sample rate, 240 hop length, with the 8-codebook config from here (100 tok/sec).
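For context, a rough sketch of the acoustic-token extraction flow with the stock encodec package; note that the off-the-shelf 24 kHz model uses a 320-sample hop (75 tok/sec), while the 240-hop / 100 tok/sec setup above comes from the linked custom config, so treat the bandwidth and paths below as placeholders.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Stock 24 kHz EnCodec: hop 320 -> 75 tok/sec; 6 kbps -> 8 codebooks.
# The file path and bandwidth here are placeholders.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [B, n_q, T]
```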

Results:
The output is not as desired; here is the sample (the first 6 seconds are the prompt).

This thread serves as a potential issue tracker and solution log.

Hi, I think the sample rate and hop_size of HuBERT and EnCodec should be the same, as stated in SoundStorm's paper:
[screenshot of the relevant passage from the SoundStorm paper]

So maybe a 24 kHz, 240 hop_size HuBERT is needed? Or you could find a codec that matches the sample rate and hop_size of the HuBERT you used.

By the way, I'm trying https://github.com/yangdongchao/SoundStorm and want to use the mHuBERT described in https://github.com/yangdongchao/SoundStorm/tree/master/data_process/semantic_token together with HiFi-Codec-16k-320d, which are both 16k-320d. (I'm refactoring yangdongchao/SoundStorm now and haven't started my preprocessing and training yet.)

Of course, yangdongchao/SoundStorm has a lot of differences from the paper...

Hi @yt605155624, they are talking about the token frame rate rather than the audio sample rate: we can condition 24 kHz acoustic tokens on 16 kHz semantic tokens as long as their respective token frame rates are the same. It is still better to derive both semantic and acoustic tokens from audio at the same sample rate; I already asked here for a 24 kHz semantic encoder.

But in my case the semantic tokens are 50 tok/sec and the acoustic tokens are 100 tok/sec, so I just repeat each semantic token 2 times to match the acoustic token rate (a minimal sketch of this repeat follows the list below).
I have implemented the code as described in the paper; potential issues might be the following:

  1. Semantic feature extraction and clustering.
  2. Model weight initialization.
  3. Training parameters like the optimizer and scheduler; the paper does not describe these training mechanisms well.
  4. Sampling logic.
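The 2x repeat mentioned above can be as simple as the following sketch (shapes and names are illustrative):

```python
import torch

# Semantic tokens at 50 tok/sec, shape [B, T]; repeating each token twice
# aligns them with 100 tok/sec acoustic tokens. Shapes here are dummy values.
semantic_tokens = torch.randint(0, 1024, (1, 500))          # ~10 s at 50 tok/sec
upsampled = semantic_tokens.repeat_interleave(2, dim=-1)    # shape [1, 1000]
assert upsampled.shape[-1] == 2 * semantic_tokens.shape[-1]
```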

@yt605155624 yes, yangdongchao/SoundStorm is an entirely different codebase; I don't think they have implemented SoundStorm's training mechanism yet. My implementation is very close to the paper. I may have made some silly mistake in the logic, otherwise the code should work properly; I will debug the code and update you here.

Thanks

@rishikksh20 I am also training a slightly modified version of the update_v2 branch. It's been only 100k steps; the generate function output is similar to yours, but a single forward pass with greedy sampling produces better-sounding audio. I think the problem is with the sampling code: you modified a MaskGIT implementation to accommodate SoundStorm inference, but I think it needs improvement, since in my experience the MaskGIT implementation you used already has problems with its generate sampling logic. If you can work on the generate function's sampling logic, the output may improve.
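To make clear what I mean by MaskGIT-style sampling, here is a minimal sketch of confidence-based iterative decoding for a single RVQ level; the `model(tokens, cond)` signature and all names are illustrative assumptions, not the actual generate function in this repo.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def maskgit_decode(model, cond, seq_len, codebook_size, mask_id, steps=16, temperature=1.0):
    """Confidence-based iterative decoding for one RVQ level.

    `model(tokens, cond)` is assumed to return logits of shape [B, T, codebook_size];
    `cond` holds the (upsampled) semantic tokens. All names are illustrative.
    """
    tokens = torch.full((cond.shape[0], seq_len), mask_id, dtype=torch.long, device=cond.device)

    for step in range(steps):
        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / steps)
        num_remask = int(mask_ratio * seq_len)

        logits = model(tokens, cond)                                   # [B, T, V]
        probs = F.softmax(logits / max(temperature, 1e-5), dim=-1)
        sampled = torch.multinomial(probs.view(-1, codebook_size), 1).view(tokens.shape)
        confidence = torch.gather(probs, -1, sampled.unsqueeze(-1)).squeeze(-1)

        # Already-fixed positions never get re-masked.
        still_masked = tokens == mask_id
        confidence = torch.where(still_masked, confidence, torch.full_like(confidence, float("inf")))

        # Commit the sampled tokens, then re-mask the least-confident positions.
        tokens = torch.where(still_masked, sampled, tokens)
        if num_remask > 0:
            remask_idx = confidence.topk(num_remask, dim=-1, largest=False).indices
            tokens.scatter_(1, remask_idx, mask_id)

    return tokens
```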

And can I know:

  1. How many steps did you train the model for?
  2. What dataset did you use?

Thanks

I trained for only 100k steps and used the LibriTTS-100 clean dataset, just for a logic check. Can you share your sample with greedy-only decoding? And what semantic encoder are you using?

@bharani-y I already anticipated that the sampling logic might be faulty. I will re-check it.
Thanks

The following is a single-step, greedy-only decoding output at around 80k steps: link

Note: My training code is slightly modified from your code.

I think I used the same semantic encoder as you. The sampling logic is not working as intended, and the output should improve once it is fixed.

Yeah, the audio is quite gibberish; maybe training longer will improve quality, or maybe greedy sampling is not a solution at all. I also got a similar result with greedy decoding. It is possible that the model does not work at all and, in greedy inference, it just randomly predicts acoustic codes.
Can you share what you changed in the training code? And
what is the value range of the cross-entropy loss in your training?

Greedy is not that great, since 100k steps is too early for good results; even then, I don't think greedy is the right solution. I just use it to test the model's raw performance.
My model is small, with just 8 layers and 3072 dimensions, due to my low GPU memory, and I use a different scheduler than yours.
My training loss is around 2.05 and token accuracy is around 0.64; I think both can improve.
What is your current training loss? And do you have any idea where the sampling code is failing?

Yes, greedy is not enough, because this task is a many-to-many mapping, which is quite hard for a single-pass greedy solution.
My training loss is around 5 (ideally the loss should be less than 1.0), which is quite high. I used the default MaskGIT training mechanism, which I know is not going to work properly. But I think the real issue might be the semantic token extraction, as I trained the HuBERT k-means from scratch rather than using the provided pre-trained models.
I am also planning to do some experiments with training parameters, such as:

  1. Use an inverse square root scheduler.
  2. Use Adafactor with an inverse square root scheduler.
  3. Use label smoothing.
  4. Use FS2's Noam scheduler.

Can you share your training parameters (optimizer, scheduler, learning rate, and batch size)? I will include those in my experimentation config list as well.
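For reference, a minimal sketch of the inverse-square-root schedule with warmup from the list above, built on PyTorch's LambdaLR; the warmup length and optimizer settings are placeholders, not this repo's defaults.

```python
import torch

def inverse_sqrt_schedule(optimizer, warmup_steps=4000):
    # Linear warmup to the base lr, then decay proportional to 1/sqrt(step).
    def lr_lambda(step):
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage (settings are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.98), eps=1e-9)
# scheduler = inverse_sqrt_schedule(optimizer, warmup_steps=4000)
# ...call scheduler.step() once per optimizer step.
```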

I got similar results with my own SoundStorm implementation. It did figure out the silence (the pauses) from the semantic tokens, and sometimes it would predict single sounds correctly (like "it"), but most of it would sound exactly like @rishikksh20's sample.

I was using 16 steps of decoding and training only on the 1st acoustic token to simplify the task significantly. How much data are you training on? And how large are the models?

PS. I dived into some of the NAR (non-autoregressive) machine translation papers and the consensus was that training a NAR model (and they use even more "tricks" than SoundStorm) is significantly more difficult than AR. And the biggest trick they all used was to train an AR model first and then generate an alternative "teacher" training set with it...

@rishikksh20 Those are some interesting tricks to try. I am using the Adam optimizer, lr=0.0001, scheduler=ExponentialLR, batch size=8 (I know it is low, but I only have a 3060 GPU). I just left the training to run and will see how it goes.

Greedy sampling is not the right way to go, but MaskGIT-style sampling, if implemented properly, may improve results.

@jpc yes, you are right, NAR models are much more difficult to train than AR models, but AR models are highly memory intensive, and due to my current limitations I am only trying NAR. Your idea of training an AR model first and then generating a 'teacher' training set seems interesting; can you elaborate on the process a little bit?

@bharani-y A teacher training set comes from a knowledge distillation method called teacher-student learning: we first train a large teacher model to learn the probability distribution of the complex data, then generate synthetic data from the teacher model, and finally train a small student model on that data. It is similar to how people now finetune open-source LLMs on data generated from ChatGPT.
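As a very rough sketch of that pipeline (all names here are hypothetical placeholders, not code from this repo):

```python
# All objects and names below are hypothetical placeholders.
def build_distillation_set(ar_teacher, semantic_dataset):
    """Re-label the corpus with the trained AR teacher's own outputs."""
    pairs = []
    for semantic_tokens in semantic_dataset:
        acoustic_tokens = ar_teacher.generate(semantic_tokens)   # AR decoding
        pairs.append((semantic_tokens, acoustic_tokens))
    return pairs

def train_student(nar_student, pairs, train_step):
    """Train the NAR student on teacher-generated targets, which are
    less multi-modal than the ground-truth acoustic tokens."""
    for semantic_tokens, acoustic_tokens in pairs:
        train_step(nar_student, semantic_tokens, acoustic_tokens)
```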

PS. I dived into some of the NAR (non-autoregressive) machine translation papers and the consensus was that training a NAR model (and they use even more "tricks" than SoundStorm) is significantly more difficult than AR. And the biggest trick they all used was to train an AR model first and then generate an alternative "teacher" training set with it...

Yes, semantic-to-acoustic token mapping is many-to-many, so it is not going to train easily with NAR; Spear-TTS's semantic-to-acoustic model is an AR version of SoundStorm. Both quality and quantity of data, training parameters, and sampling methods are very important when training NAR models; I am currently investigating that.

In my first training run, I trained the model on the LibriTTS 100-hour clean dataset, and the model config is similar to the paper's.

Thanks for the clarification. I am currently training on the LibriSpeech dataset. I am also planning to look into training parameters, but I am busy with other work and will let the current training continue in the background to see how longer training improves the model.

@bharani-y are you using the same dataloader and variable random window as in my repo?
Also, which semantic tokenizer are you using, and how many clusters are in your semantic dataset?

I use my own dataloader, but the core logic behind it is almost the same. I used the HuBERT base k-means cluster model available in fairseq; I think you can get the cluster details there. I know the tokens/sec is different, so I used token upsampling to match the sample length; other than that, the dataloader logic is similar to yours. Do you have any ideas about improving the sampling logic?

I let the training run for more than 350k steps, but there is no improvement in the generate function output, and greedy is also worsening. I think the initial greedy result may have been due to random tokens just aligning by chance, since I am not able to repeat the result in later epochs. Any progress on modifying the sampling logic?

Have you considered using the token-critic method? It may help improve the output.

Currently I am training the model on the large LibriTTS dataset from here: https://huggingface.co/datasets/collabora/whisperspeech/tree/main

I use my own dataloader, but the core logic behind it is almost the same. I used the HuBERT base k-means cluster model available in fairseq; I think you can get the cluster details there. I know the tokens/sec is different, so I used token upsampling to match the sample length; other than that, the dataloader logic is similar to yours. Do you have any ideas about improving the sampling logic?

From my experiment, I confirm that just doing some very naive semantic-token upsampling from HuBERT (50 Hz) to match EnCodec (75 Hz) works and can produce some reasonable voices. My own implementation trained on LJSpeech can produce audio where ASR can tell what is being said, but the quality is still very low. It is a sign that simple 1.5x upsampling is at least a solution. My implementation is based on lucidrains' WIP repo, and the current blocker is that the model still has low training accuracy (60% top-10 on masked tokens during training).

below I provide the core code and one sample, which I think is very close to the paper's description
https://github.com/feng-yufei/shared_debugging_code/blob/main/soundstorm.py, hope it can be useful

@rishikksh20 Great, let us know how it performs. Are you using the same optimizer and scheduler as before?

@feng-yufei The sample sounds much better than mine. I also tried lucidrains' implementation; his generate function differs from the procedure described in the SoundStorm paper, but judging from your sample it seems to work too. Have you tried training multi-speaker models or using the same LJSpeech model to generate other speakers' voices? I want to know the multi-speaker performance of the above implementation. Can you also share more details about the training, like the optimizer, model config, and your final loss?

Thanks

From my experiment, I confirm that just doing some very naive semantic-token upsampling from HuBERT (50 Hz) to match EnCodec (75 Hz) works and can produce some reasonable voices. My own implementation trained on LJSpeech can produce audio where ASR can tell what is being said, but the quality is still very low. It is a sign that simple 1.5x upsampling is at least a solution. My implementation is based on lucidrains' WIP repo, and the current blocker is that the model still has low training accuracy (60% top-10 on masked tokens during training).

Hi @feng-yufei,
Yes, you can generate 24 kHz acoustic tokens conditioned on 16 kHz semantic tokens; just make sure both have a similar hop length, since we can bridge different hop lengths through upsampling.
To get good speech quality you need to train the model on a larger dataset; these big LLM-style models are data hungry. You can train the same model on this data: https://huggingface.co/datasets/collabora/whisperspeech/tree/main
It is already pre-processed; the semantic tokens are 50 tok/sec and the acoustic tokens are 75 tok/sec.

below I provide the core code and one sample, which I think is very close to the paper's description
https://github.com/feng-yufei/shared_debugging_code/blob/main/soundstorm.py, hope it can be useful

@feng-yufei the sample is very good. Can you share your training logic and dataloader code? I guess the basic logic of lucidrains' code and mine is similar; the only places I am struggling are the training and dataloader scripts, which are not properly described in the SoundStorm paper. By the way, what cross-entropy loss are you getting, and how long did you train the model to generate this sample?

Just providing a little more detail here: I use the hubert_base_ls960 checkpoint and do the 1024-centroid k-means using the example code in fairseq/hubert. The acoustic tokens are from EnCodec. I am using batch size 48, 1 GPU, gradient accumulation 4, and 500 epochs on LJSpeech. The optimizer is AdamW with lr=0.0002 and exponential lr decay 0.999875, fp32 training. The model has dim 256 and 6 layers in lucidrains' setting (you can refer to the code, though I think lucidrains' transformer initialization is a little bit weird). The cross-entropy loss starts at 3.2 at the end of the first epoch (taking the average over all masked tokens at the selected layer) and gradually falls to 1.8-2.0. To deal with the EnCodec 75 Hz vs. HuBERT 50 Hz mismatch, I upsample the HuBERT embeddings 1.5x (e.g., the 6th acoustic token is aligned with the 4th semantic token, and for non-integer alignment I take a weighted average of the neighbours). For the experiment on a larger dataset I tried LibriTTS 100/360/500 merged together, and the quality is strangely bad (50% top-10 training accuracy, while LJSpeech has 65%). From my experience, increasing the model size and data size does not bring any benefit, so I suspect there are still bugs. I am currently reviewing your update_v2 branch to find something I may have missed. I already tried migrating your conformer code to mine and the result is the same.
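To make the neighbour-weighted 1.5x upsampling concrete, here is a minimal sketch using linear interpolation on the HuBERT features; it illustrates the idea only and is not my actual preprocessing code.

```python
import torch
import torch.nn.functional as F

def upsample_semantic(features: torch.Tensor, src_hz: int = 50, tgt_hz: int = 75) -> torch.Tensor:
    """features: [T, D] HuBERT embeddings -> [round(T * tgt_hz / src_hz), D].

    F.interpolate in 'linear' mode averages the two nearest source frames with
    distance-based weights, i.e. the neighbour-weighted average described above.
    """
    tgt_len = int(round(features.shape[0] * tgt_hz / src_hz))
    x = features.t().unsqueeze(0)                                       # [1, D, T]
    x = F.interpolate(x, size=tgt_len, mode="linear", align_corners=False)
    return x.squeeze(0).t()                                             # [tgt_len, D]
```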

From my experiment, I confirm that just doing some very naive semantic-token upsampling from HuBERT (50 Hz) to match EnCodec (75 Hz) works and can produce some reasonable voices. My own implementation trained on LJSpeech can produce audio where ASR can tell what is being said, but the quality is still very low. It is a sign that simple 1.5x upsampling is at least a solution. My implementation is based on lucidrains' WIP repo, and the current blocker is that the model still has low training accuracy (60% top-10 on masked tokens during training).

Hi @feng-yufei, Yes, you can generate 24 kHz acoustic tokens conditioned on 16 kHz semantic tokens; just make sure both have a similar hop length, since we can bridge different hop lengths through upsampling. To get good speech quality you need to train the model on a larger dataset; these big LLM-style models are data hungry. You can train the same model on this data: https://huggingface.co/datasets/collabora/whisperspeech/tree/main (it is already pre-processed; the semantic tokens are 50 tok/sec and the acoustic tokens are 75 tok/sec).

If I understand correctly, HuBERT and EnCodec have different hop sizes (1:1.5), so we can never align them perfectly using the existing pretrained checkpoints. I am still in the debugging stage, and training on LibriTTS gives bad results.

For the experiment on a larger dataset I tried LibriTTS 100/360/500 merged together, and the quality is strangely bad (50% top-10 training accuracy, while LJSpeech has 65%).

I have also trained on the LibriLight large subset from here: https://huggingface.co/datasets/collabora/whisperspeech/tree/main
After 100k steps at batch size 24 I got top-1 accuracy around 27-30% and top-10 accuracy around 55-63%, and the generated audio is abysmal; there is nothing in the audio but noise. The model also performs very poorly on LibriTTS.

I noticed that the model can generate audio that sounds better than the ground truth (both at 2 quantizers), with the accuracy of the free-running (without teacher forcing) generation at only 4%. My guess is that there is a lot of redundancy in the acoustic tokens and many valid combinations of well-sounding tokens…

For the experiment on a larger dataset I tried LibriTTS 100/360/500 merged together, and the quality is strangely bad (50% top-10 training accuracy, while LJSpeech has 65%).

I have also trained on the LibriLight large subset from here: https://huggingface.co/datasets/collabora/whisperspeech/tree/main After 100k steps at batch size 24 I got top-1 accuracy around 27-30% and top-10 accuracy around 55-63%, and the generated audio is abysmal; there is nothing in the audio but noise. The model also performs very poorly on LibriTTS.

Hi Rishikksh20, your training code is working well with my data pipeline (I modified the code a little to fit my data). For inference I made a new version that combines yours and lucidrains' inference code, and it gives samples even slightly better than what I already had. Code for reference: https://github.com/feng-yufei/shared_debugging_code/blob/main/soundstorm2.py

@rishikksh20 Can you share your training details for the model that gets 63% on LibriLight? How many GPUs are used, and is the batch size 24 in total or per GPU? Is gradient accumulation set to 10 or 1?
Is the accuracy curve growing very slowly above 50%? Are the model parameters and optimizer params (lr) the same as the update_v2 SoundStorm.init defaults (12 layers) or the same as args (24 layers)? I tried a large model with D1024L12H12FF2048 (228M params) on LibriTTS 100/360/500, and the curve seems to stop increasing when it hits 52%. It works perfectly as expected on LJSpeech and hits 76%, though.

Hi @feng-yufei,
Thanks for letting me know my code works for LJSpeech.
For my training, I trained my model on the LibriLight large subset; my model config is the same as the paper, the batch size is around 16, and I randomly trim each sample to 1250 tokens (~17 sec). I use the AdamW optimizer with the same settings as the VITS paper.
I got top-10 accuracy fluctuating between 55 and 63%.
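The random trimming is roughly the following sketch; the pairing of a 1-D semantic sequence with an [n_q, T] acoustic tensor and the names are illustrative, not my actual dataloader.

```python
import torch

def random_trim(semantic: torch.Tensor, acoustic: torch.Tensor, max_len: int = 1250):
    """Crop aligned semantic [T] and acoustic [n_q, T] tokens to one random window."""
    total = semantic.shape[-1]
    if total <= max_len:
        return semantic, acoustic
    start = torch.randint(0, total - max_len + 1, (1,)).item()
    return semantic[start:start + max_len], acoustic[:, start:start + max_len]
```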

@feng-yufei @jpc
I think this guy implemented SoundStorm perfectly: https://github.com/lifeiteng/SoundStorm ; he has shared samples, but no code so far.

We’ll have to take his word for it ;)

I found two things that I think will help a lot. The first is the µP parametrization from Microsoft/OpenAI, which helps to scale training without stability issues. I suspect all the papers we are trying to reproduce used something similar without mentioning it (it's curious that they never mention initialization once).

The second thing is the Vocos vocoder, which recently started supporting EnCodec tokens as inputs (instead of mel spectrograms) and gives very good quality with just 2 quantizer levels (and amazing quality with 4).
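Roughly, decoding EnCodec tokens with Vocos looks like the sketch below, going by its README; treat the checkpoint name, codes_to_features, and the bandwidth_id mapping as assumptions to verify.

```python
import torch
from vocos import Vocos

# Decode EnCodec token IDs to audio with the pretrained EnCodec-conditioned Vocos.
# Checkpoint name, codes_to_features, and the bandwidth_id mapping follow the
# Vocos README as I recall it; verify against the actual package.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

codes = torch.randint(0, 1024, (2, 300))      # [n_q, T]; 2 codebooks stand in for generated tokens
features = vocos.codes_to_features(codes)
bandwidth_id = torch.tensor([0])              # 0 -> 1.5 kbps, i.e. 2 codebooks
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```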

Hi @feng-yufei, Thanks for letting me know my code works for LJSpeech. For my training, I trained my model on the LibriLight large subset; my model config is the same as the paper, the batch size is around 16, and I randomly trim each sample to 1250 tokens (~17 sec). I use the AdamW optimizer with the same settings as the VITS paper. I got top-10 accuracy fluctuating between 55 and 63%.

After some investigation today, I think the code is OK; the problem I had is that with the default params from the paper my training is not very stable, giving suboptimal results. With L12H64D768FF1024 (108M params) I am able to get a top-10 accuracy over 60%, so I think your training code is good to go!

@feng-yufei with the L12H64D768FF1024 (108M params) model where you get a top-10 accuracy over 60%, are you able to get some audible audio?

@jpc I will look at Vocos.

We’ll have to take his word for it ;)

He is training SoundStorm on phoneme sequences rather than semantic tokens; I think he combined VALL-E with SoundStorm, as he also implemented VALL-E before SoundStorm.
I think he replaced VALL-E's neural codec language modelling module with SoundStorm.

Yes, I can get audible audio with the code I shared. Now the only problem is that the larger model (dim 1024) performs worse than dim 768 on LibriTTS, no matter what I do to tune the learning rate and batch accumulation. Can you share how much gradient accumulation and how many GPUs you used?

@feng-yufei I used accum_grad = 4.
I think the problem with LibriTTS at 1024 dim might be underfitting; these LLM-style models follow neural scaling laws, where the model size should increase with the data, otherwise it won't perform well. LibriTTS has at most only about 500 hours of audio, which might not be enough for a 1024-dim model, I guess.

Hi Rishikksh20,
I am now able to get the expected results after adding a proper warm-up schedule. Trained on LibriTTS 500h, the inference code produces good sound on the LibriTTS dev set with a prompt, and with a simple AR text-to-semantic model the entire SoundStorm pipeline can produce audio with a good WER.

Great @feng-yufei,
Please share your updated model code and training parameters and I will update this repo accordingly. I will also train the model on LibriLight and will upload the checkpoint here.
Also, please share your latest sample quality.

Thanks

with a simple AR text-to-semantic model the entire SoundStorm pipeline can produce audio with a good WER.

You can get an AR text-to-semantic model from here: https://github.com/collabora/spear-tts-pytorch

Please share your updated model code and training parameters and I will update this repo accordingly.

Please let me know which scheduler you are using for better sample quality, @feng-yufei.

@feng-yufei Thanks, will compare with my code and update this repo.

VampNet: https://github.com/hugofloresgarcia/vampnet
It is also based on MaskGIT, and some of its components might help us improve this repo.

Hi @feng-yufei, have you tested your model on the LibriLight dataset?

@feng-yufei I am also facing resource issues. I don't think the model will converge on LibriTTS, as non-autoregressive models usually need an extremely large amount of data to converge.
Can you share the best-quality audio you have gotten so far from the LibriTTS training?