ZhangXInFD / soundstorm-speechtokenizer

Implementation of SoundStorm built upon SpeechTokenizer.


Does SoundStorm perform noticeably better than the original VALL-E NAR?

Jiang-Stan opened this issue

Hi, thanks for your excellent work!

As listed in your repo, replacing the VALL-E NAR with soundstorm-speechtokenizer improves performance (speaker similarity) a lot. But when I try the demos on your web page, it is hard to tell the difference between the samples with and without SoundStorm.

So I am wondering whether the proposed soundstorm-speechtokenizer can also improve the failure cases of the original USLM.

Thanks a lot in advance, hoping to hear your reply!

Yes, SoundStorm improves more on the other cases. The model weights trained on LibriSpeech will be released next week. You can try it then.

Thanks for your reply!

I am trying to reproduce your experiment in Chinese. I save the tokens generated by the VALL-E tokenizer.py from a lhotse cutset (h5 file) into numpy format and generate train_file_list.txt, but the model seems to fail to converge. The loss dropped from 6.8 to ~5.4 and does not drop any further. Is this a normal phenomenon, or did I miss some details?
[image: training loss curve]

An example of the codes generated by SpeechTokenizer is shown below:
[image: SpeechTokenizer token codes]
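For reference, here is a minimal sketch of that pre-processing step, i.e. extracting SpeechTokenizer RVQ tokens and writing a train_file_list.txt. It follows the loading/encoding calls shown in the SpeechTokenizer README; the paths, output layout and file-list format are assumptions, not the repo's official script.

```python
# Minimal sketch: extract SpeechTokenizer RVQ tokens from wav files and write a
# train_file_list.txt for SoundStorm training. Paths and the file-list format are
# illustrative assumptions.
import glob
import os

import numpy as np
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer  # from the SpeechTokenizer repo

config_path = "speechtokenizer_config.json"  # illustrative paths
ckpt_path = "SpeechTokenizer.pt"
wav_dir = "wavs"
out_dir = "tokens"
os.makedirs(out_dir, exist_ok=True)

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path).eval()

file_list = []
with torch.no_grad():
    for wav_path in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
        wav, sr = torchaudio.load(wav_path)
        wav = wav[:1, :]                                 # keep a single channel
        if sr != model.sample_rate:                      # SpeechTokenizer works at 16 kHz
            wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
        codes = model.encode(wav.unsqueeze(0))           # (n_q, 1, T)
        codes = codes.squeeze(1).cpu().numpy()           # (n_q, T); RVQ-0 is semantic, RVQ-1..7 acoustic
        out_path = os.path.join(out_dir, os.path.basename(wav_path).replace(".wav", ".npy"))
        np.save(out_path, codes)
        file_list.append(out_path)

with open("train_file_list.txt", "w") as f:
    f.write("\n".join(file_list))
```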

The training steps are too few; the loss may continue to decline if you keep training. Also, initializing the SoundStorm embedding layer with the tokenizer's codebook is very important for convergence. By the way, do you train SoundStorm on EnCodec's tokens and use the first layer of EnCodec's RVQ as the condition? That seems infeasible in theory, since SoundStorm is meant to produce acoustic tokens conditioned on semantic tokens, but semantic and acoustic information are not disentangled in EnCodec.
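To make that initialization concrete, here is a sketch of copying the tokenizer's codebooks into a SoundStorm-style embedding table. The attribute path into SpeechTokenizer's residual VQ and the `token_emb` name on the SoundStorm side are assumptions that should be checked against the actual modules.

```python
import torch

@torch.no_grad()
def init_soundstorm_embeddings_from_tokenizer(soundstorm, tokenizer):
    """Copy each RVQ codebook into the corresponding SoundStorm embedding table."""
    # `soundstorm.token_emb` is a placeholder for a list of nn.Embedding, one per RVQ level;
    # the codebook path follows the EnCodec-style quantizer inside SpeechTokenizer.
    for q, emb in enumerate(soundstorm.token_emb):
        codebook = tokenizer.quantizer.vq.layers[q]._codebook.embed  # (codebook_size, dim)
        emb.weight[: codebook.shape[0]].copy_(codebook)  # extra rows (e.g. mask/pad ids) stay random
```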

Thanks a lot for your help!

I have checked that the SoundStorm embedding layer is successfully initialized from SpeechTokenizer's codebook, and the training tokens are all generated by SpeechTokenizer (which has already improved the performance of my VALL-E-based TTS).

After increasing the batch size (to 64), I can see the loss decreasing.
[image: training loss curve with batch size 64]

But training is very slow... It took me about 12 hours to train 13,000 iters last night (grad_accum_step=1). How long did it take you to train 400k iters?

In my experiment it takes nearly 4.5 h per 10k iters. I used 2 RTX 3090 GPUs with batch_size=8 per GPU and grad_accum_step=2. By the way, I'm uncertain how well SpeechTokenizer performs on Chinese, since both SpeechTokenizer and its semantic teacher, HuBERT, are trained purely on English. I'm really interested in your experimental results on Chinese. Could you let me know once you have more findings?
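For comparison, here is the arithmetic behind the two setups above (assuming the 64-batch run used a single GPU, which the thread does not state):

```python
# Standard effective-batch formula for DDP + gradient accumulation, plus a rough
# projection of total training time from the 4.5 h / 10k-iter figure above.
def effective_batch(per_gpu, n_gpus, grad_accum):
    return per_gpu * n_gpus * grad_accum

print(effective_batch(8, 2, 2))    # maintainer's setup: 32 samples per optimizer step
print(effective_batch(64, 1, 1))   # 64-batch run (single GPU assumed): 64

hours_per_10k = 4.5
print(400_000 / 10_000 * hours_per_10k)  # ~180 h (~7.5 days) to reach 400k iters at that rate
```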

In my Chinese TTS experiment on WenetSpeech-M (VALL-E codebase), the NAR top-10 accuracy increases from ~62 to ~72 after replacing EnCodec with SpeechTokenizer. The improvement brought by SpeechTokenizer is remarkable even without Chinese fine-tuning.
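For anyone reproducing this, the top-10 accuracy above can be computed generically as below; whether the VALL-E codebase masks padding or averages per utterance is not shown here.

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    """logits: (N, vocab_size) NAR predictions, targets: (N,) ground-truth acoustic code ids."""
    topk = logits.topk(k, dim=-1).indices               # (N, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)  # (N,) True if the target is in the top k
    return hits.float().mean().item()
```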

I also noticed that HuBERT is trained on English only, so I tried to fine-tune SpeechTokenizer with a Chinese HuBERT (from Tencent) on WenetSpeech, but it has not converged yet... I'm not very familiar with adversarial training.
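For context, swapping the semantic teacher mainly changes the distillation term of SpeechTokenizer's loss. A simplified sketch of that term with a Chinese HuBERT teacher is below; the teacher checkpoint name, layer choice, projection and loss form are assumptions loosely following the SpeechTokenizer paper, and the reconstruction and adversarial losses (the hard part of this fine-tuning) are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import HubertModel

# Chinese HuBERT released by TencentGameMate on the Hugging Face Hub (checkpoint name assumed).
teacher = HubertModel.from_pretrained("TencentGameMate/chinese-hubert-base").eval()

def semantic_distill_loss(rvq0_features: torch.Tensor, wav_16k: torch.Tensor,
                          proj: torch.nn.Linear, layer: int = 9) -> torch.Tensor:
    """rvq0_features: (B, T, D) output of SpeechTokenizer's first RVQ layer path;
    wav_16k: (B, samples) raw 16 kHz audio; proj maps D to the teacher's hidden size."""
    with torch.no_grad():
        hidden = teacher(wav_16k, output_hidden_states=True).hidden_states[layer]  # (B, T', H)
    student = proj(rvq0_features)                        # (B, T, H)
    T = min(student.shape[1], hidden.shape[1])           # crude length alignment
    return (1 - F.cosine_similarity(student[:, :T], hidden[:, :T], dim=-1)).mean()
```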