fishaudio / fish-speech

Brand new TTS solution

Home Page: https://speech.fish.audio

[Feature] Why use a GroupVQ instead of a simple VQ?

BridgetteSong opened this issue

Could you try compressing mels into a token sequence of shape [L, 1] with a simple VQ such as VQ-VAE or FSQ, instead of a GroupVQ? If you already have results from such an experiment, what were the reasons for choosing GVQ?

GVQ can carry more information per frame than a naive single-codebook VQ.
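
To make the comparison concrete, here is a minimal sketch of group VQ in plain Python. It is not the FishSpeech implementation: the function names (`nearest`, `group_vq_encode`, `group_vq_decode`), the random codebooks, and the tiny sizes (2 groups, 4 entries each, 8-dim latent) are all assumptions chosen for illustration. The latent vector is split into equal chunks and each chunk is quantized against its own codebook, so one frame becomes a tuple of indices rather than a single index.

```python
import random

def nearest(codebook, vec):
    # Index of the codebook entry closest to vec (squared Euclidean distance).
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def group_vq_encode(codebooks, vec):
    # Split vec into len(codebooks) equal chunks; quantize each chunk
    # with its own codebook. Returns one index per group.
    size = len(vec) // len(codebooks)
    return [
        nearest(cb, vec[k * size:(k + 1) * size])
        for k, cb in enumerate(codebooks)
    ]

def group_vq_decode(codebooks, indices):
    # Concatenate the selected entry from each group's codebook.
    out = []
    for cb, i in zip(codebooks, indices):
        out.extend(cb[i])
    return out

# Hypothetical toy setup: 2 groups, 4 entries per group, 4 dims per group.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1) for _ in range(4)] for _ in range(4)]
    for _ in range(2)
]
latent = [rng.gauss(0, 1) for _ in range(8)]

indices = group_vq_encode(codebooks, latent)    # two indices, one per group
reconstruction = group_vq_decode(codebooks, indices)
```

With a simple VQ the same frame would map to a single index into one codebook over the full 8-dim vector.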

To keep more information in a simple VQ, we could increase the codebook size, e.g. from 1024 to 8192 as Tortoise does, because a token sequence of shape [L, 1] is easier for LM training and optimization. Have you tried a simple VQ for FishSpeech training, or was GVQ the only option used from the beginning?

Two codebooks of 1024 entries each are equivalent to one codebook of 1024 * 1024 ≈ 1M entries. That is not the same as a single codebook of 2048 entries.
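
The arithmetic behind this can be checked in a few lines of Python. The helper names below are made up for this sketch; it only counts the distinct index combinations per frame and the corresponding bits, assuming every combination is usable.

```python
import math

def effective_codebook_size(entries_per_group: int, num_groups: int) -> int:
    # Number of distinct index tuples one frame can take.
    return entries_per_group ** num_groups

def bits_per_frame(entries_per_group: int, num_groups: int) -> float:
    # Upper bound on information per frame, in bits.
    return num_groups * math.log2(entries_per_group)

gvq_size = effective_codebook_size(1024, 2)   # 1048576 combinations
gvq_bits = bits_per_frame(1024, 2)            # 20.0 bits
vq_2048_bits = bits_per_frame(2048, 1)        # 11.0 bits
vq_8192_bits = bits_per_frame(8192, 1)        # 13.0 bits
```

So even the 8192-entry single codebook mentioned above tops out at 13 bits per frame, against 20 bits for two 1024-entry groups.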