[Feature] Why use a GroupVQ rather than a simple VQ?

Question

BridgetteSong opened this issue a month ago

Have you tried compressing mels into a token sequence of shape [L, 1] with a simple VQ such as VQ-VAE or FSQ, rather than a GroupVQ? If you have results from such experiments, what were the reasons for choosing GVQ?

Leng Yue · Answer 1 · Tue May 14 2024 15:55:55 GMT+0800 (China Standard Time)

GVQ can carry more information than a naive single-codebook VQ.

BridgetteSong · Answer 2 · Tue May 14 2024 16:05:30 GMT+0800 (China Standard Time)

To keep more information in the VQ, we could increase the codebook size, e.g. from 1024 to 8192 as in Tortoise, because a token sequence of shape [L, 1] is easy for LM training and optimization. Have you tried a simple VQ for FishSpeech training, or was GVQ the only option used from the beginning?

Leng Yue · Answer 3 · Tue May 14 2024 19:30:06 GMT+0800 (China Standard Time)

Two codebooks of 1024 entries each are equivalent to one codebook of 1024 × 1024 ≈ 1M entries. That is not the same as a single codebook with 2048 entries.
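
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch, not code from FishSpeech: the function names are made up for this example, and it assumes each group picks its index independently of the others, so that every combination of indices is a distinct code.

```python
import math

def combined_codebook_size(num_groups: int, entries_per_group: int) -> int:
    """Number of distinct codes when each group picks one entry independently."""
    return entries_per_group ** num_groups

def bits_per_frame(num_groups: int, entries_per_group: int) -> float:
    """Upper bound on information carried by one frame, in bits."""
    return num_groups * math.log2(entries_per_group)

def flatten_indices(indices, entries_per_group: int) -> int:
    """Map one index per group to the matching index in the combined codebook."""
    flat = 0
    for idx in indices:
        if not 0 <= idx < entries_per_group:
            raise ValueError(f"index {idx} out of range")
        flat = flat * entries_per_group + idx
    return flat

print(combined_codebook_size(2, 1024))  # 1048576 (two groups of 1024)
print(bits_per_frame(2, 1024))          # 20.0 bits per frame
print(bits_per_frame(1, 2048))          # 11.0 bits per frame
print(bits_per_frame(1, 8192))          # 13.0 bits per frame
print(flatten_indices((3, 5), 1024))    # 3077
```

Under that assumption, two groups of 1024 give up to 20 bits per frame, while a single codebook of 8192 entries gives 13 bits, so enlarging a single codebook closes only part of the gap.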