ai-forever / Kandinsky-3

Home Page: https://ai-forever.github.io/Kandinsky-3/

MoVQ implementation question

kebijuelun opened this issue

I have some questions regarding the implementation of MoVQ and would appreciate your clarification.

The original MoVQ paper mentions that a multi-channel VQ is adopted:
[figure: multi-channel quantization scheme from the MoVQ paper]
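
For concreteness, here is a minimal sketch of that quantization step as I understand it from the paper; the group count, codebook shape, and function name are illustrative, not taken from this repo:

import torch

def multichannel_vq(z, codebook, num_groups=4):
    # z: (B, C, H, W) continuous latents; codebook: (K, C // num_groups) embeddings
    # shared across the channel groups, as described in the paper.
    B, C, H, W = z.shape
    d = C // num_groups
    # Split channels into groups and flatten every spatial position into a d-dim vector.
    z_flat = z.reshape(B, num_groups, d, H, W).permute(0, 1, 3, 4, 2).reshape(-1, d)
    # Nearest-neighbour lookup in the codebook (L2 distance).
    indices = torch.cdist(z_flat, codebook).argmin(dim=1)
    z_q = codebook[indices]
    # Restore the (B, C, H, W) layout.
    z_q = z_q.reshape(B, num_groups, H, W, d).permute(0, 1, 4, 2, 3).reshape(B, C, H, W)
    return z_q, indices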

However, the implementation of kandinsky3 does not involve any vector quantization operation:

import torch
from torch import nn

# Encoder and Decoder are defined elsewhere in the Kandinsky-3 repo.
class MoVQ(nn.Module):

    def __init__(self, generator_params):
        super().__init__()
        z_channels = generator_params["z_channels"]
        self.encoder = Encoder(**generator_params)
        self.quant_conv = torch.nn.Conv2d(z_channels, z_channels, 1)
        self.post_quant_conv = torch.nn.Conv2d(z_channels, z_channels, 1)
        self.decoder = Decoder(zq_ch=z_channels, **generator_params)

    @torch.no_grad()
    def encode(self, x):
        h = self.encoder(x)
        h = self.quant_conv(h)  # 1x1 conv only -- no codebook lookup
        return h

    @torch.no_grad()
    def decode(self, quant):
        decoder_input = self.post_quant_conv(quant)
        decoded = self.decoder(decoder_input, quant)  # latent is also passed as zq conditioning
        return decoded

May I ask whether this is a misunderstanding on my part regarding MoVQ, or whether Kandinsky-3 has modified the MoVQ implementation?

I've noticed the same thing. It appears that the authors simply reused the encoder and decoder architecture from MoVQGAN and retrained it as a VAE, without the quantization step. I'm not sure whether my understanding is correct, so I would appreciate clarification from the authors.

After training, each token of the encoder's output ends up very close to some vector in the codebook. I tried adding a VQ step to the encoder's output and found that the MSE between the features before and after quantization is small, so decoding with and without VQ produces nearly the same results.
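
A rough sketch of that check, assuming a codebook tensor of shape (K, z_channels) is available (fitted separately, e.g. by k-means over encoder outputs; the released Kandinsky-3 checkpoint does not ship one, which is the point of this issue):

import torch
import torch.nn.functional as F

@torch.no_grad()
def vq_gap(movq, x, codebook):
    # movq: the MoVQ module quoted above; x: a batch of images.
    h = movq.encode(x)                                 # (B, C, H, W) continuous latents
    B, C, H, W = h.shape
    h_flat = h.permute(0, 2, 3, 1).reshape(-1, C)      # one C-dim vector per spatial position
    idx = torch.cdist(h_flat, codebook).argmin(dim=1)  # nearest codebook entry
    h_q = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    latent_mse = F.mse_loss(h_q, h)                    # gap introduced by quantization
    pixel_mse = F.mse_loss(movq.decode(h_q), movq.decode(h))
    return latent_mse, pixel_mse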