nadavbh12 / VQ-VAE

Minimalist implementation of VQ-VAE in PyTorch

Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1

pclucas14 opened this issue · comments

Hi,

I'm trying to improve results on CIFAR. I see you already have some potential improvements in mind. Could you help me understand what you mean by "Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1"? How would you combine the 10 dictionaries during training and testing?

Thanks!
Lucas

Hi Lucas,
This note refers to how the VQ-VAE was actually trained in the paper.
I didn't get that in the first (few) readings, so I confirmed it with the authors.

For ImageNet, the encoder's output is a tensor of size 8x8x64.
If you have only one codebook, then for each of the 64 (=8x8) latents you find its nearest neighbor in the codebook, build a new 8x8x64 tensor from the selected codebook vectors, and pass it on to the decoder.
For CIFAR10, where you have 10 codebooks, the encoder's output is a tensor of size 10x8x8x64.
Iterating over the first dimension, each of the 10 slices has its 64 (=8x8) latents matched against that slice's own codebook.
This way, every spatial location is described by 10 codebook indices instead of 1, so it can pack more information.
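
To make the shapes concrete, here is a rough sketch of the per-codebook lookup described above. It is not the code from this repo or from the paper: the function name `quantize`, the batch-first layout (B, K, H, W, D), and the codebook size of 512 are all assumptions chosen for illustration. It only does the nearest-neighbor lookup and leaves out the losses and gradient handling needed for training.

```python
import torch


def quantize(z, codebooks):
    """Nearest-neighbor lookup with one codebook per slice (illustrative sketch).

    z:         (B, K, H, W, D) encoder output, K = number of codebooks (assumed layout)
    codebooks: (K, N, D), N entries of dimension D per codebook
    Returns the quantized tensor (same shape as z) and indices of shape (B, K, H, W).
    """
    B, K, H, W, D = z.shape
    # Group all latents that belong to the same codebook: (K, B*H*W, D)
    flat = z.permute(1, 0, 2, 3, 4).reshape(K, B * H * W, D)
    # Euclidean distance from every latent to every entry of its own codebook
    dists = torch.cdist(flat, codebooks)              # (K, B*H*W, N)
    idx = dists.argmin(dim=-1)                        # (K, B*H*W)
    # Pick the selected entry from each codebook
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)  # (K, B*H*W, D)
    quantized = torch.gather(codebooks, 1, gather_idx)
    # Restore the original layout
    quantized = quantized.reshape(K, B, H, W, D).permute(1, 0, 2, 3, 4)
    idx = idx.reshape(K, B, H, W).permute(1, 0, 2, 3)
    return quantized, idx


# Shapes from the CIFAR10 case above: 10 codebooks, 8x8 latents of size 64.
z = torch.randn(1, 10, 8, 8, 64)
codebooks = torch.randn(10, 512, 64)  # 512 entries per codebook is an assumed value
q, idx = quantize(z, codebooks)
print(q.shape, idx.shape)  # torch.Size([1, 10, 8, 8, 64]) torch.Size([1, 10, 8, 8])
```

With K = 1 this reduces to the single-codebook case. How the 10 quantized slices are then fed to the decoder (for example, concatenated along the channel dimension) is not covered in this thread, so the sketch stops at the lookup.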

I see. Thanks for the explanation!