Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1

Question

pclucas14 opened this issue 5 years ago · comments

Hi,

I'm trying to improve results on CIFAR. I see you already have some potential improvements in mind. Could you help me understand what you mean by "Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1"? How would you combine the 10 dictionaries during training and testing?

Thanks!
Lucas

nadavbh12 · Answer 1 · Sun Jun 16 2019 01:53:21 GMT+0800 (China Standard Time)

Hi Lucas,
This note refers to how the VQ-VAE was actually trained in the paper.
I didn't get that on the first (few) readings, so I confirmed it with the authors.

For ImageNet, the encoder's output is a tensor of size 8x8x64.
If you have only one codebook, then for each of the 64 (=8x8) latents you perform nearest neighbor against that codebook, build a new 8x8x64 tensor from the selected entries, and pass it on to the decoder.
For CIFAR10, where you have 10 codebooks, the encoder's output is a tensor of size 10x8x8x64.
Running through the first dimension, each of the 64 (8x8) latents in a slice is matched by nearest neighbor against that slice's own codebook.
This way, every spatial location is described by 10 codes instead of 1 and can pack more information.
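
To make the shapes concrete, here is a minimal NumPy sketch of the per-codebook lookup described above. It is only an illustration under assumptions: the function name `quantize`, the codebook size `K = 32`, and the random inputs are hypothetical and not taken from the paper or from this repository, and training details (gradients, losses) are left out. The single-codebook case is the same code with a first dimension of 1.

```python
import numpy as np

def quantize(z, codebooks):
    """Nearest-neighbor lookup with one codebook per slice of the first dimension.

    z:         (G, H, W, D) encoder output, G = number of codebooks
    codebooks: (G, K, D) codebook entries, K entries per codebook (K is an assumption)
    returns:   indices of shape (G, H, W) and quantized tensor of shape (G, H, W, D)
    """
    G, H, W, D = z.shape
    flat = z.reshape(G, H * W, D)                            # (G, N, D), N = H * W
    # Squared Euclidean distance from each latent to every entry of its own codebook
    diff = flat[:, :, None, :] - codebooks[:, None, :, :]    # (G, N, K, D)
    dist = (diff ** 2).sum(axis=-1)                          # (G, N, K)
    idx = dist.argmin(axis=-1)                               # (G, N)
    # Replace each latent with the closest entry from the matching codebook
    quantized = codebooks[np.arange(G)[:, None], idx]        # (G, N, D)
    return idx.reshape(G, H, W), quantized.reshape(G, H, W, D)

# Hypothetical sizes: 10 codebooks, 8x8 spatial grid, 64 channels, 32 entries each
rng = np.random.default_rng(0)
G, H, W, D, K = 10, 8, 8, 64, 32
z = rng.normal(size=(G, H, W, D))
codebooks = rng.normal(size=(G, K, D))

idx, z_q = quantize(z, codebooks)
print(idx.shape, z_q.shape)  # (10, 8, 8) (10, 8, 8, 64)
```

Each spatial location ends up with 10 indices (one per codebook) instead of 1, and `z_q` is the tensor that would be passed on to the decoder.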

Lucas Caccia · Answer 2 · Mon Jun 17 2019 12:02:40 GMT+0800 (China Standard Time)

I see. Thanks for the explanation!