lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

What is the image input for inference?

SenHe opened this issue · comments

Thanks for this great work!

After going through the code, I have a couple of questions.

  1. In the first stage, training the discrete VAE, we already train a codebook. Why don't we reuse it in the second stage of training, instead of initializing a new codebook for the image tokens? (See the first sketch below.)

  2. During training we use the original image as input. At inference time, how is the image input set? Is it random noise of size 3x256x256? And how is the causal attention in the transformer handled during inference? (See the second sketch below.)
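To make the questions more concrete, here is a rough sketch of how I currently understand the two codebooks. Sizes and variable names are illustrative only, not the actual DALLE-pytorch defaults:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the actual DALLE-pytorch defaults.
NUM_IMAGE_TOKENS = 8192   # size of the discrete VAE codebook
CODEBOOK_DIM     = 512    # VAE codebook embedding dimension
MODEL_DIM        = 1024   # transformer hidden dimension

# Stage 1: the discrete VAE learns a codebook used to quantize images.
vae_codebook = nn.Embedding(NUM_IMAGE_TOKENS, CODEBOOK_DIM)

# Stage 2, as I read the code: the transformer does not reuse the VAE codebook
# weights. It creates a fresh embedding over the same token ids, sized to the
# transformer's hidden dimension.
image_token_emb = nn.Embedding(NUM_IMAGE_TOKENS, MODEL_DIM)

# Only the token *indices* produced by the VAE encoder seem to be shared
# between the two stages (shapes assumed), e.g.:
# image_token_ids = vae.get_codebook_indices(images)   # (batch, 32 * 32)
# tokens = image_token_emb(image_token_ids)            # new embedding, not the VAE codebook
```

And here is how I imagine inference would have to work if there is no image input at all: the transformer samples image tokens autoregressively under the causal mask, and the VAE decoder turns them back into pixels. Function and method names below are placeholders, not the exact API:

```python
import torch

@torch.no_grad()
def generate_image(transformer, vae, text_tokens, num_image_tokens=32 * 32):
    # Start from the text tokens only -- no image and no noise is fed in.
    seq = text_tokens                              # (1, text_len)
    for _ in range(num_image_tokens):
        logits = transformer(seq)                  # causal mask: each position attends only to earlier ones
        probs = logits[:, -1].softmax(dim=-1)      # distribution over the next image token
        next_token = torch.multinomial(probs, 1)   # sample one token
        seq = torch.cat((seq, next_token), dim=1)
    image_token_ids = seq[:, text_tokens.shape[1]:]
    return vae.decode(image_token_ids)             # map token ids back to a 3x256x256 image
```

Is this understanding correct, or is the image input handled differently at inference time?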

After reading the code, I also don't understand why there is a separate new codebook. Do you have any idea now?