LTH14 / rcg

PyTorch implementation of RCG https://arxiv.org/abs/2312.03701

Inquiry About Training and Inference in RCG Model from Your Recent Publication

Yuan1z0825 opened this issue

I recently read your fascinating paper titled "Self-conditioned Image Generation via Generating Representations" and have a question regarding the training and inference processes of the RCG model, particularly about the image masking strategy.
In the paper, it's mentioned that during training, the pixel generator is trained with partially masked images. However, during inference, images are fully masked. I am curious about how this difference in masking (partial during training and full during inference) affects the model's performance and its ability to reconstruct images.
Your insights into this aspect of the RCG model would be greatly appreciated, as they would deepen my understanding of your novel approach.

Thanks for your interest! During training, the masking ratio is randomly sampled between 50% and 100%, so it covers both the fully-masked and the partially-masked scenarios. During inference we use a multi-step parallel decoding strategy: generation starts from a 100% masked image, and the masked tokens are gradually filled in over several steps until all of them are generated. You might refer to the MaskGIT and MAGE papers for more detailed illustrations of the parallel decoding strategy.
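For anyone else reading this thread, here is a minimal sketch of the two ideas described above: sampling a random masking ratio in [50%, 100%] during training, and MaskGIT-style parallel decoding that starts from a fully masked sequence at inference. The names `transformer`, `mask_token_id`, `num_tokens`, and the cosine re-masking schedule are assumptions for illustration, not the actual RCG/MAGE code.

```python
import math
import torch

def sample_training_mask(num_tokens, min_ratio=0.5, max_ratio=1.0):
    """Randomly mask between 50% and 100% of the tokens (training-time masking)."""
    ratio = torch.empty(1).uniform_(min_ratio, max_ratio).item()
    num_masked = int(num_tokens * ratio)
    perm = torch.randperm(num_tokens)
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

@torch.no_grad()
def parallel_decode(transformer, num_tokens, mask_token_id, num_steps=8):
    """MaskGIT-style iterative decoding sketch: start fully masked, keep the most
    confident predictions each step, and re-mask the rest until nothing is masked."""
    tokens = torch.full((1, num_tokens), mask_token_id, dtype=torch.long)
    for step in range(num_steps):
        logits = transformer(tokens)               # (1, num_tokens, vocab_size), assumed signature
        probs = logits.softmax(dim=-1)
        confidence, predicted = probs.max(dim=-1)  # per-token confidence and argmax prediction

        still_masked = tokens == mask_token_id
        # Cosine schedule (as in MaskGIT): fraction of tokens left masked after this step.
        keep_masked_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_keep_masked = int(num_tokens * keep_masked_ratio)

        # Fill every masked position with its current prediction...
        tokens = torch.where(still_masked, predicted, tokens)
        if num_keep_masked > 0:
            # ...then re-mask the least confident of the newly filled positions.
            confidence = confidence.masked_fill(~still_masked, float("inf"))
            remask = confidence.topk(num_keep_masked, largest=False).indices
            tokens[0, remask] = mask_token_id
    return tokens
```

At the last step the schedule reaches zero, so no positions are re-masked and the sequence is fully generated; in RCG the transformer would additionally be conditioned on the generated representation, which is omitted here for brevity.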

Thank you for your thoughtful answers to my questions. I will carefully look into the work on MaskGIT and MAGE.