Free-text-guided generation beyond class-conditional models
Jake-wei opened this issue
Thank you for your great work. Your implementation is based on class-conditional MAR models, which is limiting. Is there any plan to train an MAR model guided by free text instead of class labels?
Also, it seems that the shape of the noise tensor at inference time is constrained by the token embedding dimensions, so the MAR model can only generate 256x256 images?
Thanks for your interest. Our models are trained on ImageNet, which is a class-conditional image generation benchmark. Text-to-image generation requires training on a large-scale image-text dataset and is an interesting future direction.

MAR could potentially be used for image outpainting, since it is AR-based (similar to MaskGIT). In this way, you can generate images of any size by repeatedly outpainting the existing image.
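For concreteness, here is a minimal sketch of where the 256x256 limit comes from, assuming the repo's KL-16 tokenizer (16x spatial downsampling per axis); treat the stride as an assumption if your checkpoint uses a different VAE:

```python
# Why seq_len = 256 corresponds to 256x256 images (assumption: KL-16 VAE, stride 16).
vae_stride = 16                 # pixels per latent token along each axis
seq_len = 256                   # number of tokens the model samples
grid = int(seq_len ** 0.5)      # 256 tokens -> a 16x16 latent grid
resolution = grid * vae_stride  # 16 tokens * 16 px/token = 256 px per side
print(grid, resolution)         # -> 16 256
```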
In the sample_tokens function of the MAR class, seq_len is set to 256, which fixes the generation resolution at 256x256. Can seq_len be set to other values to generate images larger than 256?
The seq_len is set to 256 and cannot be changed, because it is tied to the learned positional embeddings. One trick you can play (similar to MaskGIT) is that once you generate a 256x256 image, you can take its right half as the unmasked left half of a new image and then generate the right half of that new image. This extends your image to 256x384, and it is what I meant by "keeping outpainting the existing image".
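A minimal sketch of that trick in token space. Note that `sample_with_context` is a hypothetical helper, not part of the repo: it would run the masked-AR sampler on a 16x16 grid with some positions pre-filled with known tokens (the repo's `sample_tokens` starts from a fully masked grid, so it would need a small modification to support this):

```python
import torch

def outpaint_right(tokens_16x16, sample_with_context, n_extensions=1):
    """Extend a generated image rightward by re-running the sampler on shifted windows.

    tokens_16x16: (16, 16, C) latent tokens of an already generated 256x256 image.
    sample_with_context: hypothetical callable (tokens, known_mask) -> tokens that
        fills in the masked positions of a 16x16 grid while keeping known ones fixed.
    Each extension shifts the window right by half a frame, adding 128 pixels.
    """
    canvas = [tokens_16x16]                     # list of (16, 16, C) frames
    for _ in range(n_extensions):
        prev = canvas[-1]
        new = torch.zeros_like(prev)
        new[:, :8] = prev[:, 8:]                # right half becomes the known left half
        known = torch.zeros(16, 16, dtype=torch.bool)
        known[:, :8] = True
        new = sample_with_context(new, known)   # sampler fills the unknown right half
        canvas.append(new)
    # Stitch: full first frame, then only the newly generated right half of each frame.
    return torch.cat([canvas[0]] + [f[:, 8:] for f in canvas[1:]], dim=1)
```

Each sampling call still operates on a 16x16 window, so the learned positional embeddings are untouched; decoding the stitched grid with the VAE yields a 256x(256 + 128*n) image, possibly with visible seams at window boundaries since each window only conditions on its own local context.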