Free-text-guided generation beyond class-conditional models
Jake-wei opened this issue
Thank you for your great work. Your implementation is based on class-conditional MAR models, which is limiting. Is there any plan to train an MAR model guided by free text instead of class labels?
Also, it seems that the shape of the noise tensor at inference time is constrained by the token embedding dimensions, so the MAR model can only generate 256x256 images?
Thanks for your interest. Our models are trained on ImageNet, which is a class-conditional image generation benchmark. Text-to-image generation requires training on a large-scale image-text dataset and is an interesting future direction.

MAR could potentially be used for image outpainting, since it is AR-based (similar to MaskGIT). In this way, you can generate images of any size by repeatedly outpainting the existing image.
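For concreteness, here is a minimal sketch of where the 256x256 limit comes from, assuming the repo's KL-16 tokenizer (16x spatial downsampling per axis); treat the stride as an assumption if your checkpoint uses a different VAE:

```python
# Why seq_len = 256 corresponds to 256x256 images (assumption: KL-16 VAE, stride 16).
vae_stride = 16                 # pixels per latent token along each axis
seq_len = 256                   # number of tokens the model samples
grid = int(seq_len ** 0.5)      # 256 tokens -> a 16x16 latent grid
resolution = grid * vae_stride  # 16 tokens * 16 px/token = 256 px per side
print(grid, resolution)         # -> 16 256
```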
In the sample_tokens function of the MAR class, seq_len is set to 256, which fixes the generation resolution at 256x256. Can seq_len be set to other values to generate images larger than 256?
The seq_len is set to 256 and cannot be changed, because it is tied to the learned positional embeddings. One trick you can play (similar to MaskGIT) is that once you generate a 256x256 image, you can take its right half as the unmasked left half of a new image and then generate the right half of that new image. This extends your image to 256x384, and it is what I meant by "keeping outpainting the existing image".
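A minimal sketch of that trick in token space. Note that `sample_with_context` is a hypothetical helper, not part of the repo: it would run the masked-AR sampler on a 16x16 grid with some positions pre-filled with known tokens (the repo's `sample_tokens` starts from a fully masked grid, so it would need a small modification to support this):

```python
import torch

def outpaint_right(tokens_16x16, sample_with_context, n_extensions=1):
    """Extend a generated image rightward by re-running the sampler on shifted windows.

    tokens_16x16: (16, 16, C) latent tokens of an already generated 256x256 image.
    sample_with_context: hypothetical callable (tokens, known_mask) -> tokens that
        fills in the masked positions of a 16x16 grid while keeping known ones fixed.
    Each extension shifts the window right by half a frame, adding 128 pixels.
    """
    canvas = [tokens_16x16]                     # list of (16, 16, C) frames
    for _ in range(n_extensions):
        prev = canvas[-1]
        new = torch.zeros_like(prev)
        new[:, :8] = prev[:, 8:]                # right half becomes the known left half
        known = torch.zeros(16, 16, dtype=torch.bool)
        known[:, :8] = True
        new = sample_with_context(new, known)   # sampler fills the unknown right half
        canvas.append(new)
    # Stitch: full first frame, then only the newly generated right half of each frame.
    return torch.cat([canvas[0]] + [f[:, 8:] for f in canvas[1:]], dim=1)
```

Each sampling call still operates on a 16x16 window, so the learned positional embeddings are untouched; decoding the stitched grid with the VAE yields a 256x(256 + 128*n) image, possibly with visible seams at window boundaries since each window only conditions on its own local context.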