ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io


The Illustrated VQGAN

utterances-bot opened this issue · comments

The Illustrated VQGAN

VQGAN allows us to generate high-resolution images from text, and has now taken art Twitter by storm. Let me talk about how it works on a conceptual level in...

https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/


Great article! Especially liked the Excalidraw visuals and your interpretation of how the community came to arrive at VQGAN-CLIP.

AFAIK there were attempts before VQGAN that utilized BigGAN, not too sure how we ended up with VQGAN being the dominant variation in the circulated Colab notebooks

Hi @tmwei thanks a lot!

AFAIK there were attempts before VQGAN that utilized BigGAN,

Agree, I noticed when I was compiling a list of CLIP+xGAN implementations that there were initial versions using BigGAN.

not too sure how we ended up with VQGAN being the dominant variation in the circulated Colab notebooks

Your guess is as good as mine! I'd wager that social media played a big role? It's the perfect storm: generative artists minting NFTs, promoting generative artwork on Twitter, new Colab notebooks by @crowsonkb etc.

I think VQGAN ended up being the dominant version because people liked the aesthetics of the outputs better. Also the ratio of good outputs to bad outputs for a given prompt is considerably better. My z+quantize method is particularly popular (I have two VQGAN methods) and I think the reason for that is it produces outputs that begin to look like the prompt in as few as 50 iterations. People like the instant gratification.
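
For readers curious what "z + quantize" looks like in practice, here is a rough sketch of the general idea only: optimize a continuous latent, snap it to the nearest codebook entries on the forward pass, decode, and score the decoded image against the text prompt with CLIP. The `vqgan` and `clip_model` objects are hypothetical stand-ins, not the actual notebook code.

```python
# Rough sketch only: hypothetical `vqgan` / `clip_model` stand-ins, not the real notebook code.
import torch

def vector_quantize(z, codebook):
    # Snap each latent vector to its nearest codebook entry, but let gradients
    # flow straight through to z (straight-through estimator).
    dists = torch.cdist(z.flatten(0, -2), codebook)       # (num_latents, codebook_size)
    codes = codebook[dists.argmin(dim=-1)].view_as(z)     # nearest codebook entries
    return z + (codes - z).detach()

def generate(vqgan, clip_model, prompt, steps=500, lr=0.1):
    text_feat = clip_model.encode_text(prompt)            # target direction in CLIP space
    z = torch.randn(1, 16, 16, vqgan.codebook.shape[1], requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = vqgan.decode(vector_quantize(z, vqgan.codebook))
        img_feat = clip_model.encode_image(image)
        loss = -torch.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqgan.decode(vector_quantize(z, vqgan.codebook))
```

The quantization step keeps the optimized latent on the VQGAN's learned "visual vocabulary," which is presumably part of why the outputs start resembling the prompt after so few iterations.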

Hi @crowsonkb thanks a lot for enlightening us 🙇 Looking back, I agree, the aesthetics of VQGAN appear to be better. Instant gratification adds to the "wow factor".

Thanks again for your huge contributions to the generative art community! Your Colab notebooks have inspired and awed many :)


u/Wiskkey is tallying the explosion of VQGAN projects for image generation at https://www.reddit.com/user/Wiskkey/comments/p2j673/list_part_created_on_august_11_2021/.

Given the recent release of guided diffusion notebooks, it might not be too long before that becomes the mainstream backbone instead. Interesting times for AI art!

Thank you so much for this article. I have a long way to go, but this article gave me a solid foothold at the base of the mountain :)

Hey @smawpaw glad it helped, wishing you well in your journey :)

This is the best article I've seen so far explaining the architecture. Thank you so much LJ!

I originally thought it was the GAN inside the VQGAN that was doing the image synthesis. But it turns out to be the Transformer that's doing the job? Would you mind sharing some posts or videos about how the Transformer in VQGAN actually generates the image? I looked up "Vision Transformers" but it seems they're all about using it for classification (and how they're beating CNNs...). Maybe I used the wrong term?

I'm fairly new to this world so any help would be greatly appreciated.

Hi @tjwangml, you can definitely check out Image Transformers, which specifically discusses image generation using transformers. Not sure if that's the one that showed up in your search results.

But it turns out to be the Transformer that's doing the job?

The GAN learns the objects/symbols (cats, looking up, city, night, etc.) and the Transformer paints those objects together (i.e. learning the long-range dependencies and synthesizing them).
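
A very rough sketch of that split, with hypothetical names rather than the actual taming-transformers API, just to make the two roles concrete:

```python
# Hedged sketch: `vqgan` and `transformer` are stand-ins, not the real API.
import torch
import torch.nn.functional as F

def images_to_code_sequences(vqgan, images):
    # Stage 1 (the GAN part): encode and quantize images into a grid of
    # discrete codebook indices -- the "visual words" (cat, sky, brick, ...).
    z = vqgan.encode(images)
    indices = vqgan.quantize(z)          # e.g. a 16x16 grid of integers per image
    return indices.flatten(1)            # flatten the grid into a sequence

def transformer_loss(transformer, indices):
    # Stage 2 (the Transformer part): plain next-token prediction over the code
    # indices, just like language modelling -- this is where the long-range
    # relationships between parts of the image get learned.
    logits = transformer(indices[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           indices[:, 1:].reshape(-1))
```

At generation time the process runs in reverse: sample a sequence of code indices with the transformer and hand the grid to the VQGAN decoder to paint the final image.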

Wonderful article! I think this part is a bit confusing: "The second term, $\log(1-D(\hat{x}))$ measures the probability of the discriminator $D$ to say that a generated instance $\hat{x}$ is real." The way I read it, it measures the complement of that probability instead. I think replacing these two sentences with:

\begin{itemize}
\item In the first term, $D(x)$ is the estimate (made by the discriminator) of the probability that a real data instance $x$ is actually real.
\item In the second term, $D(\hat{x})$ is the estimate (made by the discriminator) of the probability that a generated instance $\hat{x}$ is real.
\end{itemize}

would make it clearer.
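
For reference, both terms come from the standard (minimax) GAN objective, where $D(\cdot)$ is the discriminator's estimated probability that its input is real, so the discriminator pushes $D(x)$ toward 1 for real data and $D(\hat{x})$ toward 0 for generated data:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{\hat{x} \sim p_G}\big[\log\big(1 - D(\hat{x})\big)\big]$$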

@Orpheous1, thanks for catching that. You're right. Will update it in a while :)

Reading through the description, the comments, and looking at the diagrams, I don't see where you explain how the images are generated. The transformer learns long-range interactions, but where does the output of this learning go? The diagram just stops, and I couldn't see where in the narrative you discuss what happens next.
Am I missing something?

Hi @markstrefford, image generation happens during patch-based sampling (last figure before the Conclusion). You can think of it as sampling different parts of the image, while looking at the context of each neighbor as you go along.

I'm more interested in the novelty of VQGAN, i.e., how it efficiently represents those long-range interactions (why does each neighbor / part of the image tend to have a good semantic relationship with the others? "sky is always up, ground is always down," etc.), and why it can do this at high resolution.

In the paper, it's pages 4-5 section 3.2. Hope it makes things clearer!
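
A rough sketch of what that patch-based, sliding-window sampling might look like in code (hypothetical `transformer` stand-in, simplified to an upper-left context window; not the paper's actual implementation):

```python
# Hedged sketch: each code index is sampled conditioned only on a local window
# of already-sampled neighbours, which keeps the attention cost constant even
# for large (high-resolution) code grids.
import torch

def sample_code_grid(transformer, grid_h=32, grid_w=32, window=16):
    grid = torch.zeros(grid_h, grid_w, dtype=torch.long)
    for i in range(grid_h):
        for j in range(grid_w):
            # Crop the local neighbourhood above and to the left of (i, j).
            top, left = max(0, i - window + 1), max(0, j - window + 1)
            patch = grid[top:i + 1, left:j + 1].flatten()
            # Keep only the cells sampled so far; index 0 stands in for an
            # assumed start/padding token so the context is never empty.
            context = torch.cat([torch.zeros(1, dtype=torch.long), patch[:-1]]).unsqueeze(0)
            logits = transformer(context)[:, -1]              # logits for the code at (i, j)
            grid[i, j] = torch.multinomial(logits.softmax(dim=-1), 1).item()
    return grid   # decode this grid of indices with the VQGAN decoder to get the image
```

The window shape here is a simplification; the point is just that each part of the image is sampled while attending to its already-generated neighbors, which is what keeps the global structure consistent at high resolution.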

This article is really amazing!

WOW, amazing article! I really love the graphs you included to explain things as well. Thank you for your work!

I appreciate those who invested their brains in this system. I know a guy who only knows how to run Python code and built a business using this amazing software.

Thank you for the great article.
I have a question about unconditional synthesis.
The embeddings from the encoder don't seem to follow a Gaussian distribution.
So how are latent vectors sampled for the transformer?
Is it just random sampling of vectors from the codebook?

I want to modify my question above.
It seems that after selecting some code vectors, the transformer predicts the next code index autoregressively.
That's how new images are generated, right?
But how are the initial code indices for the transformer selected in the first place?
Just random sampling?

Great explanation, thank you!

A great and easy-to-understand blog post that really lit up my day! Thanks!