ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io


The Illustrated VQGAN

utterances-bot opened this issue · comments

The Illustrated VQGAN

VQGAN allows us to generate high-resolution images from text, and has now taken art Twitter by storm. Let me talk about how it works on a conceptual level in...

https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/


Great article! Especially liked the Excalidraw visuals and your interpretation of how the community came to arrive at VQGAN-CLIP.

AFAIK there were attempts before VQGAN that utilized BigGAN, not too sure how we ended up with VQGAN being the dominant variation in the circulated Colab notebooks

Hi @tmwei thanks a lot!

AFAIK there were attempts before VQGAN that utilized BigGAN,

Agree, I noticed when I was compiling a list of CLIP+xGAN implementations that there were initial versions using BigGAN.

not too sure how we ended up with VQGAN being the dominant variation in the circulated Colab notebooks

Your guess is as good as mine! I'd wager that social media played a big role? It's the perfect storm: generative artists minting NFTs, promoting generative artwork on Twitter, new Colab notebooks by @crowsonkb etc.

I think VQGAN ended up being the dominant version because people liked the aesthetics of the outputs better. Also the ratio of good outputs to bad outputs for a given prompt is considerably better. My z+quantize method is particularly popular (I have two VQGAN methods) and I think the reason for that is it produces outputs that begin to look like the prompt in as few as 50 iterations. People like the instant gratification.
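
For readers curious what "z + quantize" looks like in practice, here is a rough sketch of the general idea only: optimize a continuous latent, snap it to the nearest codebook entries on the forward pass, decode, and score the decoded image against the text prompt with CLIP. The `vqgan` and `clip_model` objects are hypothetical stand-ins, not the actual notebook code.

```python
# Rough sketch only: hypothetical `vqgan` / `clip_model` stand-ins, not the real notebook code.
import torch

def vector_quantize(z, codebook):
    # Snap each latent vector to its nearest codebook entry, but let gradients
    # flow straight through to z (straight-through estimator).
    dists = torch.cdist(z.flatten(0, -2), codebook)       # (num_latents, codebook_size)
    codes = codebook[dists.argmin(dim=-1)].view_as(z)     # nearest codebook entries
    return z + (codes - z).detach()

def generate(vqgan, clip_model, prompt, steps=500, lr=0.1):
    text_feat = clip_model.encode_text(prompt)            # target direction in CLIP space
    z = torch.randn(1, 16, 16, vqgan.codebook.shape[1], requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = vqgan.decode(vector_quantize(z, vqgan.codebook))
        img_feat = clip_model.encode_image(image)
        loss = -torch.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vqgan.decode(vector_quantize(z, vqgan.codebook))
```

The quantization step keeps the optimized latent on the VQGAN's learned "visual vocabulary," which is presumably part of why the outputs start resembling the prompt after so few iterations.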

Hi @crowsonkb thanks a lot for enlightening us 🙇 Looking back, I agree, the aesthetics of VQGAN appear to be better. Instant gratification adds to the "wow factor".

Thanks again for your huge contributions to the generative art community! Your Colab notebooks have inspired and awed many :)


u/Wiskkey is tallying the explosion of VQGAN projects for image generation at https://www.reddit.com/user/Wiskkey/comments/p2j673/list_part_created_on_august_11_2021/.

Given the recent release of guided diffusion notebooks, it might not be too long before that becomes the mainstream backbone instead. Interesting times for AI art!

Thank you so much for this article. I have a long way to go, but this article gave me a solid foothold at the base of the mountain :)

Hey @smawpaw glad it helped, wishing you well in your journey :)

This is the best article I've seen so far explaining the architecture. Thank you so much LJ!

I originally thought it was the GAN inside the VQGAN that was doing the image synthesis. But it turns out to be the Transformer that's doing the job? Would you mind sharing some posts or videos about how the Transformer in VQGAN actually generates the image? I looked up "Vision Transformers" but it seems they're all about using it for classification (and how they're beating CNNs...). Maybe I used the wrong term?

I'm fairly new to this world so any help would be greatly appreciated.

Hi @tjwangml, you can definitely check out Image Transformers, which specifically discusses image generation using transformers. Not sure if that's the one that showed up in your search results.

But it turns out to be the Transformer that's doing the job?

The GAN learns the objects/symbols (cats, looking up, city, night, etc.) and the Transformer paints those objects together (i.e. learning the long-range dependencies and synthesizing them).
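
A very rough sketch of that split, with hypothetical names rather than the actual taming-transformers API, just to make the two roles concrete:

```python
# Hedged sketch: `vqgan` and `transformer` are stand-ins, not the real API.
import torch
import torch.nn.functional as F

def images_to_code_sequences(vqgan, images):
    # Stage 1 (the GAN part): encode and quantize images into a grid of
    # discrete codebook indices -- the "visual words" (cat, sky, brick, ...).
    z = vqgan.encode(images)
    indices = vqgan.quantize(z)          # e.g. a 16x16 grid of integers per image
    return indices.flatten(1)            # flatten the grid into a sequence

def transformer_loss(transformer, indices):
    # Stage 2 (the Transformer part): plain next-token prediction over the code
    # indices, just like language modelling -- this is where the long-range
    # relationships between parts of the image get learned.
    logits = transformer(indices[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           indices[:, 1:].reshape(-1))
```

At generation time the process runs in reverse: sample a sequence of code indices with the transformer and hand the grid to the VQGAN decoder to paint the final image.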

Wonderful article! I think this part is a bit confusing: "The second term, $\log(1-D(\hat{x}))$ measures the probability of the discriminator $D$ to say that a generated instance $\hat{x}$ is real." The way I read it, it measures the complement of that probability instead. I think replacing these two sentences with:

\begin{itemize}
\item In the first term, $D(x)$ is the estimate (made by the discriminator) of the probability that a real data instance $x$ is actually real.
\item In the second term, $D(\hat{x})$ is the estimate (made by the discriminator) of the probability that a generated instance $\hat{x}$ is real.
\end{itemize}

would make it clearer.
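
For reference, both terms come from the standard (minimax) GAN objective, where $D(\cdot)$ is the discriminator's estimated probability that its input is real, so the discriminator pushes $D(x)$ toward 1 for real data and $D(\hat{x})$ toward 0 for generated data:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{\hat{x} \sim p_G}\big[\log\big(1 - D(\hat{x})\big)\big]$$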

@Orpheous1, thanks for catching that. You're right. Will update it in a while :)

Reading through the description, the comments, and looking at the diagrams, I don't see where you explain how the images are generated. The transformer learns long-range interactions, but where does the output of this learning go? The diagram just stops, and I couldn't see where in the narrative you discuss what happens next.
Am I missing something?

Hi @markstrefford, image generation happens during patch-based sampling (last figure before the Conclusion). You can think of it as sampling different parts of the image, while looking at the context of each neighbor as you go along.

I'm more interested in the novelty of VQGAN, i.e., how it efficiently represents those long-range interactions (why does each neighbor / part of the image tend to have a good semantic relationship with the others? "sky is always up, ground is always down," etc.), and why it can do this at high resolution.

In the paper, it's pages 4-5 section 3.2. Hope it makes things clearer!
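
A rough sketch of what that patch-based, sliding-window sampling might look like in code (hypothetical `transformer` stand-in, simplified to an upper-left context window; not the paper's actual implementation):

```python
# Hedged sketch: each code index is sampled conditioned only on a local window
# of already-sampled neighbours, which keeps the attention cost constant even
# for large (high-resolution) code grids.
import torch

def sample_code_grid(transformer, grid_h=32, grid_w=32, window=16):
    grid = torch.zeros(grid_h, grid_w, dtype=torch.long)
    for i in range(grid_h):
        for j in range(grid_w):
            # Crop the local neighbourhood above and to the left of (i, j).
            top, left = max(0, i - window + 1), max(0, j - window + 1)
            patch = grid[top:i + 1, left:j + 1].flatten()
            # Keep only the cells sampled so far; index 0 stands in for an
            # assumed start/padding token so the context is never empty.
            context = torch.cat([torch.zeros(1, dtype=torch.long), patch[:-1]]).unsqueeze(0)
            logits = transformer(context)[:, -1]              # logits for the code at (i, j)
            grid[i, j] = torch.multinomial(logits.softmax(dim=-1), 1).item()
    return grid   # decode this grid of indices with the VQGAN decoder to get the image
```

The window shape here is a simplification; the point is just that each part of the image is sampled while attending to its already-generated neighbors, which is what keeps the global structure consistent at high resolution.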

This article is really amazing!

WOW, amazing article! I really love the graphs you included to explain things as well. Thank you for your work!

I appreciate those who invested their brains in this system. I know a guy who only knows how to run Python code and built a business using this amazing software.

Thank you for the great article.
I have a question about unconditional synthesis.
The embeddings from the encoder don't seem to follow a Gaussian distribution.
So how are latent vectors sampled for the transformer?
Is it just random sampling of vectors from the codebook?

I want to modify my question above.
It seems that after selecting some code vectors, the transformer predicts the next code index autoregressively.
That's how new images are generated, right?
But how are the initial code indices for the transformer selected in the first place?
Just random sampling?

Great explanation, thank you!

A great and easy-to-understand blog post that really lit up my day! Thanks!