Paper Reading Thread
TiankaiHang opened this issue · comments
Magvit V2
LANGUAGE MODEL BEATS DIFFUSION — TOKENIZER IS KEY TO VISUAL GENERATION
Core hypothesis of the paper:
Why do language models lag behind diffusion models in visual generation? Because they lack a good visual representation.
Advantages of discrete visual tokens
- Compatibility with LLMs
- Compressed representation
- Visual understanding benefits
Two techniques
- a novel lookup-free quantization method enables the learning of a large vocabulary that is able to improve generation quality of the language model.
- identified modifications to the tokenizer that not only enhance generation quality but also enable the tokenization of both images and videos using a shared vocabulary.
Tasks
visual generation, video compression, and action recognition
Method
LOOKUP-FREE QUANTIZER
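For VQ-VAE there is an observation: a larger codebook is not always better. Enlarging it can even hurt generation performance.
One trick is to reduce the code dimension while enlarging the codebook.
Lookup-Free Quantization (LFQ). Motivated by the observation above, the codebook goes from
How exactly is it designed?
Feature vector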
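The notes above are cut off, so here is a minimal sketch of LFQ as I understand it from the paper: there is no embedding lookup, each latent dimension is quantized independently to -1 or +1 by its sign, and the token index is the sign pattern read as a binary number. The function names and the example vector are made up for illustration.

```python
def lfq_quantize(z):
    """Quantize each latent dimension independently to -1 or +1 by its sign."""
    return [1 if v > 0 else -1 for v in z]


def lfq_token_index(z):
    """Read the sign pattern as a binary number: bit i is set when z[i] > 0."""
    return sum(1 << i for i, v in enumerate(z) if v > 0)


# Hypothetical encoder output with 4 latent dimensions.
z = [0.7, -1.2, 0.1, -0.3]
codes = lfq_quantize(z)        # sign pattern of z
index = lfq_token_index(z)     # bits 0 and 2 are set
vocab_size = 2 ** len(z)       # vocabulary grows as 2^dims without any lookup table
```

With this scheme, 18 latent dimensions would already give a vocabulary of 2^18 tokens, which is how a large vocabulary stays cheap.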
VISUAL TOKENIZER MODEL IMPROVEMENT
Joint image-video tokenization
Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
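I read this paper mainly because Sora may have used techniques from it.
Method
NaViT is built on the original ViT, but in principle any ViT variant that operates on a sequence of patches can be used. To enable Patch n' Pack, we make the following architectural modifications.
Architectural changes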
- Masked self attention and masked pooling
Tokens of each image attend only to tokens of the same image, and pooling is likewise done only over the tokens that belong to each sample.
- Factorized & fractional positional embeddings
image resolution
vanilla ViT: learns a 1-D positional embedding of length
Pix2struct: 2D absolute positional embeddings, [maxLen, maxLen]. Drawback: every combination of (x, y) coordinates must be seen during training.
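A minimal sketch of the masked self-attention and masked pooling idea from the list above, assuming each token in a packed sequence carries the id of the image it came from and padding tokens carry id -1. The helper names are hypothetical, and mean pooling is my assumption for the pooling operator.

```python
def packed_attention_mask(example_ids):
    """mask[i][j] is True when token i may attend to token j (same image, not padding)."""
    return [[a == b and a >= 0 for b in example_ids] for a in example_ids]


def masked_mean_pool(tokens, example_ids):
    """Average token features per image, skipping padding tokens (id -1)."""
    sums, counts = {}, {}
    for feat, ex in zip(tokens, example_ids):
        if ex < 0:
            continue
        acc = sums.setdefault(ex, [0.0] * len(feat))
        for d, v in enumerate(feat):
            acc[d] += v
        counts[ex] = counts.get(ex, 0) + 1
    return {ex: [v / counts[ex] for v in acc] for ex, acc in sums.items()}


# One packed sequence: image 0 has 2 tokens, image 1 has 3 tokens, 1 padding token.
ids = [0, 0, 1, 1, 1, -1]
mask = packed_attention_mask(ids)
pooled = masked_mean_pool([[1.0], [3.0], [2.0], [4.0], [6.0], [100.0]], ids)
```

In this sketch padding rows are fully masked; a real attention implementation would need to handle such rows so the softmax does not produce NaNs.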
Training tricks
Implementation details
Packing is implemented greedily and paired with a data sampling strategy, which keeps the padding fraction below 2%.
Packing of examples into sequences is done alongside batching. A simple greedy approach is used which adds examples to the first sequence with enough remaining space. Once no more examples can fit, sequences are filled with padding tokens, yielding the fixed sequence lengths needed for batched operations. Such simple packing algorithm can lead to a significant padding, depending on the distribution of length of inputs. There are several methods to address such limitations, like bin packing (Krell et al., 2021), which allows minimizing the padding. Here, in NaViT, since controlling the resolutions we sample, we can ensure efficient packing by tuning the sequence length and limit padding to less than 2%.
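The greedy first-fit rule quoted above can be sketched as follows. This is a toy version under my own assumptions (it opens a new sequence whenever nothing fits, and `greedy_pack` is a made-up name), not NaViT's actual data pipeline.

```python
def greedy_pack(lengths, seq_len):
    """First-fit packing: place each example in the first sequence with enough room."""
    sequences = []   # sequences[s] holds the indices of the examples packed into it
    remaining = []   # remaining[s] is the number of free token slots left
    for idx, n in enumerate(lengths):
        if n > seq_len:
            raise ValueError(f"example {idx} has {n} tokens, more than seq_len={seq_len}")
        for s, free in enumerate(remaining):
            if n <= free:
                sequences[s].append(idx)
                remaining[s] -= n
                break
        else:
            # No existing sequence has room, so start a new one.
            sequences.append([idx])
            remaining.append(seq_len - n)
    total = len(sequences) * seq_len
    padding_fraction = sum(remaining) / total if total else 0.0
    return sequences, padding_fraction


packed, pad = greedy_pack([6, 5, 4, 3, 2], seq_len=10)
loose, loose_pad = greedy_pack([7, 7], seq_len=10)
```

The second call shows why the length distribution matters: two 7-token examples cannot share a 10-token sequence, so 30% of the slots end up as padding.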
Photorealistic Video Generation with Diffusion Models
aka W.A.L.T
Task: Video Generation
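The pipeline is fairly simple. A causal encoder first compresses the video into latent codes, and a diffusion model with a transformer backbone then models the distribution of those latent codes. The key questions are how this encoder is implemented and how the transformer is designed for training.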
Learning Visual Tokens
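Video sequence
The paper's description of the MAGVIT-v2 causal CNN looks like this, and it is actually a bit hard to follow
My understanding: when a regular 3D conv processes a frame, it looks at some earlier frames and also at some later frames, whereas here it is forced to look only at earlier frames. Calling it a shifted 3D conv would not be wrong either. Another benefit of this operation is that the first frame can be processed on its own.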
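To make the causal idea concrete, here is a toy 1-D version along the time axis only, with one scalar per frame. It assumes, as I read the paper, that the causal variant puts all `k - 1` padding frames in front of the clip and none after it; `temporal_conv` is a made-up helper, not code from MAGVIT-v2.

```python
def temporal_conv(frames, kernel, causal=True):
    """Convolve along time. Causal: all k-1 padding frames go in front, none behind."""
    k = len(kernel)
    if causal:
        pad_front, pad_back = k - 1, 0
    else:
        pad_front = (k - 1) // 2
        pad_back = k - 1 - pad_front
    padded = [0.0] * pad_front + list(frames) + [0.0] * pad_back
    # Output t sees padded[t : t + k], which is frames[t - pad_front : t - pad_front + k].
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(frames))]


frames = [1, 2, 3, 4]
causal_out = temporal_conv(frames, [1, 1, 1], causal=True)
regular_out = temporal_conv(frames, [1, 1, 1], causal=False)
# Changing the last frame must not affect earlier outputs in the causal case.
changed_out = temporal_conv([1, 2, 3, 99], [1, 1, 1], causal=True)
```

In the causal case the first output depends only on the first frame, which is what lets a single image be encoded as a one-frame video.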
Other improvements
- In the downsampler, average pooling is replaced with strided conv
- Upsampler in the decoder: nearest resizing is replaced with conv with a depth-to-space operator.
- Temporal downsampling is deferred to the last few encoder blocks.
- the downsampling layer in the discriminator now utilizes 3D blur pooling to encourage shift invariance
- add one adaptive group normalization layer before the residual blocks at each resolution in the decoder to pass in the quantized latents as the control signal following StyleGAN
About depth to space
DepthToSpace rearranges (permutes) data from depth into blocks of spatial data. This is the reverse transformation of SpaceToDepth. More specifically, this op outputs a copy of the input tensor where values from the depth dimension are moved in spatial blocks to the height and width dimensions.
Input tensor of [N,C,H,W], Output tensor of [N, C/(blocksize * blocksize), H * blocksize, W * blocksize].
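From this description it looks like it is just PixelShuffle?
One difference between WALT and the original MAGVIT-v2: because diffusion is used downstream, all latent codes here are continuous.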
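A small pure-Python sketch of the rearrangement for one image stored as nested lists `[C][H][W]`. The channel ordering follows the PyTorch PixelShuffle convention as far as I know (output channel `c` reads input channels `c*r*r` to `c*r*r + r*r - 1`); depth-to-space operators in other frameworks may order the channels differently, so take this as an illustration of the shape change rather than an exact equivalence.

```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] into [C][H*r][W*r] using the PixelShuffle channel order."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[None] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Position inside the r x r block selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four 1x1 channels become one 2x2 channel.
x = [[[0]], [[1]], [[2]], [[3]]]
y = pixel_shuffle(x, 2)
```

The shapes match the quoted description: C drops by a factor of blocksize squared while H and W each grow by blocksize.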
Learning to Generate Images and Videos
After the processing in the previous section, Learning Visual Tokens, we already have a
Patchify
Similar to the operation in ViT, followed by learnable position embeddings
We use learnable positional embeddings [73], which are the sum of space and time positional embeddings.
Position embeddings are added to the linear projections [18] of the patches. Note that for images, we simply add the temporal position embedding corresponding to the first latent frame
Window attention
Two types
- Spatial Window (SW) attention, which operates within each frame, window size $1 \times h_p \times w_p$
- Spatiotemporal Window (STW) attention, window size $(1 + t) \times h'_p \times w'_p$
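A rough sketch of how the two window types group tokens, assuming non-overlapping windows and a latent grid whose sizes are divisible by the window sizes. `window_partition` is a hypothetical helper; attention would then be computed only among the tokens inside each returned group.

```python
def window_partition(t, h, w, wt, wh, ww):
    """Group token coordinates (frame, y, x) of a t x h x w grid into wt x wh x ww windows."""
    if t % wt or h % wh or w % ww:
        raise ValueError("grid size must be divisible by the window size")
    windows = {}
    for ti in range(t):
        for y in range(h):
            for x in range(w):
                key = (ti // wt, y // wh, x // ww)
                windows.setdefault(key, []).append((ti, y, x))
    return list(windows.values())


# Toy latent grid: 2 frames of 4 x 4 tokens.
spatial = window_partition(2, 4, 4, wt=1, wh=2, ww=2)         # SW: stays inside one frame
spatiotemporal = window_partition(2, 4, 4, wt=2, wh=2, ww=2)  # STW: spans both frames
```

Spatial windows never mix frames, while spatiotemporal windows cover all frames at a given spatial location.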
Conditional Generation
Three ways of injecting conditions are studied
- Cross-attention
- AdaLN-LoRA
- Self-conditioning.
for joint training, we only use SW cross-attention layers. For cross-attention we concatenate the input signal (query) with the conditioning signal (key, value) as our early experiments showed this improves performance.
Autoregressive Generation
train our model jointly on the task of frame prediction.
achieved by conditioning the model on past frames with a probability of $p_{\text{fp}}$ during training.
Video Super Resolution
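Cascaded generation: first generate at 128x128, then apply two super-resolution (SR) models. So the question is what the SR model looks like.
A depth-to-space conv is used to take the low-resolution
Experiments
Ablations first
Results on video and image benchmarks
Experiment hyperparameters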
To train the second stage transformer model, we use the default settings of 1 × 16 × 16 spatial window, 5 × 8 × 8 spatiotemporal window, psc = 0.9, c = 8 and r = 2.
The model that generates 17 × 128 × 128 has 3B parameters. It is followed by two 2× cascaded super-resolution models for 17 × 128 × 224 → 17 × 256 × 448 (L, 1.3B, p = 2) and 17 × 256 × 448 → 17 × 512 × 896 (L, 419M, p = 2) respectively.
Final Question: Will WALT be the solution to train SORA?
TextCraftor: Your Text Encoder Can be Image Quality Controller
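A distinctive angle: earlier work on fine-tuning Stable Diffusion tunes the UNet, whereas this paper tunes the text encoder, and claims that tuning UNet + text encoder works even better.
Overall pipeline
Algorithm
The algorithm seems to have a bug: t is never updated (t <- t - 1 is missing).
In summary: run DDIM sampling to get an image, compute a loss with some reward models, then backpropagate to update the text encoder. Also, the algorithm as written only computes the loss and never does the backward pass, which is rather sloppy.
To update the UNet, the text encoder has to be frozen first.
Experiments
How are the reward functions chosen? Human Preference Score v2 (HPSv2) [55], PickScore [24], and the CLIP model [37].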
Training Datasets. OpenPrompt
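10M high-quality prompts
Deep pockets. Compute used: 8 NVIDIA A100 nodes with 8 GPUs per node
The tuned text encoder can also be plugged directly into SDXL, with some improvement. It is unclear, though, why this experiment is not in Table 1/2; only visualizations are shown. In other words, this point is not backed by any quantitative metric.
Isn't Table 3 a bit tricky? It does not compare against SDXL 1.0.
Not that promising.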
sDPO: Don’t Use Your Data All at Once
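The pipeline looks like this, and it seems rather thin. Previous DPO training
used all the data at once; here a portion is used first, then another portion, training step by step.
Training setup: only two steps? No further ablations? This paper does not seem very reliable.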