Paper Reading Thread

TiankaiHang opened this issue 4 months ago · comments

Paper	Pub Time	Read Time	Where
Magvit	2023.04.05	2024.03.23	CVPR 2023
Magvit V2	2023.10.09	2024.03.23	ICLR 2024
SDXS	2024.03.26	2024.03.26	Arxiv 2024
NaViT	2023.07	2024.03.26	Arxiv 2024

Tiankai Hang · Answer 1 · Sat Mar 23 2024 21:29:47 GMT+0800 (China Standard Time)

Magvit V2

LANGUAGE MODEL BEATS DIFFUSION — TOKENIZER IS KEY TO VISUAL GENERATION

Core hypothesis of the paper:

Why do language models lag behind diffusion models in visual generation? Because they lack a good visual representation.

Advantages of discrete visual tokens

Compatibility with LLMs
Compressed representation
Visual understanding benefits

Two techniques

A novel lookup-free quantization method enables the learning of a large vocabulary that is able to improve generation quality of the language model.
Modifications to the tokenizer that not only enhance generation quality but also enable the tokenization of both images and videos using a shared vocabulary.

Tasks

Visual generation, video compression, and action recognition

Method

LOOKUP-FREE QUANTIZER

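An observation about VQ-VAE: a larger codebook is not always better. Growing the codebook can even hurt generation quality.

One trick is to shrink the code dimension while growing the codebook.

Lookup-Free Quantization (LFQ). Motivated by the above, the codebook $\mathbb{C} \subset \mathbb{R} ^{K \times d}$ is replaced with an integer set with $|\mathbb{C}| = K$.

How exactly is it designed?
The feature vector is $\mathbf{z} \in \mathbb{R} ^ {\log _ 2 K}$, and each dimension is quantized independently: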

$$ q (\mathbf{z} _ i ) = C _ {i, j}, j = \arg \min _ {k} \lVert \mathbf{z} _ i - C _ {i, k} \rVert $$

$C _ i$ is the set $\{-1, 1\}$, so the nearest-code lookup reduces to taking the sign of $\mathbf{z} _ i$.
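A minimal sketch of the per-dimension quantization above, in plain Python. The function name `lfq_quantize`, the vocabulary size, and the input values are made up for illustration. Reading the positive dimensions as bits of an integer token index is an assumed convention, not something these notes state.

```python
import math


def lfq_quantize(z):
    """Quantize each dimension of z independently to the nearest value in {-1, 1}."""
    # With C_i = {-1, 1}, argmin_k |z_i - C_{i,k}| is just the sign of z_i
    # (ties at 0 are sent to -1 here, an arbitrary choice).
    codes = [1 if zi > 0 else -1 for zi in z]
    # Assumed convention: positive dimensions are the set bits of the token index.
    index = sum(1 << i for i, c in enumerate(codes) if c > 0)
    return codes, index


K = 2 ** 4                    # hypothetical vocabulary size
z = [0.3, -1.2, 0.7, -0.1]    # hypothetical encoder output, log2(K) dimensions
assert len(z) == int(math.log2(K))
codes, index = lfq_quantize(z)
print(codes, index)
```

No embedding table is stored or searched: the vocabulary size grows as $2 ^ d$ with the number of dimensions $d$.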

VISUAL TOKENIZER MODEL IMPROVEMENT

Joint image-video tokenization

Tiankai Hang · Answer 2 · Sat Mar 23 2024 22:23:58 GMT+0800 (China Standard Time)

Magvit

MAGVIT: Masked Generative Video Transformer

Tiankai Hang · Answer 3 · Tue Mar 26 2024 19:18:09 GMT+0800 (China Standard Time)

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

Paper

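A rare paper from Xiaomi.

It feels like a grab bag of techniques. The improvements target the UNet, the VAE, and the NFE (number of function evaluations).

VAE distillation

UNet pruning
Uses the method from BK-SDM, removing some residual and Transformer blocks.

One-step training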

feature matching warmup

Segmented Diff-Instruct

Tiankai Hang · Answer 4 · Tue Mar 26 2024 19:19:36 GMT+0800 (China Standard Time)

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Paper

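I read this paper mainly because Sora may use some of its techniques.

Method

NaViT is built on the original ViT, but in principle any ViT variant that operates on a sequence of patches can be used. To enable Patch n' Pack, the following architectural modifications are made.

Architectural changes

Masked self attention and masked pooling

Tokens of each image attend only to tokens of the same image, and pooling is likewise done only over the tokens belonging to each sample.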
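A small sketch of what this masking could look like for one packed sequence, in plain Python. The per-token `example_ids` list, the use of `-1` for padding, and both function names are assumptions for illustration and are not taken from the NaViT code.

```python
def packed_attention_mask(example_ids):
    """mask[i][j] is True when token i may attend to token j.

    Tokens attend only within their own example; padding (id -1) attends to nothing.
    """
    n = len(example_ids)
    return [
        [example_ids[i] >= 0 and example_ids[i] == example_ids[j] for j in range(n)]
        for i in range(n)
    ]


def masked_mean_pool(tokens, example_ids):
    """Average token features separately for each example, skipping padding."""
    sums, counts = {}, {}
    for feat, eid in zip(tokens, example_ids):
        if eid < 0:
            continue
        acc = sums.setdefault(eid, [0.0] * len(feat))
        for d, v in enumerate(feat):
            acc[d] += v
        counts[eid] = counts.get(eid, 0) + 1
    return {eid: [v / counts[eid] for v in acc] for eid, acc in sums.items()}


# One packed sequence: 3 tokens from image 0, 2 tokens from image 1, 1 padding token.
example_ids = [0, 0, 0, 1, 1, -1]
tokens = [[1.0], [2.0], [3.0], [10.0], [20.0], [99.0]]
mask = packed_attention_mask(example_ids)
pooled = masked_mean_pool(tokens, example_ids)
```

In a real attention layer the `False` entries would typically be turned into a large negative bias before the softmax.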

Factorized & fractional positional embeddings

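Image resolution $R \times R$, patch size $P$.

Vanilla ViT: learns 1-D positional embeddings of length $(R/P) ^ 2$. For higher resolutions they are simply interpolated.

Pix2struct: 2D absolute positional embeddings of shape [maxLen, maxLen]. Drawback: every combination of (x, y) coordinates must be seen during training.

NaViT: $x$ and $y$ are embedded separately and the two embeddings are summed. Two options are considered. Absolute embeddings map the index in [0, maxLen] directly to a D-dim vector. Fractional embeddings map the ratio $r = p / \text{sidelength}$, which lies in [0, 1], to a D-dim vector.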

Training tricks

Continuous token dropping: the token dropping rate can be varied per image.
Resolution sampling.

Implementation details

Packing is done greedily and combined with a data sampling strategy, which keeps the fraction of padding below 2%.

Packing of examples into sequences is done alongside batching. A simple greedy approach is used which adds examples to the first sequence with enough remaining space. Once no more examples can fit, sequences are filled with padding tokens, yielding the fixed sequence lengths needed for batched operations. Such simple packing algorithm can lead to a significant padding, depending on the distribution of length of inputs. There are several methods to address such limitations, like bin packing (Krell et al., 2021), which allows minimizing the padding. Here, in NaViT, since controlling the resolutions we sample, we can ensure efficient packing by tuning the sequence length and limit padding to less than 2%.
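The greedy rule quoted above (add each example to the first sequence with enough remaining space, then pad) can be sketched as follows. The function name, the token counts, and the sequence length are made up for illustration; this is not the NaViT implementation.

```python
def greedy_pack(lengths, seq_len):
    """First-fit packing: put each example into the first sequence that has room."""
    sequences = []   # sequences[i] holds the example lengths placed in sequence i
    remaining = []   # remaining[i] is the free space left in sequence i
    for n in lengths:
        if n > seq_len:
            raise ValueError("example longer than the sequence length")
        for i, free in enumerate(remaining):
            if n <= free:
                sequences[i].append(n)
                remaining[i] -= n
                break
        else:
            # No existing sequence has room, so open a new one.
            sequences.append([n])
            remaining.append(seq_len - n)
    total = len(sequences) * seq_len
    pad_fraction = sum(remaining) / total if total else 0.0
    return sequences, pad_fraction


lengths = [64, 144, 100, 196, 36, 256, 81, 49, 98]  # hypothetical token counts per image
seqs, pad_frac = greedy_pack(lengths, seq_len=256)
print(seqs, pad_frac)
```

With arbitrary lengths like these the padding fraction is far above 2%, which illustrates why the quoted passage relies on controlling the sampled resolutions and tuning the sequence length.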

Tiankai Hang · Answer 5 · Wed Mar 27 2024 19:57:42 GMT+0800 (China Standard Time)

Photorealistic Video Generation with Diffusion Models

aka W.A.L.T

Task: Video Generation

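The pipeline is fairly simple. A causal encoder first compresses the video into a latent code, then a diffusion model with a transformer backbone models the distribution of that latent code. The key questions are how the encoder is implemented and how the transformer is designed for training.

Learning Visual Tokens

A video sequence $\mathbf{x} \in \mathbb{R} ^ {(1 + T) \times H \times W \times C}$ is mapped to $\mathbf{z} \in \mathbb{R} ^ {(1 + t) \times h \times w \times c}$, with compression in both space and time. The first frame is compressed independently.

The architecture is taken directly from Magvit v2.

This is how the original paper describes the causal CNN of Magvit v2; it is actually a bit hard to follow.

My understanding: when a regular 3D conv processes a frame, it looks at some earlier frames and also at some later frames, whereas here it is forced to look only at earlier frames. Calling it a shifted 3D conv would not be unreasonable. Another benefit of this operation is that the first frame can be processed on its own.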
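A toy 1-D version of this idea along the time axis only, in plain Python. Padding with zeros and the scalar "frames" are simplifying assumptions; the real layer is a 3D convolution and its exact padding scheme is not given in these notes.

```python
def temporal_conv(frames, kernel, causal):
    """Convolve scalar frames over time, keeping the output length equal to the input.

    causal=True  pads k-1 zeros in front, so output t only sees frames <= t.
    causal=False pads on both sides, so output t also sees future frames.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(frames) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


kernel = [1.0, 1.0, 1.0]
video_a = [1.0, 2.0, 3.0, 4.0]
video_b = [1.0, 9.0, 9.0, 9.0]   # same first frame, different later frames

causal_a = temporal_conv(video_a, kernel, causal=True)
causal_b = temporal_conv(video_b, kernel, causal=True)
regular_a = temporal_conv(video_a, kernel, causal=False)
regular_b = temporal_conv(video_b, kernel, causal=False)

# With causal padding the first output depends on the first frame alone.
print(causal_a[0] == causal_b[0], regular_a[0] == regular_b[0])
```

This is why a single image can be treated as a one-frame video: its output never depends on frames that come after it.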

Other improvements

In the downsampler, average pooling is replaced with strided conv.
Upsampler in the decoder: nearest resizing is replaced with a conv followed by a depth-to-space operator.
Temporal downsampling is deferred to the last few encoder blocks.
The downsampling layer in the discriminator now utilizes 3D blur pooling to encourage shift invariance.
Add one adaptive group normalization layer before the residual blocks at each resolution in the decoder to pass in the quantized latents as the control signal following StyleGAN.

About depth-to-space

DepthToSpace rearranges (permutes) data from depth into blocks of spatial data. This is the reverse transformation of SpaceToDepth. More specifically, this op outputs a copy of the input tensor where values from the depth dimension are moved in spatial blocks to the height and width dimensions.
Input tensor of [N,C,H,W], Output tensor of [N, C/(blocksize * blocksize), H * blocksize, W * blocksize].

Judging from this description, it looks like this is just PixelShuffle?
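To check the shape arithmetic in the description above, here is a plain-Python sketch for a single sample (no batch dimension). It uses the channel ordering I understand PixelShuffle to use, where input channel `c * r * r + i * r + j` lands at spatial offset `(i, j)` of output channel `c`. Other depth-to-space implementations may order the channels differently, so treat the ordering as an assumption.

```python
def depth_to_space(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channels must be divisible by blocksize squared"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]   # one input channel
                for y in range(h):
                    for xx in range(w):
                        # Each input pixel expands into an r x r block in the output.
                        out[c][y * r + i][xx * r + j] = src[y][xx]
    return out


x = [[[0]], [[1]], [[2]], [[3]]]   # 4 channels, 1 x 1 spatial
y = depth_to_space(x, r=2)         # 1 channel, 2 x 2 spatial
print(y)
```

The output shape matches the quoted formula: C/(blocksize * blocksize) channels, with height and width each multiplied by blocksize.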

Some ablations on tokenizer design

One difference between WALT and the original Magvit v2: because a diffusion model is used downstream, all latent codes here are continuous.

Learning to Generate Images and Videos

After the previous section (Learning Visual Tokens), we have a tensor $\mathbf{z} \in \mathbb{R} ^ {(1 + t) \times h \times w \times c}$.

Patchify

Similar to ViT, followed by learnable position embeddings.

We use learnable positional embeddings [73], which are the sum of space and time positional embeddings.
Position embeddings are added to the linear projections [18] of the patches. Note that for images, we simply add the temporal position embedding corresponding to the first latent frame.

Window attention

Two kinds:

Spatial Window (SW) attention, which operates within each frame, with window size $1 \times h _ p \times w _ p$
Spatiotemporal Window (STW) attention, with window size $(1 + t) \times h' _ p \times w' _ p$
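A rough sketch of how the two window types partition a latent grid, in plain Python. The 5 × 16 × 16 grid is a hypothetical size chosen so that the 1 × 16 × 16 and 5 × 8 × 8 window shapes quoted in the hyperparameter section divide it evenly. `partition_windows` is an illustrative helper, not code from the paper.

```python
def partition_windows(T, H, W, wt, wh, ww):
    """Group token coordinates (t, y, x) of a T x H x W grid into non-overlapping windows."""
    assert T % wt == 0 and H % wh == 0 and W % ww == 0
    windows = {}
    for t in range(T):
        for y in range(H):
            for x in range(W):
                key = (t // wt, y // wh, x // ww)   # which window this token falls in
                windows.setdefault(key, []).append((t, y, x))
    return list(windows.values())


T, H, W = 5, 16, 16                          # hypothetical latent grid, (1 + t) x h x w
sw = partition_windows(T, H, W, 1, 16, 16)   # spatial windows: one frame each
stw = partition_windows(T, H, W, 5, 8, 8)    # spatiotemporal windows: all frames, local in space
print(len(sw), len(sw[0]), len(stw), len(stw[0]))
```

Attention is then computed only among tokens that share a window, so the cost scales with the window size rather than with the full sequence length.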

Detailed description

Conditional Generation

Three ways of injecting conditioning are studied:

Cross-attention
AdaLN-LoRA
Self-conditioning

For joint training, we only use SW cross-attention layers. For cross-attention we concatenate the input signal (query) with the conditioning signal (key, value) as our early experiments showed this improves performance.

AdaLN-LoRA:

Self-conditioning:

Autoregressive Generation

Train the model jointly on the task of frame prediction.
This is achieved by conditioning the model on past frames with a probability of $p _ {\text{fp}}$ during training.

Video Super Resolution

Cascaded generation: first generate at 128x128, then apply two super-resolution (SR) models. So the question is, what do the SR models look like?

A depth-to-space conv upsamples the low-resolution $\mathbf{z} ^ {\text{lr}}$. To reduce the discrepancy and improve robustness, a noise level $t _ {sr} \sim \mathcal{U} (0, t _ {\text{max}})$ is injected via AdaLN-LoRA.

Experiments

Ablations first

Results on video and image benchmarks

Experimental hyperparameters

To train the second stage transformer model, we use the default settings of 1 × 16 × 16 spatial window, 5 × 8 × 8 spatiotemporal window, psc = 0.9, c = 8 and r = 2.

The model that generates 17 × 128 × 128 has 3B parameters, followed by
two 2× cascaded super-resolution models for 17 × 128 × 224 → 17 × 256 × 448 (L, 1.3B, p = 2) and 17 × 256 × 448 → 17 × 512 × 896 (L, 419M, p = 2) respectively.

Final Question: Will WALT be the solution to train SORA?

Tiankai Hang · Answer 6 · Fri Mar 29 2024 16:21:40 GMT+0800 (China Standard Time)

TextCraftor: Your Text Encoder Can be Image Quality Controller

Paper

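A distinctive angle: previous work on fine-tuning Stable Diffusion tuned the UNet, whereas this paper proposes tuning the text encoder, and reports that tuning the UNet together with the text encoder works even better.

Overall pipeline

Algorithm

The algorithm seems to have a bug: t is never updated, i.e. t <- t - 1 is missing.

In summary: sample an image with DDIM, compute a loss with some reward models, then backpropagate to update the text encoder. Moreover, the listing only computes the loss and never performs the backward step, which is rather sloppy.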
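A toy scalar analogue of that loop, with the timestep update and the backward step written out explicitly. Everything here is a stand-in and an assumption: `theta` plays the role of the text-encoder parameters, the linear update plays the role of one deterministic sampling step, the squared error plays the role of a negative reward, and the gradient is derived by hand instead of by autograd. It is not the paper's algorithm.

```python
def sample(theta, x_T, steps, a=0.5, b=0.1):
    """Deterministic toy sampler conditioned on theta."""
    x, t = x_T, steps
    while t > 0:
        x = a * x + b * theta   # stand-in for one denoising step
        t = t - 1               # the update the notes say is missing
    return x


def loss_and_grad(theta, x_T, steps, target, a=0.5, b=0.1):
    """Toy reward loss and its gradient with respect to theta."""
    x0 = sample(theta, x_T, steps, a, b)
    loss = (x0 - target) ** 2                            # stand-in for -reward(x0)
    dx0_dtheta = b * sum(a ** k for k in range(steps))   # chain rule through every step
    return loss, 2.0 * (x0 - target) * dx0_dtheta


theta, lr = 0.0, 5.0
losses = []
for _ in range(50):
    loss, grad = loss_and_grad(theta, x_T=1.0, steps=4, target=1.0)
    losses.append(loss)
    theta -= lr * grad          # the "backward and update" step
print(losses[0], losses[-1])
```

The point is only the control flow: sample with an explicit timestep decrement, score the result, and push the gradient of that score back into the conditioning parameters.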

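To update the UNet, the text encoder has to be fixed first.

Experiments

How is the reward function chosen? Human Preference Score v2 (HPSv2) [55], PickScore [24], and the CLIP model [37].

Training datasets: OpenPrompt, 10M high-quality prompts.

They have deep pockets. Compute used: 8 NVIDIA A100 nodes with 8 GPUs per node.

The tuned text encoder can also be plugged directly into SDXL, with some improvement. For some reason this result is not reported in Table 1/2, and only visualizations are shown. In other words, this point is not backed by any quantitative metric.

Is Table 3 a bit tricky? It does not compare against SDXL 1.0.

Not that promising.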

Tiankai Hang · Answer 7 · Fri Mar 29 2024 16:29:17 GMT+0800 (China Standard Time)

sDPO: Don’t Use Your Data All at Once

Paper

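The pipeline looks like this, and it also seems rather thin: previously DPO was trained on all the data at once, whereas here training uses one portion of the data first, then another portion, step by step.

Training setup: only two steps were used? No further ablations? This paper does not seem very solid.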