TiankaiHang / blog

For self learning

Paper Reading Thread

TiankaiHang opened this issue · comments

| Paper | Pub Time | Read Time | Where |
| --- | --- | --- | --- |
| Magvit | 2023.04.05 | 2024.03.23 | CVPR 2023 |
| Magvit V2 | 2023.10.09 | 2024.03.23 | ICLR 2024 |
| SDXS | 2024.03.26 | 2024.03.26 | Arxiv 2024 |
| NaViT | 2023.07 | 2024.03.26 | Arxiv 2024 |

Magvit V2

LANGUAGE MODEL BEATS DIFFUSION — TOKENIZER IS KEY TO VISUAL GENERATION

Core hypothesis of the paper:

Why do language models lag behind diffusion models in visual generation? Because they lack a good visual representation.

Advantages of discrete visual tokens

  • Compatibility with LLMs
  • Compressed representation
  • Visual understanding benefits

Two techniques

  1. A novel lookup-free quantization method that enables learning a large vocabulary, which improves the generation quality of the language model.
  2. Modifications to the tokenizer that not only enhance generation quality but also enable tokenizing both images and videos with a shared vocabulary.

Tasks

visual generation, video compression, and action recognition

Method

LOOKUP-FREE QUANTIZER

An observation about VQ-VAE: a larger codebook is not always better; past a certain point it can even hurt generation performance.

[figure]

One trick is to enlarge the codebook while reducing the dimensionality of each code.

Lookup-Free Quantization (LFQ). Motivated by the above, the codebook $\mathbb{C} \subset \mathbb{R} ^{K \times d}$ is replaced with an integer set of size $|\mathbb{C}| = K$.

How is it designed?
The feature vector $\mathbf{z} \in \mathbb{R} ^ {\log _ 2 K}$ is quantized dimension by dimension:

$$ q (\mathbf{z} _ i ) = C _ {i, j}, j = \arg \min _ {k} \lVert \mathbf{z} _ i - C _ {i, k} \rVert $$

where $C _ i = \{-1, 1\}$.
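A minimal PyTorch sketch of this dimension-wise quantization (illustrative only; straight-through gradients and the training losses are omitted):

```python
import torch

def lfq_quantize(z):
    """Lookup-free quantization: quantize each dimension of z to {-1, +1}
    and derive the integer token index in [0, K).

    z: (..., log2(K)) continuous features.
    """
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))  # sign per dimension
    bits = (q > 0).long()                                            # {-1, +1} -> {0, 1}
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    token_index = (bits * weights).sum(dim=-1)                       # binary code -> token id
    return q, token_index

z = torch.randn(4, 10)               # log2(K) = 10, i.e. vocabulary size K = 1024
codes, ids = lfq_quantize(z)
print(codes.shape, ids.shape, int(ids.max()) < 1024)
```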

VISUAL TOKENIZER MODEL IMPROVEMENT

Joint image-video tokenization

[figure]

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

Paper

A rare paper from Xiaomi.

[figure]

It feels like a grab bag of techniques, mainly improving the UNet, the VAE, and the NFE (number of function evaluations).

VAE distillation
[figure]

UNet pruning
Uses the method from BK-SDM, removing some residual and Transformer blocks.
[figure]

One-step training

  • Feature matching warmup (see the sketch after this list)

[figure]

  • Segmented Diff-Instruct

[figure]
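A very rough sketch of what the feature-matching warmup could look like: the one-step student output is matched to a multi-step teacher sample in the feature space of a frozen extractor. All names here (`student`, `teacher_sample`, `feature_extractor`) are placeholders, not the actual SDXS code:

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(student, teacher_sample, feature_extractor, z, cond):
    """Warmup loss: match the one-step student output to a (cached) multi-step
    teacher sample in the feature space of a frozen feature extractor."""
    x_student = student(z, cond)                          # one NFE
    feats_student = feature_extractor(x_student)
    with torch.no_grad():
        feats_teacher = feature_extractor(teacher_sample)
    return sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_student, feats_teacher))
```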

Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Paper

I read this paper mainly because Sora may have used techniques from it.

Method

NaViT is built on the original ViT, but in principle any ViT variant that operates on a sequence of patches could be used. To enable Patch n' Pack, the following architectural modifications are made.

Architectural changes

  • Masked self attention and masked pooling

[figure]

Each image's tokens attend only to tokens of the same image, and pooling is likewise done only over the tokens belonging to each sample.
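A minimal sketch of how such a block-diagonal mask can be built from per-token image ids in a packed sequence (illustrative, not NaViT's actual code):

```python
import torch

def packed_attention_mask(image_ids):
    """Build a block-diagonal self-attention mask for a packed sequence.

    image_ids: (seq_len,) integer id of the source image for each token
               (padding tokens use a reserved id such as -1).
    Returns a (seq_len, seq_len) boolean mask, True where attention is allowed.
    """
    ids = image_ids.unsqueeze(0)                   # (1, seq_len)
    mask = ids == ids.t()                          # token i may attend to token j of the same image
    mask &= (image_ids != -1).unsqueeze(0)         # never attend to padding tokens
    return mask

image_ids = torch.tensor([0, 0, 0, 1, 1, -1, -1])  # two packed images plus padding
print(packed_attention_mask(image_ids).int())
```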

  • Factorized & fractional positional embeddings

For image resolution $R \times R$ and patch size $P$:

Vanilla ViT: learns 1-D positional embeddings of length $(R/P) ^ 2$; for higher resolutions they are simply interpolated.

Pix2Struct: 2-D absolute positional embeddings of size [maxLen, maxLen]; the drawback is that every combination of (x, y) coordinates must be seen during training.

NaViT maps $x$ and $y$ separately and sums the two embeddings. Two options are considered. Absolute embeddings map each integer id in $[0, \text{maxLen}]$ to a D-dim vector. Fractional embeddings map the relative coordinate $r = p / \text{sidelength}$ in $[0, 1]$ to a D-dim vector.

[figure]
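A minimal sketch of the factorized fractional variant, assuming the normalized coordinates are bucketed into a learned table (the bucket count is a made-up detail):

```python
import torch
import torch.nn as nn

class FractionalPositionalEmbedding(nn.Module):
    """Factorized positional embedding: map normalized x and y coordinates in
    [0, 1] to D-dim vectors separately and sum them."""

    def __init__(self, dim, num_buckets=128):
        super().__init__()
        self.num_buckets = num_buckets
        self.x_emb = nn.Embedding(num_buckets, dim)
        self.y_emb = nn.Embedding(num_buckets, dim)

    def forward(self, x_frac, y_frac):
        # x_frac, y_frac: (num_patches,) coordinates normalized by the image side length
        x_idx = (x_frac * (self.num_buckets - 1)).round().long()
        y_idx = (y_frac * (self.num_buckets - 1)).round().long()
        return self.x_emb(x_idx) + self.y_emb(y_idx)

pe = FractionalPositionalEmbedding(dim=768)
x = torch.tensor([0.0, 0.5, 1.0])
y = torch.tensor([0.0, 0.5, 1.0])
print(pe(x, y).shape)  # torch.Size([3, 768])
```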

Training techniques

  • Continuous token dropping: the token dropping rate can be varied per image.
  • Resolution sampling.
    [figure]

Implementation details

Packing is implemented greedily, combined with a data sampling strategy that keeps the fraction of padding below 2%.

Packing of examples into sequences is done alongside batching. A simple greedy approach is used which adds each example to the first sequence with enough remaining space. Once no more examples can fit, sequences are filled with padding tokens, yielding the fixed sequence lengths needed for batched operations. Such a simple packing algorithm can lead to significant padding, depending on the distribution of input lengths. There are methods to address this, such as bin packing (Krell et al., 2021), which minimizes padding. Here, since NaViT controls the resolutions it samples, efficient packing can be ensured by tuning the sequence length, limiting padding to less than 2%.
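A minimal sketch of this first-fit greedy packing (the sequence length and the example lengths are placeholders):

```python
def greedy_pack(example_lengths, max_seq_len):
    """First-fit greedy packing: each example goes into the first sequence
    that still has enough remaining space; leftover space becomes padding.

    example_lengths: number of patch tokens per example.
    Returns a list of lists of example indices, one list per packed sequence.
    """
    sequences, remaining = [], []
    for idx, length in enumerate(example_lengths):
        for seq_id, space in enumerate(remaining):
            if length <= space:
                sequences[seq_id].append(idx)
                remaining[seq_id] -= length
                break
        else:  # no existing sequence had room: start a new one
            sequences.append([idx])
            remaining.append(max_seq_len - length)
    return sequences

packed = greedy_pack([196, 64, 256, 49, 100], max_seq_len=320)
print(packed)  # [[0, 1, 3], [2], [4]]; padding fills the rest of each sequence
```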

Photorealistic Video Generation with Diffusion Models

aka W.A.L.T

Task: Video Generation

[figure]

The pipeline is fairly clean: a causal encoder first compresses the video into latent codes, and a diffusion model with a transformer backbone then models the distribution of those latent codes. The key questions are how the encoder is implemented and how the transformer is designed for training.

Learning Visual Tokens

A video sequence $\mathbf{x} \in \mathbb{R} ^ {(1 + T) \times H \times W \times C}$ is mapped to $\mathbf{z} \in \mathbb{R} ^ {(1 + t) \times h \times w \times c}$, compressed in both space and time. The first frame is compressed independently.

The architecture directly follows Magvit v2.
[figure]

The original description of Magvit v2's causal CNN reads as follows, and it is a bit hard to parse:
[figure]

My understanding: when a regular 3D conv operates on a given frame, it looks at some earlier frames and also some later frames; here it is forced to look only at earlier frames. Calling it a shifted 3D conv would also be reasonable. A further benefit of this design is that the first frame can be processed on its own.
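A minimal sketch of a temporally causal 3D conv under that reading, i.e. padding only on the past side of the time axis:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D conv that is causal along time: the output at frame t only depends
    on frames <= t. Spatial dims are padded symmetrically."""

    def __init__(self, in_ch, out_ch, kernel_size=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel_size
        self.time_pad = kt - 1                      # pad only on the past side
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x):
        # x: (B, C, T, H, W); zero-pad the past side of the time axis
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

x = torch.randn(1, 8, 5, 16, 16)    # 1 + T = 5 frames
print(CausalConv3d(8, 8)(x).shape)  # (1, 8, 5, 16, 16): frame count preserved
```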

Some other improvements

  • Replace average pooling in the downsampling layers with strided convolutions.
  • In the decoder's upsampler, replace nearest resizing with a conv followed by a depth-to-space operator.
  • Defer temporal downsampling to the last few blocks of the encoder.
  • The downsampling layers in the discriminator use 3D blur pooling to encourage shift invariance.
  • Add an adaptive group normalization layer before the residual blocks at each resolution in the decoder to pass in the quantized latents as a control signal, following StyleGAN.

About depth-to-space

DepthToSpace rearranges (permutes) data from depth into blocks of spatial data. This is the reverse transformation of SpaceToDepth. More specifically, this op outputs a copy of the input tensor where values from the depth dimension are moved in spatial blocks to the height and width dimensions.
Input tensor of [N,C,H,W], Output tensor of [N, C/(blocksize * blocksize), H * blocksize, W * blocksize].

From this description, it appears to be the same as PixelShuffle?
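Essentially yes; PyTorch's `pixel_shuffle` performs this rearrangement (it matches the CRD ordering of ONNX DepthToSpace; the DCR ordering differs only in how the channels are grouped). A quick check:

```python
import torch
import torch.nn.functional as F

def depth_to_space_crd(x, block):
    """DepthToSpace in CRD ordering, written out with reshape/permute."""
    n, c, h, w = x.shape
    x = x.view(n, c // (block * block), block, block, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3)                    # interleave blocks into H and W
    return x.reshape(n, c // (block * block), h * block, w * block)

x = torch.randn(2, 16, 8, 8)
out = depth_to_space_crd(x, block=2)
print(torch.equal(out, F.pixel_shuffle(x, 2)))  # True
```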

Some ablations on the tokenizer design:
[figure]

One difference between WALT and the original Magvit v2: because diffusion is used downstream, all latent codes here are continuous.

Learning to Generate Images and Videos

After the Learning Visual Tokens stage above, we have a tensor $\mathbf{z} \in \mathbb{R} ^ {(1 + t) \times h \times w \times c}$.

Patchify

Similar to the patchify operation in ViT, followed by learnable position embeddings.

We use learnable positional embeddings [73], which are the sum of space and time positional embeddings.
Position embeddings are added to the linear projections [18] of the patches. Note that for images, we simply add the temporal position embedding corresponding to the first latent frame.
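A minimal sketch of summed space and time positional embeddings with the image special case (shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class SpaceTimePositionalEmbedding(nn.Module):
    """Learnable positional embeddings as the sum of a spatial and a temporal
    table. For images, only the temporal embedding of the first latent frame
    is added."""

    def __init__(self, t, h, w, dim):
        super().__init__()
        self.space = nn.Parameter(torch.zeros(1, 1, h * w, dim))
        self.time = nn.Parameter(torch.zeros(1, 1 + t, 1, dim))

    def forward(self, patches, is_image=False):
        # patches: (B, 1 + t, h * w, dim) for video, (B, 1, h * w, dim) for images
        time = self.time[:, :1] if is_image else self.time
        return patches + self.space + time

emb = SpaceTimePositionalEmbedding(t=4, h=16, w=16, dim=512)
video = torch.randn(2, 5, 256, 512)
image = torch.randn(2, 1, 256, 512)
print(emb(video).shape, emb(image, is_image=True).shape)
```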

Window attention

[figure]

Two kinds:

  • Spatial Window (SW) attention: operates within each frame, window size $1 \times h _ p \times w _ p$.
  • Spatiotemporal Window (STW) attention: window size $(1 + t) \times h' _ p \times w' _ p$.
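A reshape-only sketch of how the two window types partition the $(1 + t) \times h \times w$ latent grid for attention (not the actual model code); the example sizes match the defaults quoted in the experiments section:

```python
import torch

def window_partition(z, window):
    """Split a latent grid into non-overlapping attention windows.

    z:      (B, T, H, W, D) latent tokens, with T = 1 + t.
    window: (wt, wh, ww); e.g. (1, hp, wp) for spatial windows,
            (T, hp', wp') for spatiotemporal windows.
    Returns (num_windows * B, wt * wh * ww, D) token groups for attention.
    """
    B, T, H, W, D = z.shape
    wt, wh, ww = window
    z = z.view(B, T // wt, wt, H // wh, wh, W // ww, ww, D)
    z = z.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return z.view(-1, wt * wh * ww, D)

z = torch.randn(2, 5, 16, 16, 64)                 # (1 + t) = 5, h = w = 16
print(window_partition(z, (1, 16, 16)).shape)     # spatial windows: (10, 256, 64)
print(window_partition(z, (5, 8, 8)).shape)       # spatiotemporal windows: (8, 320, 64)
```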

Detailed description:
[figure]

Conditional Generation

Three ways of injecting the condition are studied:

  • Cross-attention
  • AdaLN-LoRA
  • Self-conditioning.

For joint training, only SW cross-attention layers are used. For cross-attention, the input signal (query) is concatenated with the conditioning signal (key, value), as early experiments showed this improves performance.

AdaLN-LoRA:
[figure]

Self-conditioning.
[figure]

Autoregressive Generation

The model is trained jointly on the task of frame prediction, achieved by conditioning it on past frames with probability $p _ {\text{fp}}$ during training.
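A minimal sketch of this conditioning scheme, assuming it amounts to keeping the first latent frame(s) as clean context with probability $p _ {\text{fp}}$ (the details are my guess):

```python
import torch

def sample_frame_prediction_mask(batch_size, num_latent_frames, p_fp, num_past=1):
    """With probability p_fp, condition on (keep as clean context) the first
    `num_past` latent frames; otherwise generate all frames from scratch.

    Returns a boolean mask of shape (batch_size, num_latent_frames),
    True where a frame is given as conditioning.
    """
    use_fp = torch.rand(batch_size) < p_fp
    mask = torch.zeros(batch_size, num_latent_frames, dtype=torch.bool)
    mask[use_fp, :num_past] = True
    return mask

print(sample_frame_prediction_mask(batch_size=4, num_latent_frames=5, p_fp=0.5))
```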

Video Super Resolution

Cascaded generation: first generate at 128x128, then apply two super-resolution (SR) models. The question is what the SR model looks like.

The low-resolution latent $\mathbf{z} ^ {\text{lr}}$ is upsampled with a depth-to-space conv. To reduce the train-test discrepancy and improve robustness, a noise level $t _ {sr} \sim \mathcal{U} (0, t _ {\text{max}})$ is applied, and $t _ {sr}$ is injected via AdaLN-LoRA.
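A rough sketch of the low-resolution conditioning path, assuming the noise augmentation is a simple interpolation with Gaussian noise at level $t _ {sr}$ (the AdaLN-LoRA injection is omitted, and the latent is treated as 2D for simplicity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResConditioner(nn.Module):
    """Upsample the low-res latent with conv + depth-to-space, then apply
    noise augmentation with a randomly sampled level t_sr."""

    def __init__(self, ch, scale=2, t_max=0.5):
        super().__init__()
        self.t_max = t_max
        self.scale = scale
        self.conv = nn.Conv2d(ch, ch * scale * scale, kernel_size=3, padding=1)

    def forward(self, z_lr):
        z_up = F.pixel_shuffle(self.conv(z_lr), self.scale)   # depth-to-space upsample
        t_sr = torch.rand(z_up.shape[0], 1, 1, 1, device=z_up.device) * self.t_max
        z_aug = (1 - t_sr) * z_up + t_sr * torch.randn_like(z_up)
        return z_aug, t_sr.flatten()   # t_sr is also fed to the model (via AdaLN-LoRA)

cond = LowResConditioner(ch=8)
z_aug, t_sr = cond(torch.randn(2, 8, 32, 56))
print(z_aug.shape, t_sr.shape)  # (2, 8, 64, 112), (2,)
```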

Experiments

Ablations first:

[figure]

Results on video and image benchmarks:

[figure]

Experiment hyperparameters

To train the second-stage transformer model, we use the default settings of a 1 × 16 × 16 spatial window, a 5 × 8 × 8 spatiotemporal window, $p _ {sc} = 0.9$, $c = 8$ and $r = 2$.

[figure]

The model generating 17 × 128 × 128 videos has 3B parameters,
followed by two 2× cascaded super-resolution models: 17 × 128 × 224 → 17 × 256 × 448 (L, 1.3B, p = 2) and 17 × 256 × 448 → 17 × 512 × 896 (L, 419M, p = 2).

Final Question: Will WALT be the solution to train SORA?

TextCraftor: Your Text Encoder Can be Image Quality Controller

Paper

An unusual angle: previous work on finetuning Stable Diffusion tunes the UNet, while this paper argues for tuning the text encoder, and tuning the UNet + text encoder together works even better.

[figure]

Overall pipeline:

[figure]

Algorithm flow:
[figure]
The algorithm seems to have a bug: $t$ is never decremented ($t \leftarrow t - 1$).

To summarize: run DDIM sampling to get an image, compute a loss with some reward models, then backpropagate to update the text encoder. Also, the pseudocode only computes the loss and never calls backward, which is rather sloppy.

To update the UNet as well, the text encoder has to be fixed first.
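A heavily simplified sketch of the reward finetuning loop as I understand it, with gradients flowing back through DDIM sampling into the text encoder; every name here is a placeholder, not the paper's code:

```python
import torch

def textcraftor_step(text_encoder, unet, vae, ddim_sample, reward_models,
                     prompts, optimizer):
    """One update of the text encoder: sample an image with DDIM, score it
    with reward models, and backpropagate the (negative) reward.
    The UNet and VAE stay frozen; only the text encoder is optimized."""
    text_emb = text_encoder(prompts)                   # trainable path
    latents = ddim_sample(unet, text_emb)              # differentiable DDIM rollout
    images = vae.decode(latents)
    loss = -sum(reward(images, prompts) for reward in reward_models)
    optimizer.zero_grad()
    loss.backward()                                    # the step the pseudocode omits
    optimizer.step()
    return loss.item()
```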

Experiments

How are the reward functions chosen? Human Preference Score v2 (HPSv2) [55], PickScore [24], and the CLIP model [37].

Training datasets: OpenPrompt, 10M high-quality prompts.

They clearly have resources; the compute used: 8 NVIDIA A100 nodes with 8 GPUs per node.

The tuned text encoder can also be used directly with SDXL, with some improvement, but for some reason this result is not in Tables 1/2; only visualizations are shown. In other words, this point is not backed by quantitative metrics.

Table 3 also seems a bit tricky: it does not compare against SDXL 1.0.

Not that promising.

sDPO: Don’t Use Your Data All at Once

image

Paper

The pipeline looks like this. It also seems rather thin: previous DPO training uses all the data at once, whereas here the model is trained on one chunk of the data first, then another, step by step.
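A rough sketch of the stepwise idea, assuming the key point is that each step's reference model is the model aligned in the previous step (`train_dpo` is a placeholder):

```python
def stepwise_dpo(base_model, preference_chunks, train_dpo):
    """sDPO-style training: use the preference data chunk by chunk, and use the
    model aligned in the previous step as the reference model for the next step.

    train_dpo(policy_init, reference, chunk) -> newly aligned policy
    """
    reference = base_model
    policy = base_model
    for chunk in preference_chunks:          # e.g. two chunks in the paper's setup
        policy = train_dpo(policy, reference, chunk)
        reference = policy                   # next step compares against the stronger model
    return policy
```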

[figure]

Training setup: only two steps? No further ablations? This paper does not seem very solid.