- Use xPos instead of sinusoidal positional embeddings
- Use the more efficient time+condition injection from PixArt-α
- Use "register" tokens https://arxiv.org/abs/2309.16588
- Use DeepSeek MoE https://arxiv.org/abs/2401.06066
- Use a masked training objective
- Use a smooth latent space https://arxiv.org/abs/2312.04410
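
As a rough illustration of the "register" tokens item above, the sketch below appends a few extra tokens to the patch sequence before the transformer blocks and drops them again before the output head. It uses NumPy only and is not taken from `cifar10_dit.py`: the helper names `add_registers` and `drop_registers`, the sizes, and the random initialization are all assumptions, and in a real model the registers would be learnable parameters.

```python
import numpy as np

def add_registers(patch_tokens, registers):
    """Append shared register tokens to every sequence in the batch.

    patch_tokens: (batch, num_patches, dim)
    registers:    (num_registers, dim), learnable in a real model
    """
    batch = patch_tokens.shape[0]
    # Broadcast the same registers across the batch dimension.
    reg = np.broadcast_to(registers, (batch,) + registers.shape)
    return np.concatenate([patch_tokens, reg], axis=1)

def drop_registers(tokens, num_registers):
    """Discard register positions so only patch tokens reach the output head."""
    return tokens[:, :-num_registers, :] if num_registers > 0 else tokens

# Hypothetical sizes: 64 patches (e.g. a 32x32 input with 4x4 patches), 4 registers.
rng = np.random.default_rng(0)
batch, num_patches, dim, num_registers = 2, 64, 32, 4
x = rng.normal(size=(batch, num_patches, dim))
registers = rng.normal(scale=0.02, size=(num_registers, dim))

h = add_registers(x, registers)          # transformer blocks would run on this
out = drop_registers(h, num_registers)   # registers removed before unpatchify
print(h.shape, out.shape)
```

The registers take part in attention like any other token but never contribute to the predicted output, so the only extra cost is a slightly longer sequence.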

Run `cifar10_dit.py`
From bs=256 step=156k