lucidrains / lumiere-pytorch

Implementation of Lumiere, SOTA text-to-video generation from Google Deepmind, in Pytorch

Lumiere - Pytorch

Yannic's paper review

Since this paper is mostly a few key ideas layered on top of a text-to-image model, this repository will take it a step further and extend the new Karras U-net to video.

Appreciation

Install

$ pip install lumiere-pytorch

Usage

import torch
from lumiere_pytorch import MPLumiere

from denoising_diffusion_pytorch import KarrasUnet

karras_unet = KarrasUnet(
    image_size = 256,
    dim = 8,
    channels = 3,
    dim_max = 768,
)

lumiere = MPLumiere(
    karras_unet,
    image_size = 256,
    unet_time_kwarg = 'time',
    conv_module_names = [
        'downs.1',
        'ups.1',
        'downs.2',
        'ups.2',
    ],
    attn_module_names = [
        'mids.0'
    ],
    upsample_module_names = [
        'ups.2',
        'ups.1',
    ],
    downsample_module_names = [
        'downs.1',
        'downs.2'
    ]
)

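# video tensor of shape (batch, channels, frames, height, width)
noised_video = torch.randn(2, 3, 8, 256, 256)

# time conditioning, one value per batch element
time = torch.ones(2,)

denoised_video = lumiere(noised_video, time = time)

# the denoised video keeps the shape of the input
assert noised_video.shape == denoised_video.shape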
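
The strings passed as conv_module_names, attn_module_names, upsample_module_names and downsample_module_names look like dotted submodule paths of the kind PyTorch reports through named_modules(). Assuming that is how they are resolved, a quick way to find candidate names for your own network is to list them. The sketch below uses a hypothetical ToyUnet (not the structure of KarrasUnet) purely to show the naming scheme.

```python
import torch
from torch import nn

# Hypothetical toy network; ToyUnet and its layer layout are illustrative only
# and do not reflect the real structure of KarrasUnet.
class ToyUnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.downs = nn.ModuleList([nn.Conv2d(3, 3, 3, padding = 1) for _ in range(3)])
        self.mids = nn.ModuleList([nn.Identity()])
        self.ups = nn.ModuleList([nn.Conv2d(3, 3, 3, padding = 1) for _ in range(3)])

    def forward(self, x):
        for layer in (*self.downs, *self.mids, *self.ups):
            x = layer(x)
        return x

toy = ToyUnet()

# named_modules() yields (dotted_name, module) pairs; the root module has an empty name
module_names = [name for name, _ in toy.named_modules() if name]

print(module_names)

# a dotted name can be resolved back to the module it refers to
assert toy.get_submodule('downs.1') is toy.downs[1]
```

Running the same listing on the network you actually plan to wrap shows which names exist before you pass them in.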

Todo

  • add all temporal layers

    • researcher must pass in all layers for
      • conv inflation modules (stages)
      • attn inflation modules (middle)
      • temporal downsample
      • temporal upsamples
    • validate time dimension is a multiple of 2 ** (number of downsample layers)
    • validate number of downsamples == upsamples
    • at init, do a dry run with a mock tensor and assert output is the same
  • expose only temporal parameters for learning, freeze everything else

  • figure out the best way to deal with the time conditioning after temporal downsampling - instead of pytree transform at the beginning, probably will need to hook into all the modules and inspect the batch sizes

  • handle middle modules that may have output shape as (batch, seq, dim)

  • following the conclusions of Tero Karras, improvise a variant of the 4 modules with magnitude preservation

  • test out on imagen-pytorch

  • look into multi-diffusion and see if it can be turned into some simple wrapper
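
The two validation items in the list above can be sketched in plain Python. This is a minimal illustration and not code from this repository: validate_temporal_config is a hypothetical helper, it assumes each temporal downsample halves the number of frames, and it reads the time-dimension check as "the frame count must be a multiple of 2 ** (number of downsample layers)".

```python
def validate_temporal_config(num_frames, downsample_module_names, upsample_module_names):
    """Hypothetical helper, not part of lumiere-pytorch.

    Returns the frame count at the coarsest temporal resolution,
    or raises ValueError if the configuration is inconsistent.
    """
    num_downsamples = len(downsample_module_names)
    num_upsamples = len(upsample_module_names)

    # every temporal downsample needs a matching temporal upsample
    if num_downsamples != num_upsamples:
        raise ValueError(
            f'got {num_downsamples} downsamples but {num_upsamples} upsamples'
        )

    # assumption: each temporal downsample halves the number of frames
    factor = 2 ** num_downsamples

    if num_frames % factor != 0:
        raise ValueError(
            f'number of frames ({num_frames}) must be a multiple of {factor}'
        )

    return num_frames // factor

# values taken from the usage example: 8 frames, two downsamples, two upsamples
coarsest_frames = validate_temporal_config(
    num_frames = 8,
    downsample_module_names = ['downs.1', 'downs.2'],
    upsample_module_names = ['ups.2', 'ups.1'],
)

print(coarsest_frames)
```

Running checks like these at construction time, rather than inside forward, would surface a mismatched configuration before any video is processed.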

Citations

@inproceedings{BarTal2024LumiereAS,
    title   = {Lumiere: A Space-Time Diffusion Model for Video Generation},
    author  = {Omer Bar-Tal and Hila Chefer and Omer Tov and Charles Herrmann and Roni Paiss and Shiran Zada and Ariel Ephrat and Junhwa Hur and Yuanzhen Li and Tomer Michaeli and Oliver Wang and Deqing Sun and Tali Dekel and Inbar Mosseri},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:267095113}
}
@article{Karras2023AnalyzingAI,
    title   = {Analyzing and Improving the Training Dynamics of Diffusion Models},
    author  = {Tero Karras and Miika Aittala and Jaakko Lehtinen and Janne Hellsten and Timo Aila and Samuli Laine},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2312.02696},
    url     = {https://api.semanticscholar.org/CorpusID:265659032}
}

License: MIT License

