# Stable Video Diffusion Training Code
The samples in the table below were generated with `size=(512, 320)`, `motion_bucket_id=127`, `fps=7`, `noise_aug_strength=0.00`, and `generator=torch.manual_seed(111)` (a minimal inference sketch follows the table).
| Init Image | Before Fine-tuning | After Fine-tuning |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
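For reference, here is a minimal sketch of how such comparison clips could be generated with the `diffusers` `StableVideoDiffusionPipeline`, using the settings listed above; the checkpoint path and init-image filename are placeholders:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load base or fine-tuned weights (path is a placeholder).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "/path/to/weight", torch_dtype=torch.float16
)
pipe.to("cuda")

# Settings matching the samples above; "init.png" is a hypothetical init image.
image = load_image("init.png").resize((512, 320))
frames = pipe(
    image,
    width=512,
    height=320,
    motion_bucket_id=127,
    fps=7,
    noise_aug_strength=0.00,
    generator=torch.manual_seed(111),
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```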
This training configuration is for reference only: all UNet parameters were set to be trainable during training, with a learning rate of 1e-5.
```bash
accelerate launch train_svd.py \
  --pretrained_model_name_or_path=/path/to/weight \
  --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
  --max_train_steps=50000 \
  --width=512 \
  --height=320 \
  --checkpointing_steps=1000 --checkpoints_total_limit=1 \
  --learning_rate=1e-5 --lr_warmup_steps=0 \
  --seed=123 \
  --mixed_precision="fp16" \
  --validation_steps=200
```
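As a rough illustration of the full fine-tuning setup described above (all UNet parameters trainable, learning rate 1e-5), here is a minimal sketch, not the actual loop in `train_svd.py`; the weight path is a placeholder:

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

# Load the SVD UNet from the pretrained weights (path is a placeholder).
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "/path/to/weight", subfolder="unet"
)

# Make every UNet parameter trainable and optimize all of them at lr=1e-5.
unet.requires_grad_(True)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)
```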
While the codebase is functional and seems to improve video generation (maybe? 🤷), some uncertainties remain about the finer details of its implementation.
- Support text2video
- Support more conditional inputs, such as layout
Feel free to fork this repository, submit pull requests, or open issues to discuss potential changes or report bugs. With your valuable input, we can continuously improve SVD_Xtend for the community.