This project implements the text-to-video algorithm introduced in the paper "GODIVA: Generating Open-DomAin Videos from nAtural Descriptions".
Generate the ImageNet dataset with this script.
Create the moving single-digit dataset with:

```
python3 dataset/mnist_caption_single.py
```

After it finishes successfully, a file named mnist_single_gif.h5 is generated.
Create the moving double-digit dataset with:

```
python3 dataset/mnist_caption_two_digit.py
```

After it finishes successfully, a file named mnist_two_gif.h5 is generated. The dataset creation code is borrowed from Sync-DRAW and slightly modified.
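The dataset scripts render digits translating across a larger canvas and bouncing off its borders. The sketch below illustrates the idea in plain NumPy; the function name, frame count, canvas size, and step size are all illustrative defaults, not the parameters used by the repo's scripts.

```python
import numpy as np

def make_moving_digit_clip(digit, frames=16, canvas=64, step=2):
    """Translate a 28x28 digit across a square canvas, bouncing
    off the borders. Illustrative sketch only; the repo's dataset
    scripts also attach natural-language captions."""
    h, w = digit.shape
    clip = np.zeros((frames, canvas, canvas), dtype=digit.dtype)
    x, y = 0, 0
    dx, dy = step, step
    for t in range(frames):
        clip[t, y:y + h, x:x + w] = digit
        # reverse direction before the digit would leave the canvas
        if not (0 <= x + dx <= canvas - w):
            dx = -dx
        if not (0 <= y + dy <= canvas - h):
            dy = -dy
        x, y = x + dx, y + dy
    return clip
```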
Pretrain the VQ-VAE on ImageNet with:

```
python3 pretrain.py --mode train --type (original|ema_update) --train_dir <path/to/trainset> --test_dir <path/to/testset>
```
Save the checkpoint to a pretrained model file with:

```
python3 pretrain.py --mode save --type (original|ema_update)
```
Test the pretrained model with:

```
python3 pretrain.py --mode test --type (original|ema_update) --img <path/to/image>
```
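The `--type` flag selects how the VQ-VAE codebook is trained: `original` learns the codebook by gradient descent, while `ema_update` maintains it with exponential moving averages, as described in the VQ-VAE paper. Below is a minimal NumPy sketch of one EMA codebook update; the function name, argument names, and shapes are illustrative assumptions, not this repo's API.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_avg, z_e, codes,
                        decay=0.99, eps=1e-5):
    """One EMA codebook update step (the --type ema_update variant).
    codebook: (K, D) code vectors, z_e: (N, D) encoder outputs,
    codes: (N,) nearest-codebook indices. Illustrative sketch only."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[codes]                                  # (N, K)
    # running counts of how often each code is used
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(0)
    # running sum of encoder outputs assigned to each code
    embed_avg = decay * embed_avg + (1 - decay) * (one_hot.T @ z_e)
    # Laplace smoothing keeps rarely used codes from collapsing to zero
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook = embed_avg / smoothed[:, None]
    return codebook, cluster_size, embed_avg
```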
A model pretrained on ImageNet with image size 64x64, a codebook of 10000 tokens, and EMA updates is included under the models directory; it provides a pair of ImageNet-pretrained EMA-update encoder and decoder weights.
Here are some reconstruction examples:
![]() | ![]() | ![]() | ![]() |
To test the trained VQ-VAE on the Moving MNIST dataset:

```
PYTHONPATH=.:${PYTHONPATH} python3 dataset/sample_generator.py
```

The clips shown are reconstructed by the VQ-VAE.
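Reconstruction here means running frames through the VQ bottleneck: the encoder output is snapped to the nearest codebook entry, giving discrete token ids that the decoder consumes. The sketch below shows that quantization step in NumPy; the function name and shapes are illustrative, and the repo's model performs this inside the VQ-VAE itself.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map encoder outputs to discrete token ids by nearest codebook
    entry, then look the entries back up as decoder input.
    z_e: (..., D) encoder outputs, codebook: (K, D). Sketch only."""
    # squared Euclidean distance from each output to each code vector
    d = ((z_e[..., None, :] - codebook) ** 2).sum(-1)   # (..., K)
    tokens = d.argmin(-1)                               # discrete ids
    z_q = codebook[tokens]                              # decoder input
    return tokens, z_q
```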
Train GODIVA with:

```
python3 train.py --dataset (single|double) --batch_size <batch size> --checkpoint <path/to/checkpoint>
```
Test GODIVA from a checkpoint with:

```
python3 test.py --dataset (single|double) --checkpoint <path/to/checkpoint>
```