# RATV

A PyTorch implementation of *Shot Retrieval and Assembly with Text Script for Video Montage Generation*. The code is derived from the paper's supplementary material.

## Datasets

- Original (paper's) VSPD dataset: https://github.com/RATVDemo/RATV
- TikTok montage data: GCS bucket `tiktok_montage` (see Data Download below)

## Setup / Installation

Assuming the `RATV/` dir as the repo root...

### Env

```bash
pip install -r requirements.txt
```
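
If you prefer an isolated environment, a standard `venv` setup works (optional; not required by the repo):

```bash
# Optional: isolate dependencies in a virtual environment before installing.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```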

### Data Download

For VSPD:

```bash
wget -O shots.tar "https://drive.google.com/uc?export=download&id=1wL3tgbDmzHL0arkBP_9SsLrOZQuiP1HL&confirm=t"
tar -xvf shots.tar
wget -O train_captions.jsonl "https://drive.google.com/uc?export=download&id=1vFW_z8WgMYgzw_evhFKs3QyucJ6DwKzT"
wget -O test_captions.jsonl "https://drive.google.com/uc?export=download&id=1s8TdVW6gGGlQYKBYDsyUqr7zkNHjVp94"
```
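
If the direct `wget` calls fail (Google Drive sometimes interposes a confirmation page for large files), the `gdown` package is a common alternative; the file IDs are the same as above:

```bash
pip install gdown
# Same Drive file IDs as the wget commands above.
gdown 1wL3tgbDmzHL0arkBP_9SsLrOZQuiP1HL -O shots.tar
gdown 1vFW_z8WgMYgzw_evhFKs3QyucJ6DwKzT -O train_captions.jsonl
gdown 1s8TdVW6gGGlQYKBYDsyUqr7zkNHjVp94 -O test_captions.jsonl
```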

For TikTok, assuming the bucket `tiktok_montage` is accessible from the instance:

```bash
gsutil -m cp -r "gs://tiktok_montage/scenes" .
gsutil -m cp "gs://tiktok_montage/tiktok_montage_gt_4w_9k.json" .
```

### Data Embedding

The videos must be embedded before training. To do this, run `embed_videos.py`.
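
The exact CLI is defined by the script itself; a minimal sketch, assuming it reads `ENCODER_TYPE` like the other entry points and writes the `Embeddings_*` directories used below:

```bash
export ENCODER_TYPE="blip2"
# Assumption: this produces the embedding directories consumed by train.py
# and generate.py (e.g. Embeddings_blip2_f5/, Embeddings_blip2_proj_f5/).
# Check `python embed_videos.py --help` for the actual flags.
python ~/RATV/embed_videos.py
```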

## Training

Run `train.py`. For instance:

```bash
export WANDB_API_KEY="..."
export ENCODER_TYPE="blip2"
python ~/RATV/train.py \
    --data_dir ~/RATV/Embeddings_blip2_proj_f5/ \
    --json_file ~/RATV/data/train_captions.jsonl \
    --embed_dim 256 \
    --hidden_size 256 \
    --learning_rate 1e-4 \
    --loss_weight 0.5 \
    --batch_size 128 \
    --wandb_name "RATV-Full" \
    --epochs 200
```

## Generation

Run `generate.py`. For instance (the checkpoint path below comes from a particular training run; point `--transformer_path` at your own checkpoint):

```bash
export ENCODER_TYPE="blip2"
python ~/RATV/generate.py \
    --data_dir ~/RATV/Embeddings_blip2_f5/ \
    --data_dir_proj ~/RATV/Embeddings_blip2_proj_f5/ \
    --json_file ~/RATV/data/test_captions.jsonl \
    --transformer_path ~/RATV/checkpoints-glad-bee/transformer_pretrained_40.pt \
    --threshold 0.5 \
    --weight 0.75 \
    --hidden_size 768 \
    --sim_func default \
    --allow_multi_embed \
    --output_file ~/RATV/test_captions_generated.jsonl
```
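
To sanity-check the output before evaluation (assuming one JSON object per line, as the `.jsonl` extension suggests):

```bash
# Pretty-print the first few generated records. Requires Python >= 3.8
# for json.tool's --json-lines flag.
head -n 3 ~/RATV/test_captions_generated.jsonl | python -m json.tool --json-lines
```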

## Evaluation

Run `evaluate.py`. For instance:

```bash
export ENCODER_TYPE="blip2"
python ~/RATV/evaluate.py \
    --fea_root ~/RATV/Embeddings_blip2_proj_f5/ \
    --target_file ~/RATV/data/test_captions.jsonl \
    --generated_file ~/RATV/test_captions_generated.jsonl
```

**Note:** for consistency, we evaluate all outputs with BLIP2 embeddings.