StoryGPT-V


🚀 Get Started

Environment Setup

conda env create -f environment.yaml
conda activate story

External Package

# Use LAVIS BLIP2 for text-image alignment evaluation
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
cp -r lavis eval/lavis
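
As a quick sanity check, you can confirm the editable install is visible to Python (load_model_and_preprocess is LAVIS's standard entry point):

# Optional: verify LAVIS is importable after the editable install
python -c "from lavis.models import load_model_and_preprocess; print('LAVIS OK')"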

1️⃣ Data

Download the datasets and place them under data/flintstones and data/pororo; the expected layout is sketched after the links below.

FlintstonesSV: [Download]

PororoSV: [Download]
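
After downloading and extracting, the expected layout (assumed from the paths above) is:

# Assumed data layout
data/
├── flintstones/
└── pororo/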

2️⃣ Training

First Stage: Char-LDM

bash scripts/train_ldm.sh DATASET
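
For example (assuming DATASET is the dataset folder name used under data/, i.e. flintstones or pororo):

# Example: train Char-LDM on FlintstonesSV
bash scripts/train_ldm.sh flintstones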

Prepare CLIP embeddings after the first stage

bash scripts/clip.sh DATASET CKPT_PATH
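
For example, with a first-stage checkpoint (the path below is illustrative):

# Example: extract CLIP embeddings from a first-stage Char-LDM checkpoint
bash scripts/clip.sh flintstones path/to/first_stage.ckpt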

Second Stage: Align the LLM with Char-LDM (you can choose OPT or Llama2)

bash scripts/train_llm.sh DATASET LLM_CKPT
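
For example (whether LLM_CKPT expects a HuggingFace model ID or a local path depends on the script; the value below is illustrative):

# Example: second-stage alignment with OPT
bash scripts/train_llm.sh flintstones facebook/opt-6.7b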

3️⃣ Inference

First, prepare the finetuned BLIP2 weights for FlintstonesSV and PororoSV. Either finetune BLIP2 yourself or use our provided finetuned checkpoint captioner.pth under each dataset folder: BLIP2 FlintstonesSV, BLIP2 PororoSV.
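
If you use the provided checkpoints, a placement like the following should work (assumed from the data/ paths above; verify against the eval scripts):

# Assumed placement of the finetuned BLIP2 captioner checkpoints
data/flintstones/captioner.pth
data/pororo/captioner.pth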

Reproduce results using our model checkpoints:

FlintstonesSV: [First Stage] [Second Stage (OPT)] [Second Stage (Llama2)]

PororoSV: [First Stage] [Second Stage (OPT)]

To use Llama2, first download the Llama2 checkpoints from Llama2. Then, in the second-stage checkpoint folder we provide, update the "llm_model" field in both args.json and model_args.json to point to your local Llama2 folder, as sketched below.
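
One way to do this from the shell, as a sketch (requires jq; the checkpoint folder name 2nd_ckpt and the Llama2 path are illustrative):

# Point "llm_model" at your local Llama2 folder in both config files
for f in 2nd_ckpt/args.json 2nd_ckpt/model_args.json; do
  jq '.llm_model = "/path/to/llama2"' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done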

# First Stage Evaluation
bash scripts/eval.sh DATASET 1st_CKPT_PATH

# Second Stage Evaluation
bash scripts/eval_llm.sh DATASET 1st_CKPT_PATH 2nd_CKPT_PATH
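
Putting it together for FlintstonesSV with the OPT second stage (all checkpoint paths illustrative):

# Example end-to-end evaluation
bash scripts/eval.sh flintstones path/to/first_stage.ckpt
bash scripts/eval_llm.sh flintstones path/to/first_stage.ckpt path/to/second_stage_opt.ckpt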

TODO

  • Training code
  • Evaluation code
  • Finetuned BLIP2 checkpoints for evaluation
  • Model checkpoints

📕 Reference

Related repos: BLIP2, FastComposer, GILL, SAM, DAAM

Baseline code is adapted from LDM, Story-LDM, and StoryDALL-E
