LivXue / VCNLG

Vision-Controllable Natural Language Generation

Dizhan Xue, Shengsheng Qian, and Changsheng Xu.

MAIS, Institute of Automation, Chinese Academy of Sciences


Examples

(Eight example generations, shown as images in the repository.)

Introduction

  • Vision-Controllable Natural Language Generation (VCNLG) aims to continue natural language generation (NLG) following a perceived visual control.
  • The Vision-Controllable Language Model (VCLM) aligns a frozen visual encoder from BLIP, a frozen BERT textual encoder, and a trained-from-scratch or pretrained generative language model (LM); see the sketch below.
  • VCLM adopts an optional multimodal-contextual cloud knowledge retrieval to improve edge-computing AI when additional knowledge is needed.
  • VCLM adopts vision-controlled reinforcement learning to constrain the trained model to follow visual controls.

(Overview figure of the VCLM framework.)
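For concreteness, the minimal PyTorch sketch below shows the rough shape of this layout: two frozen encoders whose outputs are projected into the space of a trainable generative LM. All class names, projection layers, and tensor shapes here are hypothetical illustrations rather than the repository's actual API, and the knowledge-retrieval and reinforcement-learning components are omitted.

import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a module so it stays fixed during training (the encoders are frozen)."""
    for p in module.parameters():
        p.requires_grad = False
    return module

class VCLMSketch(nn.Module):
    """Hypothetical sketch of the VCLM layout, not the repository's implementation."""

    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module,
                 lm: nn.Module, vis_dim: int, txt_dim: int, lm_dim: int):
        super().__init__()
        self.visual_encoder = freeze(visual_encoder)  # e.g. the BLIP ViT (frozen)
        self.text_encoder = freeze(text_encoder)      # e.g. BERT (frozen)
        self.lm = lm                                  # trained-from-scratch or pretrained LM
        # hypothetical linear projections aligning both encoders with the LM space
        self.vis_proj = nn.Linear(vis_dim, lm_dim)
        self.txt_proj = nn.Linear(txt_dim, lm_dim)

    def forward(self, images: torch.Tensor, context_tokens: torch.Tensor):
        with torch.no_grad():                         # encoders receive no gradients
            v = self.visual_encoder(images)           # (batch, n_patches, vis_dim)
            t = self.text_encoder(context_tokens)     # (batch, n_tokens, txt_dim)
        # fuse the visual control and the textual context into a prefix for the LM
        prefix = torch.cat([self.vis_proj(v), self.txt_proj(t)], dim=1)
        return self.lm(prefix)                        # the LM continues the text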

Getting Started

1. Prepare the code and the environment

Clone our repository, then create a Python environment and activate it via the following commands:

git clone https://github.com/LivXue/VCNLG.git
cd VCNLG
conda env create -f environment.yml
conda activate vcnlg

We adopt the ViT pretrained by BLIP to extract visual features. Download the weights of BLIP w/ ViT-L and save the file to visual_feature_extraction/checkpoints/model_large.pth.
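If you prefer to script this step, the sketch below downloads the checkpoint to the expected path and reports its size. The URL is the one published in the BLIP repository at the time of writing and may change, so treat it as an assumption and check the BLIP README before use.

import os
import urllib.request

# Assumed checkpoint URL for BLIP w/ ViT-L, taken from the BLIP README; verify before use.
BLIP_VIT_L_URL = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth"
CKPT_PATH = "visual_feature_extraction/checkpoints/model_large.pth"

os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
if not os.path.exists(CKPT_PATH):
    urllib.request.urlretrieve(BLIP_VIT_L_URL, CKPT_PATH)

# Sanity check: report the downloaded file size.
print(CKPT_PATH, os.path.getsize(CKPT_PATH), "bytes")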

2. Prepare the datasets

VIST-E [Link]

Download SIS-with-labels.tar.gz, train_split.(0-12).tar.gz, val_images.tar.gz, and test_images.tar.gz, and unzip them into data/VIST-E.

NOTE: There should be train.story-in-sequence.json, val.story-in-sequence.json, and test.story-in-sequence.json in data/VIST-E/ and <image_id>.jpg/png files in data/VIST-E/images/.

Then, run

python visual_feature_extraction/extract_fea_img.py --input_dir data/VIST-E/images --output_dir data/VIST-E/ViT_features --device <your device>

to extract the ViT features of images.

Then, run

python data/VIST-E/prepare_data.py --images_directory data/VIST-E/ViT_features --device <your device>

to generate the story files.

Finally, run

python data/VIST-E/extract_clip_feature.py --input_dir data/VIST-E/images --output_dir data/VIST-E/clip_features

to generate clip features.
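The script above is the supported way to produce the CLIP features. Purely to illustrate the expected per-image output (<image_id>.npy), here is a minimal sketch using the openai CLIP package; the backbone choice (ViT-B/32) and the exact feature format are assumptions that may differ from the repository's script.

import glob
import os
import numpy as np
import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)    # backbone is an assumption

in_dir, out_dir = "data/VIST-E/images", "data/VIST-E/clip_features"
os.makedirs(out_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(in_dir, "*"))):
    image_id = os.path.splitext(os.path.basename(path))[0]
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)                    # (1, feature_dim)
    np.save(os.path.join(out_dir, image_id + ".npy"), feat.squeeze(0).cpu().numpy())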

NOTE: There should be story_train.json, story_val.json, story_test.json in data/VIST-E/, <image_id>.npy in data/VIST-E/ViT_features/, and <image_id>.npy in data/VIST-E/clip_features/.
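Before moving on, a small check such as the one below can confirm this layout; it is written so it can later be pointed at data/LSMDC-E/ as well.

import glob
import os
import sys

def check_layout(root: str) -> None:
    """Report whether the story files and feature directories exist under a dataset dir."""
    for name in ("story_train.json", "story_val.json", "story_test.json"):
        path = os.path.join(root, name)
        print(path, "ok" if os.path.exists(path) else "MISSING")
    for sub in ("ViT_features", "clip_features"):
        n = len(glob.glob(os.path.join(root, sub, "*.npy")))
        print(os.path.join(root, sub), f"{n} .npy files")

if __name__ == "__main__":
    check_layout(sys.argv[1] if len(sys.argv) > 1 else "data/VIST-E")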

LSMDC-E [Link]

Download the LSMDC 2021 version (task1_2021.zip, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt) and unzip the files into data/LSMDC-E.

NOTE: Due to the LSMDC agreement, we cannot share the data with any third party.

NOTE: There should be LSMDC16_annos_training_someone.csv, LSMDC16_annos_val_someone.csv, LSMDC16_annos_test_someone.csv, MPIIMD_downloadLinks.txt, MVADaligned_downloadLinks.txt in data/LSMDC-E/.

Then, merge MPIIMD_downloadLinks.txt and MVADaligned_downloadLinks.txt into a download_video_urls.txt file, fill in your LSMDC user name and password in data/LSMDC-E/generate_clips.py, and run

python data/LSMDC-E/generate_clips.py --output_path data/LSMDC-E/videos --user_name <your user name to LSMDC> --password <your password to LSMDC>

to download the videos and save resampled frames into data/LSMDC-E/videos.
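generate_clips.py is the supported path for this step. Purely as an illustration of what the download half involves, the sketch below fetches each URL listed in the merged download_video_urls.txt with HTTP basic authentication; whether the LSMDC servers accept basic auth is an assumption, and the frame resampling performed by the repository script is omitted here.

import os
import requests                      # third-party: pip install requests

URLS_FILE = "data/LSMDC-E/download_video_urls.txt"
OUT_DIR = "data/LSMDC-E/videos"
AUTH = ("<your user name to LSMDC>", "<your password to LSMDC>")

os.makedirs(OUT_DIR, exist_ok=True)
with open(URLS_FILE) as f:
    for url in (line.strip() for line in f if line.strip()):
        target = os.path.join(OUT_DIR, os.path.basename(url))
        if os.path.exists(target):   # skip clips that were already downloaded
            continue
        resp = requests.get(url, auth=AUTH, stream=True, timeout=60)
        resp.raise_for_status()
        with open(target, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)
# Frame resampling is intentionally omitted; generate_clips.py handles it.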

Then, run

python visual_feature_extraction/extract_fea_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/ViT_features --device <your device>

to extract the ViT features of video frames.

Then, run

python data/LSMDC-E/prepare_data.py --input_path data/LSMDC-E

to generate the story files.

Finally, run

python data/LSMDC-E/extract_clip_feature_video.py --input_dir data/LSMDC-E/videos --output_dir data/LSMDC-E/clip_features

to generate clip features.

NOTE: There should be story_train.json, story_val.json, story_test.json in data/LSMDC-E/, <video_id>.npy in data/LSMDC-E/ViT_features/, and <video_id>.npy in data/LSMDC-E/clip_features/.
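The layout check sketched after the VIST-E section can be reused here by pointing it at data/LSMDC-E/.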

3. (Optional) Fetch Textual Knowledge

Download the code and pretrained checkpoints of mPLUG-Owl.

Then, run our script

python mPLUG-Owl/test_onshot.py

to retrieve knowledge for the datasets.

Training and Test

Check the configs in utils/opts.py and run

python train.py --dataset <dataset>

to train the model.

Then, run

python eval.py --dataset <dataset>

to test the model.
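For example, assuming the dataset identifiers accepted by utils/opts.py match the directory names used above (verify the accepted values in the config), a full run on VIST-E would look like:

python train.py --dataset VIST-E
python eval.py --dataset VIST-E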

Launching Demo Locally

Coming soon...

Our Results

We provide the results generated by VCLM on the VIST-E and LSMDC-E test sets in results/.

License

This repository is under the BSD 3-Clause License.
