Project Page | Paper | Demo | Checkpoints

UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image and video-text data and targets image, video and audio-text downstream tasks.

Online Demos

Check out our demo on Huggingface Spaces: Spaces

General means the pretrained model before finetuning.

To easily play with our model we also provide several notebooks: VG.ipynb, VQA.ipynb, Captioning.ipynb, Video_Captioning.ipynb, and Audio_Captioning.ipynb

News

[2023.12]: paper is accepted at TMLR!
[2023.8.12]: we provide the scripts to train UnIVAL for audio/video-text tasks.
[2023.7.31]: we provide here more details to reproduce the results with UnIVAL on Visual Grounding used in our Rewarded soups work.
[2023.7.31]: Release of UnIVAL code and model weights! We will release the scripts to train and evaluate audio/video tasks later.

Table of Contents

Quantitative Results
Installation
Datasets and Checkpoints
Training and Inference
Zero-shot Evaluation
Parameter Efficient Finetuning (PEFT): Training only the linear layer
Multimodal Model Merging/Weight Interpolation
Qualitative results
Citation
Acknowledgment

Results

Here are some results on several multimodal tasks.

Task	Visual Grounding			Image Captioning	VQA	Visual Entailment	VideoQA	Video Captioning	Audio Captioning
Dataset	RefCOCO	RefCOCO+	RefCOCOg	COCO	VQA v2	SNLI-VE	MSRVTT-QA	MSRVTT	AudioCaps
Split	val/test-a/test-b	val/test-a/test-b	val-u/test-u	Karpathy test	test-dev/test-std	val/test	test	test	test
Metric	Acc.			CIDEr	Acc.	Acc.	Acc.	CIDEr	CIDEr
UnIVAL	89.1 / 91.5 / 85.2	82.2 / 86.9 / 75.3	84.7 / 85.2	137.0	77.0 / 77.1	78.2 / 78.6	43.5	60.5	71.3

Installation

Requirements

python 3.7.4
pytorch 1.13+
torchvision 0.14.1+
JAVA 1.8 (for COCO evaluation)

We recommend installing PyTorch first, before the other libraries:

git clone https://github.com/mshukor/UnIVAL.git
cd UnIVAL
pip install -r requirements.txt

Download the following model for captioning evaluation:

python -c "from pycocoevalcap.spice.spice import Spice; tmp = Spice()"

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Training and Inference

The scripts to launch pretraining, finetuning and evaluation can be found in the run_scripts/ folder. Below we provide more details. The data are stored in .tsv files whose format depends on the training task. To restore training, provide the last checkpoint checkpoint_last.pt to --restore-file and pass --reset-dataloader --reset-meters --reset-optimizer as arguments.

We use slurm to launch the training/evaluation.

Image Processing

In some datasets, the images are encoded to base64 strings. To do this transformation you can use the following code:

from PIL import Image
from io import BytesIO
import base64

img = Image.open(file_name) # path to file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data) # bytes
base64_str = base64_str.decode("utf-8") # str
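When inspecting a dataset it is often useful to go in the reverse direction as well. The sketch below is a stdlib-only variant, assuming the image file is already stored in the desired format, so that its raw bytes can be encoded without re-saving through PIL. The helper names encode_image_file and decode_image_string are hypothetical and not part of the UnIVAL code.

```python
import base64
from pathlib import Path


def encode_image_file(path):
    """Return the raw bytes of the file at `path` as a base64 str."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def decode_image_string(base64_str):
    """Return the bytes that were encoded in a base64 str."""
    return base64.b64decode(base64_str)
```

The decoded bytes can be wrapped in BytesIO and passed to Image.open to obtain a PIL image again.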

Pretraining

1. Prepare the Dataset

The pretraining TSV files are formatted as follows:

Each line contains uniq-id, image/video path, caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). These files are prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering. In addition, the folder negative_sample contains three files: all_captions.txt, object.txt and type2ans.json. The data in these files are used as negative samples for the image/video-text matching task.
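As a rough illustration of this layout, the following sketch reads one pretraining line into a small record. It assumes the eight fields are tab-separated and appear in the order listed above. The class name, attribute names and parse_pretrain_line are hypothetical and are not taken from the UnIVAL code.

```python
from dataclasses import dataclass


@dataclass
class PretrainSample:
    # Attribute names are illustrative; the order follows the description above.
    uniq_id: str
    path: str          # image or video path
    caption: str
    question: str
    answer: str
    gt_objects: str    # objects appearing in the caption or question
    dataset_name: str  # source of the data
    task_type: str     # caption, qa or visual grounding


def parse_pretrain_line(line):
    """Split one TSV line into a PretrainSample, keeping empty fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    return PretrainSample(*fields)
```

Checking the field count up front makes a malformed line fail early, before it reaches training.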

2. Pretraining

There are three scripts to train UnIVAL: unival_s1.sh for stage 1 training, initialized from BART weights; unival_s2.sh for stage 2 training, initialized from the weights after stage 1; and unival_s2_hs.sh for high-resolution training for one epoch, initialized from the weights of stage 2. For example, to launch stage 1:

cd run_scripts/pretraining
bash unival_s1.sh

Image Captioning

1. Prepare the Dataset & Checkpoints

Each image corresponds to only one caption in caption_stage1_train.tsv and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample with the following format. The fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used) and image base64 string are separated by tabs.

162365  12455   the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==

2. Finetuning

To finetune for image captioning:

cd run_scripts/caption
sh unival_caption_stage_1.sh > unival_caption_stage_1.out

3. Inference

You can use the following command for inference, after setting the right weights path:

cd run_scripts/caption/eval ; sh eval_caption.sh  # inference & evaluate

Visual Question Answering

1. Prepare the Dataset & Checkpoints

Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the other samples for training. Each line of the dataset represents a VQA sample with the following format. The fields question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL; they bring an accuracy improvement of around +0.1) and image base64 string are separated by tabs.

79459   79459   is this person wearing shorts?  0.6|!+no    house&&short&&...&&sky  /9j/4AAQS...tigZ/9k=
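The answer field in the sample above pairs a confidence with an answer string using the separator |!+. The following sketch decodes such a field into a dictionary. It assumes that several answers in one field, if present, are joined by && as the object labels are; that separator, like the function name parse_vqa_answers, is an assumption and is not confirmed by this README.

```python
def parse_vqa_answers(answer_field, answer_sep="&&", conf_sep="|!+"):
    """Map each answer string to its confidence.

    `answer_sep` is an assumption (borrowed from the object-label field);
    `conf_sep` is the separator visible in the sample line.
    """
    answers = {}
    for item in answer_field.split(answer_sep):
        conf, answer = item.split(conf_sep, 1)
        answers[answer] = float(conf)
    return answers
```

For the sample line, parse_vqa_answers("0.6|!+no") maps the answer "no" to a confidence of 0.6.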

2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy): If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance.

cd dataset/vqa_data
ln vqa_train.tsv vqa_train_1.tsv
for idx in $(seq 1 9); do shuf vqa_train_${idx}.tsv > vqa_train_$((idx + 1)).tsv; done # each file is used for an epoch
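Where shuf is not available, the same idea can be sketched in Python. The version below is a minimal illustration, not the authors' tooling. It assumes the whole TSV fits in memory, which may not hold for files that embed base64 images. The function name write_epoch_shuffles and the seed parameter are hypothetical.

```python
import random
from pathlib import Path


def write_epoch_shuffles(train_path, num_epochs=10, seed=0):
    """Write <stem>_1 ... <stem>_<num_epochs> next to `train_path`.

    File 1 is an unshuffled copy; each later file reshuffles the previous order.
    """
    train_path = Path(train_path)
    with open(train_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Make sure the last line also ends with a newline so lines never merge.
    lines = [ln if ln.endswith("\n") else ln + "\n" for ln in lines]

    rng = random.Random(seed)
    out_paths = []
    for epoch in range(1, num_epochs + 1):
        if epoch > 1:
            rng.shuffle(lines)
        out_path = train_path.with_name(
            f"{train_path.stem}_{epoch}{train_path.suffix}"
        )
        out_path.write_text("".join(lines), encoding="utf-8")
        out_paths.append(out_path)
    return out_paths
```

A fixed seed keeps the per-epoch orderings reproducible across runs.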

3. Finetuning

If you have shuffled the training data in the previous step, please correctly specify the training data path following the guide in the script comments.

cd run_scripts/vqa
bash unival_vqa.sh

4. Inference

We use beam-search during inference.

cd run_scripts/vqa/eval
bash evaluate_vqa.sh  # specify 'val' or 'test' in the script

Visual Grounding

1. Prepare the Dataset & Checkpoints

We use the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, text, region-coord (separated by commas) and image base64 string are separated by tabs.

79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
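The splitting step described above can be sketched as follows. This is an illustration under assumptions, not the preprocessing code used for the released data: the uniq-id scheme <region-id>_<index> with a 1-based index is only inferred from the sample id 79_1, the two-decimal coordinate formatting is inferred from the sample line, and the function name split_region_into_samples is hypothetical.

```python
def split_region_into_samples(region_id, image_id, texts, region_coord, image_b64):
    """Return one tab-separated line per descriptive text of a region.

    The "<region_id>_<index>" id scheme and the 1-based index are assumptions.
    """
    # region-coord is written as comma-separated values inside a single field.
    coord_str = ",".join(f"{v:.2f}" for v in region_coord)
    return [
        "\t".join([f"{region_id}_{i}", str(image_id), text, coord_str, image_b64])
        for i, text in enumerate(texts, start=1)
    ]
```

Each returned line carries the same region-coord and image string, so every text becomes an independent training sample.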

2. Finetuning

cd run_scripts/refcoco
sh unival_refcoco.sh > train_refcoco.out &  # finetune for refcoco
sh unival_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
sh unival_refcocog.sh > train_refcocog.out &  # finetune for refcocog

3. Inference

Run the following commands for the evaluation.

cd run_scripts/refcoco/eval ; sh eva_refcoco.sh  # eva_refcocog.sh, eva_refcocoplus.sh

Visual Entailment

1. Prepare the Dataset & Checkpoints

Each line of the processed dataset represents a sample with the following format. The fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise) and label are separated by tabs.

252244149.jpg#1r1n  252244149   /9j/4AAQ...MD/2Q==   a man in pink and gold is chewing on a wooden toothpick.   a man in pink is chewing a toothpick on the subway.   neutral

2. Finetuning

Contrary to previous work (e.g. OFA) we do not use the text premise for this task.

cd run_scripts/snli_ve
nohup sh unival_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve

3. Inference

Run the following command to obtain the results.

cd run_scripts/snli_ve/eval ; sh eval_snli_ve.sh  # specify 'dev' or 'test' in the script

Text-to-Image Generation

1. Prepare the Dataset & Checkpoints

The dataset zipfile coco_image_gen.zip contains coco_vqgan_train.tsv, coco_vqgan_dev.tsv and coco_vqgan_full_test.tsv. Each line of the dataset represents a sample with the following format. The fields uniq-id, image-code (produced by vqgan, a list of integers separated by single spaces) and lowercased caption are separated by tabs.

1	6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846	the people are posing for a group photo.

The checkpoint zipfile image_gen_large_best.zip contains image_gen_large_best.pt, vqgan/last.ckpt, vqgan/model.yaml and clip/Vit-B-16.pt.

2. Finetuning

We divide the finetuning process for image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss. In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization. During validation, the generated images are dumped into _GEN_IMAGE_PATH_.

cd run_scripts/image_gen
nohup sh unival_image_gen_stage_1.sh # stage 1, train with cross-entropy loss
nohup sh unival_image_gen_stage_2.sh # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization

3. Inference

Run the command below to generate your images.

cd run_scripts/image_gen/eval ; sh eval_image_gen.sh  # inference & evaluate (FID, IS and CLIP Score)

Zero-shot Evaluation

Here we provide the scripts for zero-shot evaluation on image-text tasks. You need to specify the path to the pretrained model in each of these scripts:

Image Caption on Nocaps: caption/eval/eval_nocaps.sh
VQA on VizWiz: vqa/eval/eval_vizwiz.sh
VQA on OKVQA: vqa/eval/eval_okvqa.sh

Parameter Efficient Finetuning

Training only the linear connection

Following eP-ALM, we experiment with efficient finetuning by training only the linear connection between the modality-specific encoders and the language model, while keeping all other parameters frozen:

Image Caption on COCO: caption/onlylinear/unival_caption_stage_s2_onlylinear.sh
Video Caption on MSRVTT: caption/onlylinear/unival_video_caption_stage_s2_onlylinear.sh
Audio Caption on Audiocaps: caption/onlylinear/unival_audio_caption_stage_s2_onlylinear.sh
VQA on VQAv2: vqa/onlylinear/unival_vqa_s2_onlylinear.sh
Video QA on MSRVTT: vqa/onlylinear/unival_video_vqa_s2_onlylinear.sh

To finetune the stage-1 pretrained model, you can use the scripts with s1.

Multimodal Model Merging

In this section we provide the details to reproduce the experiments for weight interpolation and different weight averaging experiments. The objective is to leverage the synergy between models finetuned on different multimodal tasks.

Weight interpolation

To average several models, you can use preprocess/average_save_models.py. There are two options: either average many models with a uniform interpolation coefficient, or interpolate between two models with an interpolation coefficient from 0 to 1. You can also customise this script as you like.
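The two options can be sketched on plain dictionaries that map parameter names to values. This is not the content of preprocess/average_save_models.py. The function names are hypothetical, the dictionaries are assumed to share the same keys, and which end of the 0 to 1 range corresponds to which model is an assumption. Plain floats are used here; values that support addition and scalar multiplication, such as tensors, should behave the same way.

```python
def average_state_dicts(state_dicts):
    """Uniform average: every model gets the coefficient 1 / len(state_dicts)."""
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}


def interpolate_state_dicts(sd_a, sd_b, lam):
    """Linear interpolation: lam = 0 returns sd_a, lam = 1 returns sd_b (assumed)."""
    return {k: (1.0 - lam) * sd_a[k] + lam * sd_b[k] for k in sd_a}


# Toy example with scalar "weights".
model_a = {"w": 1.0, "b": 2.0}
model_b = {"w": 3.0, "b": 4.0}
midpoint = interpolate_state_dicts(model_a, model_b, 0.5)
uniform = average_state_dicts([model_a, model_b])
```

With two models, uniform averaging and interpolation with a coefficient of 0.5 give the same weights.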

Once you have saved the interpolated weights, you can use the following scripts to evaluate the model:

## image-text tasks
sh caption/eval/eval_caption_avg.sh
sh refcoco/eval/eval_refcocoplus_avg.sh
sh snli_ve/eval/eval_snli_ve_avg.sh
sh vqa/eval/eval_vqa_avg.sh

## video-text tasks 
sh vqa/eval/video/eval_video_qa_avg.sh
sh caption/eval/video/eval_msrvtt_video_caption_avg.sh

Ratatouille Finetuning

For Ratatouille finetuning, each of the auxiliary models (e.g. models finetuned for captioning, VQA, visual grounding and visual entailment) is re-finetuned on the target task. At the end, all the obtained models are uniformly averaged.

The scripts to launch the finetuning and evaluation are in averaging/ratatouille/. You also need to use the weight averaging script in preprocess/average_save_models.py.

Fusing Finetuning

For Fusing finetuning, first the auxiliary models are averaged, then finetuned on the target task.

The scripts to launch the finetuning and evaluation are in averaging/fusing/.

Qualitative Results

Below we provide qualitative results for some tasks.

Visual Grounding

Image Captioning

Open-Ended VQA

Citation

If you find the work helpful, you can cite it using the following citation:

@article{
shukor2023unival,
title={Un{IVAL}: Unified Model for Image, Video, Audio and Language Tasks},
author={Mustafa Shukor and Corentin Dancette and Alexandre Rame and Matthieu Cord},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=4uflhObpcp},
note={}
}

Acknowledgment

This code is based mainly on the following repos:

We thank the authors for releasing their code.
