Official dataset and code for CompBench.
CompBench is a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench comprises around 40K image pairs collected from a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance.
Check our project page and paper for key contributions and findings.
- Dataset
- Preparing images
- Preparing question-answer pairs
- Model evaluation
Images in CompBench are collected from fourteen publicly available datasets.
Download images from [here].
Store the images into `dataset/transformed_states`.
```
dataset
├── transformed_states
│   ├── release_dataset
│   │   ├── images
│   │   │   ├── deflated ball
```
Reference: [link]
Download images from [here].
Download annotations from [here].
Store the images and the annotations into `dataset/fashionpedia`.
Run `preprocessing/fashionpedia/extract_adj_obj.py` to group images by object (e.g., dress) and adjective (e.g., argyle).
Output directory: `dataset/fashionpedia/adj_obj_folder_pairs_images`.
```
dataset
├── fashionpedia
│   ├── test
│   ├── instances_attributes_val2020.json
│   ├── adj_obj_folder_pairs_images
│   │   ├── val
│   │   │   ├── straight_pants-loose (fit)_pants
```
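The actual grouping logic lives in `extract_adj_obj.py`. Purely as a sketch of the idea, assuming the public Fashionpedia annotation layout (COCO-style `images` and `annotations` with `attribute_ids`, plus `categories` and `attributes` lists), the grouping could be approximated as:

```python
import json
from collections import defaultdict

# Hypothetical sketch, not the actual script: group Fashionpedia val images by
# (adjective, object), assuming the public COCO-style annotation layout.
with open("dataset/fashionpedia/instances_attributes_val2020.json") as f:
    data = json.load(f)

categories = {c["id"]: c["name"] for c in data["categories"]}
attributes = {a["id"]: a["name"] for a in data["attributes"]}
file_names = {img["id"]: img["file_name"] for img in data["images"]}

groups = defaultdict(set)  # (adjective, object) -> image file names
for ann in data["annotations"]:
    obj = categories[ann["category_id"]]
    for att_id in ann.get("attribute_ids", []):
        groups[(attributes[att_id], obj)].add(file_names[ann["image_id"]])

print(len(groups), "adjective-object groups")
```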
Reference: [link]
Follow [here] to download the Visual Genome (VG) part 1 and part 2 images.
Store the part 1 images into `dataset/vaw/images/VG_100K` and the part 2 images into `dataset/vaw/images/VG_100K_2`.
Download the Val annotation file (i.e., `val.json`) from [here] and store it into `dataset/vaw/data`.
Run `preprocessing/vaw/extract_imgs_vaw.py` to group images by object (e.g., hair) and adjective (e.g., wet).
Output directory: `dataset/vaw/att_obj_images`.
```
dataset
├── vaw
│   ├── images
│   ├── data
│   ├── att_obj_images
│   │   ├── val
│   │   │   ├── wet hair
```
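The actual grouping is done by `extract_imgs_vaw.py`. A minimal sketch of the same idea, assuming the public VAW format in which each instance lists an `object_name` and its `positive_attributes`:

```python
import json
from collections import defaultdict

# Hypothetical sketch, not the actual script: group VAW val instances into
# "<attribute> <object>" buckets (e.g., "wet hair").
with open("dataset/vaw/data/val.json") as f:
    instances = json.load(f)

groups = defaultdict(set)  # e.g., "wet hair" -> set of Visual Genome image ids
for inst in instances:
    for att in inst.get("positive_attributes", []):
        groups[f"{att} {inst['object_name']}"].add(inst["image_id"])

print(len(groups), "attribute-object groups")
```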
Reference: [link]
Download images and annotations from [here].
Store them into `dataset/cub_200_2011`.
Run `preprocessing/cub_200_2011/extract_imgs.py` to group images by bird species (e.g., Laysan_Albatross) and attribute (e.g., curved bill).
Output directory: `dataset/cub_200_2011/att_cls_images`.
```
dataset
├── cub_200_2011
│   ├── images
│   ├── parts
│   ├── segmentations
│   ├── attributes.txt
│   ├── att_cls_images
│   │   ├── test
│   │   │   ├── 1_2
```
Reference: [link]
Please contact the authors of Wildfish++ to obtain the pairs of images (i.e., `Pair_images`) and their annotations (i.e., `fine_grained`).
Store them into `dataset/wildfish`.
Run `preprocessing/wildfish/extract_imgs_wf.py` to group images by pairs of similar fish species (e.g., Amphiprion_akindynos-Amphiprion_chrysopterus).
Output directory: `dataset/wildfish/diff_images`.
```
dataset
├── wildfish
│   ├── Pair_images
│   ├── fine_grained
│   ├── diff_images
│   │   ├── val
│   │   │   ├── Amphiprion_akindynos-Amphiprion_chrysopterus
```
Reference: [link]
Download Dev images from [here].
Store them into `dataset/magic_brush`.
```
dataset
├── magic_brush
│   ├── dev
│   │   ├── images
```
Reference: [link]
Download scenes from [here].
Download the test annotation file (i.e., `test.json`) from [here].
Store the scenes and the annotation into `dataset/spot-the-diff`.
Run `preprocessing/spot-the-diff/extract_imgs_sd.py` to generate pairs of similar scenes.
Output directory: `dataset/spot-the-diff/pair_images`.
```
dataset
├── spot-the-diff
│   ├── resized_images
│   ├── test.json
│   ├── pair_images
│   │   ├── test
```
Reference: [link]
Download `img_align_celeba.zip` from [here]. Unzip the file and store it into `dataset/celeba`.
Download `list_eval_partition.txt` from [here] and store it into `dataset/celeba`.
Download `list_attr_celeba.txt` from [here] and store it into `dataset/celeba`.
Run `preprocessing/celeba/extract_imgs.py` to group images by adjective (e.g., smiling).
Output directory: `dataset/celeba/adj_images`.
```
dataset
├── celeba
│   ├── img_align_celeba
│   ├── list_eval_partition.txt
│   ├── list_attr_celeba.txt
│   ├── adj_images
│   │   ├── test
│   │   │   ├── Smiling
```
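The grouping is performed by `extract_imgs.py`. As a rough sketch under the standard CelebA file formats (`list_eval_partition.txt` maps each image to split 0/1/2; `list_attr_celeba.txt` has the attribute names on its second line and one row of +/-1 labels per image), it could be approximated as:

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical sketch, not the actual script: group CelebA test images
# (partition id 2) by attribute, using the standard annotation file formats.
root = Path("dataset/celeba")

partition = {}
for line in (root / "list_eval_partition.txt").read_text().splitlines():
    name, split = line.split()
    partition[name] = int(split)  # 0: train, 1: val, 2: test

lines = (root / "list_attr_celeba.txt").read_text().splitlines()
attr_names = lines[1].split()     # second line lists the 40 attribute names
groups = defaultdict(list)        # e.g., "Smiling" -> [image names]
for line in lines[2:]:
    parts = line.split()
    name, values = parts[0], parts[1:]
    if partition.get(name) != 2:  # keep test images only
        continue
    for attr, value in zip(attr_names, values):
        if value == "1":
            groups[attr].append(name)

print(len(groups["Smiling"]), "smiling test images")
```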
Reference: [link]
Download test images from [here].
Store them into `dataset/fer-2013`.
```
dataset
├── fer-2013
│   ├── test
│   │   ├── angry
```
Reference: [link]
Check `preprocessing/soccernet/download_soccernet.py` to download the videos and their labels.
Store them into `dataset/soccernet`.
Run `preprocessing/soccernet/extract_temp_imgs.py` to extract a pair of frames for each action (e.g., corner kick).
Output directory: `*/*_frames_actions`.
```
dataset
├── soccernet
│   ├── val
│   │   ├── england_epl
│   │   │   ├── 2014-2015
│   │   │   │   ├── 2015-04-11 - 19-30 Burnley 0 - 1 Arsenal
│   │   │   │   │   ├── 1_frames
│   │   │   │   │   ├── 1_frames_actions
│   │   │   │   │   ├── 2_frames
│   │   │   │   │   ├── 2_frames_actions
│   │   │   │   │   ├── Labels-v2.json
│   │   │   │   │   ├── 1_224p.mkv
│   │   │   │   │   ├── 2_224p.mkv
```
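The frame extraction is implemented in `extract_temp_imgs.py`. A minimal sketch of the idea, assuming the `Labels-v2.json` format in which each annotation carries a `label` and a `position` in milliseconds, could grab frames with OpenCV:

```python
import json
import os
import cv2  # pip install opencv-python

# Hypothetical sketch, not the actual script: save one frame per annotated action
# from the first-half video of a match, seeking by the action's position in ms.
match_dir = ("dataset/soccernet/val/england_epl/2014-2015/"
             "2015-04-11 - 19-30 Burnley 0 - 1 Arsenal")
with open(os.path.join(match_dir, "Labels-v2.json")) as f:
    annotations = json.load(f)["annotations"]

out_dir = os.path.join(match_dir, "1_frames_actions")
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(os.path.join(match_dir, "1_224p.mkv"))
for i, ann in enumerate(annotations):
    if not ann["gameTime"].startswith("1"):  # first half only for 1_224p.mkv
        continue
    cap.set(cv2.CAP_PROP_POS_MSEC, float(ann["position"]))
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(os.path.join(out_dir, f"{i}_{ann['label']}.png"), frame)
cap.release()
```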
Reference: [link]
Follow [here] to download images.
Run `preprocessing/comp_cars/extract_objects_comp_cars.py` to group vehicle images by make, model, and released year.
Output directory: `dataset/comp_cars/test_images`.
```
dataset
├── comp_cars
│   ├── data
│   │   ├── image
│   │   ├── label
│   │   ├── misc
│   │   ├── part
│   │   ├── train_test_split
│   ├── test_image
│   │   ├── make_id
│   │   │   ├── model_id
│   │   │   │   ├── released_year
```
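The grouping is done by `extract_objects_comp_cars.py`. Since CompCars already stores images as `data/image/<make_id>/<model_id>/<released_year>/`, a rough sketch (ignoring the train/test split handling the script presumably applies) could simply walk that hierarchy:

```python
import shutil
from pathlib import Path

# Hypothetical sketch, not the actual script: regroup CompCars images into
# test_image/<make_id>/<model_id>/<released_year>/ by walking the source layout.
src = Path("dataset/comp_cars/data/image")
dst = Path("dataset/comp_cars/test_image")  # directory name as shown in the tree above

for img in src.glob("*/*/*/*.jpg"):
    make_id, model_id, year = img.parts[-4], img.parts[-3], img.parts[-2]
    out_dir = dst / make_id / model_id / year
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(img, out_dir / img.name)
```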
Reference: [link]
Download the Labeled dataset from [here].
Follow [here] to convert the .mat file to images.
Download `list_test.txt` from [here].
Run `preprocessing/nyu_depth_v2/extract_objects_nyu.py` to group images by object (e.g., air conditioner).
Output directory: `dataset/nyu_depth_v2/obj_images`.
```
dataset
├── nyu_depth_v2
│   ├── image
│   ├── list_test.txt
│   ├── obj_images
│   │   ├── test
│   │   │   ├── air_conditioner
```
Reference: [link]
Download the VQAv2 data from [here].
Run `preprocessing/vqav2/extract_counting_imgs_vqav2.py` to group samples by counting questions.
Output directory: `dataset/vqav2/counting_images`.
```
dataset
├── vqav2
│   ├── images
│   │   ├── train2014
│   │   ├── val2014
│   ├── v2_mscoco_train2014_annotations.json
│   ├── v2_mscoco_val2014_annotations.json
│   ├── v2_OpenEnded_mscoco_train2014_questions.json
│   ├── v2_OpenEnded_mscoco_val2014_questions.json
│   ├── counting_images
│   │   ├── train2014
│   │   ├── val2014
```
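The sample selection is done by `extract_counting_imgs_vqav2.py`. As a rough sketch, assuming the standard VQAv2 annotation format where each annotation carries a `question_type` (e.g., "how many"), the counting samples could be collected like this:

```python
import json

# Hypothetical sketch, not the actual script: collect VQAv2 val counting samples
# by filtering annotations whose question_type starts with "how many".
with open("dataset/vqav2/v2_OpenEnded_mscoco_val2014_questions.json") as f:
    questions = {q["question_id"]: q["question"] for q in json.load(f)["questions"]}
with open("dataset/vqav2/v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

counting = [
    {"image_id": ann["image_id"],
     "question": questions[ann["question_id"]],
     "answer": ann["multiple_choice_answer"]}
    for ann in annotations
    if ann["question_type"].startswith("how many")
]
print(len(counting), "counting questions in val2014")
```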
Reference: [link]
Download Dev data from [here].
```
dataset
├── q-bench2
│   ├── q-bench2-a1-dev.jsonl
│   ├── llvisionqa_compare_dev
```
Reference: [link]
All annotated pairs are available under [here]. Concretely,
- MIT-States: `st_label`
- Fashionpedia: `fashion_label`
- VAW: `vaw_label`
- CUB-200-2011: `cub_label`
- Wildfish: `Wildfish_label`
- MagicBrush: `mb_label`
- Spot-the-diff: `spot_difference_label`
- CelebA: `celebA_label`
- FER-2013: `fer2013_label`
- SoccerNet: `soccernet_label`
- CompCars: `car_label`
- NYU-Depth V2: `depth_label`
- VQAv2: `vqav2_label`
- Q-bench2: `qbench2_label`
Each dataset has one annotated JSON file, which contains a list of dictionaries. Each dictionary represents the annotation for a pair of images.
CompCars, CelebA, CUB-200-2011, NYU-Depth V2, Fashionpedia, FER-2013, MIT-States, VAW, and VQAv2 have the following keys in the annotation:
- `image_1`: First image
- `image_2`: Second image
- `question`: Question about the relative comparison between the two images
- `answer`: Correct answer. "Right" indicates that the second image is correct and "Left" means that the first image is correct
Note: All questions in CompCars are identical, so its annotation does not have the `question` key. The question is: "Based on these images, which car is newer in terms of its model year or release year? The term 'newer' is related to the year each car was manufactured or released, not its current condition or usage."
Note: MIT-States and VAW have an additional key `type`, which can be 'Size', 'Color', 'Pattern', 'Texture', 'Shape', or 'State'. 'Size', 'Color', 'Pattern', 'Texture', and 'Shape' are common visual attributes.
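For example, a minimal evaluation loop over one of these Left/Right annotation files could look like the following (the file name and `my_model_predict` are placeholders, not part of the release):

```python
import json

# Minimal sketch: score a model's Left/Right predictions against one annotation file.
with open("cub_label.json") as f:  # hypothetical file name for the CUB annotations
    pairs = json.load(f)           # list of dicts: image_1, image_2, question, answer

def my_model_predict(image_1, image_2, question):
    """Placeholder for an MLLM call that returns 'Left' or 'Right'."""
    return "Left"

correct = 0
for pair in pairs:
    pred = my_model_predict(pair["image_1"], pair["image_2"], pair["question"])
    correct += int(pred == pair["answer"])  # answer is "Left" or "Right"

print(f"accuracy: {correct / len(pairs):.3f}")
```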
MagicBrush and Spot-the-diff consist of multiple-choice questions in which models must select one of the provided options.
- `image_1`: First image
- `image_2`: Second image
- `options`: Options related to the images
- `answer`: Correct answer
Question for MagicBrush and Spot-the-diff: "What is the most obvious difference between two images? Choose from the following options. If there is no obvious difference, choose None. Options: None,{pair['options']}. Please only return one of the options without any other words."
MagicBrush has additional keys:
- `image_1_caption`: Caption for the first image
- `image_2_caption`: Caption for the second image
- `CLIP_similarity`: CLIP similarity between the two images
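Purely as an illustration, the fixed template above can be filled in from a pair dictionary like this (the field values below are made up, and `options` is assumed to be a comma-separated string; the exact field format may differ):

```python
# Illustrative sketch: build the MagicBrush / Spot-the-diff prompt from one pair dict.
pair = {
    "image_1": "input.png",
    "image_2": "output.png",
    "options": "a red apple on the table,a green apple on the table",
    "answer": "a red apple on the table",
}

question = (
    "What is the most obvious difference between two images? "
    "Choose from the following options. If there is no obvious difference, choose None. "
    f"Options: None,{pair['options']}. "
    "Please only return one of the options without any other words."
)
print(question)
```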
Q-bench2 contains multiple-choice questions in which models must select one of the provided options. The authors of Q-bench2 have already combined the two images into a single image.
- `image`: Combined image
- `question`: Question about the relative comparison between the two images
- `options`: Options related to the images
- `answer`: Correct answer
SoccerNet has the following keys:
- `image_1`: First frame
- `image_2`: Second frame
- `answer`: Correct answer. "Right" indicates that the second frame is correct and "Left" means that the first frame is correct
- `CLIP_similarity`: CLIP similarity between the two frames
- `action`: Soccer action related to the two frames
- `match`: Soccer match related to the two frames
- `league`: Soccer league related to the two frames
Question: "These are two frames related to {pair['action']}
in a soccer match. Which frame happens first? Please only return one option from (Left, Right, None) without any other words. If these two frames are exactly the same, select None. Otherwise, choose Left if the first frame happens first and select Right if the second frame happens first."
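Analogously, an illustrative sketch of assembling the SoccerNet prompt from a pair's `action` field (field values made up):

```python
# Illustrative sketch: build the SoccerNet temporality prompt from one pair dict.
pair = {"image_1": "frame_a.png", "image_2": "frame_b.png",
        "action": "corner", "answer": "Left"}

question = (
    f"These are two frames related to {pair['action']} in a soccer match. "
    "Which frame happens first? Please only return one option from (Left, Right, None) "
    "without any other words. If these two frames are exactly the same, select None. "
    "Otherwise, choose Left if the first frame happens first and select Right "
    "if the second frame happens first."
)
print(question)
```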
If you find the code and data useful, please cite the following paper:
```bibtex
@article{kil2024compbench,
  title={CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs},
  author={Kil, Jihyung and Mai, Zheda and Lee, Justin and Wang, Zihe and Cheng, Kerrie and Wang, Lemeng and Liu, Ye and Chowdhury, Arpita and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2407.16837},
  year={2024}
}
```