Official dataset and code for CompBench.
CompBench is a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench comprises around 40K image pairs collected from a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance.
Check our project page and paper for key contributions and findings.
- Dataset
- Preparing images
- Preparing question-answer pairs
- Model evaluation
Images in CompBench are collected from fourteen publicly available datasets.
Download images from [here].
Store the images into `dataset/transformed_states`.
```
dataset
├── transformed_states
│   ├── release_dataset
│   │   ├── images
│   │   │   ├── deflated ball
```
Reference: [link]
Download images from [here].
Download annotations from [here].
Store the images and the annotations into `dataset/fashionpedia`.
Run `preprocessing/fashionpedia/extract_adj_obj.py` to group images by object (e.g., dress) and adjective (e.g., argyle).
Output directory: `dataset/fashionpedia/adj_obj_folder_pairs_images`.
```
dataset
├── fashionpedia
│   ├── test
│   ├── instances_attributes_val2020.json
│   ├── adj_obj_folder_pairs_images
│   │   ├── val
│   │   │   ├── straight_pants-loose (fit)_pants
```
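The actual grouping logic lives in `extract_adj_obj.py`. Purely as a sketch of the idea, assuming the public Fashionpedia annotation layout (COCO-style `images` and `annotations` with `attribute_ids`, plus `categories` and `attributes` lists), the grouping could be approximated as:

```python
import json
from collections import defaultdict

# Hypothetical sketch, not the actual script: group Fashionpedia val images by
# (adjective, object), assuming the public COCO-style annotation layout.
with open("dataset/fashionpedia/instances_attributes_val2020.json") as f:
    data = json.load(f)

categories = {c["id"]: c["name"] for c in data["categories"]}
attributes = {a["id"]: a["name"] for a in data["attributes"]}
file_names = {img["id"]: img["file_name"] for img in data["images"]}

groups = defaultdict(set)  # (adjective, object) -> image file names
for ann in data["annotations"]:
    obj = categories[ann["category_id"]]
    for att_id in ann.get("attribute_ids", []):
        groups[(attributes[att_id], obj)].add(file_names[ann["image_id"]])

print(len(groups), "adjective-object groups")
```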
Reference: [link]
Follow [here] to download the Visual Genome (VG) part 1 and part 2 images.
Store the part 1 images into `dataset/vaw/images/VG_100K` and the part 2 images into `dataset/vaw/images/VG_100K_2`.
Download the Val annotation file (i.e., `val.json`) from [here] and store it into `dataset/vaw/data`.
Run `preprocessing/vaw/extract_imgs_vaw.py` to group images by object (e.g., hair) and adjective (e.g., wet).
Output directory: `dataset/vaw/att_obj_images`.
```
dataset
├── vaw
│   ├── images
│   ├── data
│   ├── att_obj_images
│   │   ├── val
│   │   │   ├── wet hair
```
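The actual grouping is done by `extract_imgs_vaw.py`. A minimal sketch of the same idea, assuming the public VAW format in which each instance lists an `object_name` and its `positive_attributes`:

```python
import json
from collections import defaultdict

# Hypothetical sketch, not the actual script: group VAW val instances into
# "<attribute> <object>" buckets (e.g., "wet hair").
with open("dataset/vaw/data/val.json") as f:
    instances = json.load(f)

groups = defaultdict(set)  # e.g., "wet hair" -> set of Visual Genome image ids
for inst in instances:
    for att in inst.get("positive_attributes", []):
        groups[f"{att} {inst['object_name']}"].add(inst["image_id"])

print(len(groups), "attribute-object groups")
```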
Reference: [link]
Download images and annotations from [here].
Store them into `dataset/cub_200_2011`.
Run `preprocessing/cub_200_2011/extract_imgs.py` to group images by bird species (e.g., Laysan_Albatross) and attribute (e.g., curved bill).
Output directory: `dataset/cub_200_2011/att_cls_images`.
```
dataset
├── cub_200_2011
│   ├── images
│   ├── parts
│   ├── segmentations
│   ├── attributes.txt
│   ├── att_cls_images
│   │   ├── test
│   │   │   ├── 1_2
```
Reference: [link]
Please contact the authors of Wildfish++ to obtain the pairs of images (i.e., `Pair_images`) and their annotations (i.e., `fine_grained`).
Store them into `dataset/wildfish`.
Run `preprocessing/wildfish/extract_imgs_wf.py` to group images by pairs of similar fish species (e.g., Amphiprion_akindynos-Amphiprion_chrysopterus).
Output directory: `dataset/wildfish/diff_images`.
```
dataset
├── wildfish
│   ├── Pair_images
│   ├── fine_grained
│   ├── diff_images
│   │   ├── val
│   │   │   ├── Amphiprion_akindynos-Amphiprion_chrysopterus
```
Reference: [link]
Download Dev images from [here].
Store them into `dataset/magic_brush`.
```
dataset
├── magic_brush
│   ├── dev
│   │   ├── images
```
Reference: [link]
Download scenes from [here].
Download the test annotation file (i.e., `test.json`) from [here].
Store the scenes and the annotation into `dataset/spot-the-diff`.
Run `preprocessing/spot-the-diff/extract_imgs_sd.py` to generate pairs of similar scenes.
Output directory: `dataset/spot-the-diff/pair_images`.
```
dataset
├── spot-the-diff
│   ├── resized_images
│   ├── test.json
│   ├── pair_images
│   │   ├── test
```
Reference: [link]
Download `img_align_celeba.zip` from [here]. Unzip the file and store it into `dataset/celeba`.
Download `list_eval_partition.txt` from [here] and store it into `dataset/celeba`.
Download `list_attr_celeba.txt` from [here] and store it into `dataset/celeba`.
Run `preprocessing/celeba/extract_imgs.py` to group images by adjective (e.g., smiling).
Output directory: `dataset/celeba/adj_images`.
```
dataset
├── celeba
│   ├── img_align_celeba
│   ├── list_eval_partition.txt
│   ├── list_attr_celeba.txt
│   ├── adj_images
│   │   ├── test
│   │   │   ├── Smiling
```
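The grouping is performed by `extract_imgs.py`. As a rough sketch under the standard CelebA file formats (`list_eval_partition.txt` maps each image to split 0/1/2; `list_attr_celeba.txt` has the attribute names on its second line and one row of +/-1 labels per image), it could be approximated as:

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical sketch, not the actual script: group CelebA test images
# (partition id 2) by attribute, using the standard annotation file formats.
root = Path("dataset/celeba")

partition = {}
for line in (root / "list_eval_partition.txt").read_text().splitlines():
    name, split = line.split()
    partition[name] = int(split)  # 0: train, 1: val, 2: test

lines = (root / "list_attr_celeba.txt").read_text().splitlines()
attr_names = lines[1].split()     # second line lists the 40 attribute names
groups = defaultdict(list)        # e.g., "Smiling" -> [image names]
for line in lines[2:]:
    parts = line.split()
    name, values = parts[0], parts[1:]
    if partition.get(name) != 2:  # keep test images only
        continue
    for attr, value in zip(attr_names, values):
        if value == "1":
            groups[attr].append(name)

print(len(groups["Smiling"]), "smiling test images")
```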
Reference: [link]
Download test images from [here].
Store them into `dataset/fer-2013`.
```
dataset
├── fer-2013
│   ├── test
│   │   ├── angry
```
Reference: [link]
Check `preprocessing/soccernet/download_soccernet.py` to download the videos and their labels.
Store them into `dataset/soccernet`.
Run `preprocessing/soccernet/extract_temp_imgs.py` to extract a pair of frames for each action (e.g., corner kick).
Output directory: `*/*_frames_actions`.
```
dataset
├── soccernet
│   ├── val
│   │   ├── england_epl
│   │   │   ├── 2014-2015
│   │   │   │   ├── 2015-04-11 - 19-30 Burnley 0 - 1 Arsenal
│   │   │   │   │   ├── 1_frames
│   │   │   │   │   ├── 1_frames_actions
│   │   │   │   │   ├── 2_frames
│   │   │   │   │   ├── 2_frames_actions
│   │   │   │   │   ├── Labels-v2.json
│   │   │   │   │   ├── 1_224p.mkv
│   │   │   │   │   ├── 2_224p.mkv
```
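The frame extraction is implemented in `extract_temp_imgs.py`. A minimal sketch of the idea, assuming the `Labels-v2.json` format in which each annotation carries a `label` and a `position` in milliseconds, could grab frames with OpenCV:

```python
import json
import os
import cv2  # pip install opencv-python

# Hypothetical sketch, not the actual script: save one frame per annotated action
# from the first-half video of a match, seeking by the action's position in ms.
match_dir = ("dataset/soccernet/val/england_epl/2014-2015/"
             "2015-04-11 - 19-30 Burnley 0 - 1 Arsenal")
with open(os.path.join(match_dir, "Labels-v2.json")) as f:
    annotations = json.load(f)["annotations"]

out_dir = os.path.join(match_dir, "1_frames_actions")
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(os.path.join(match_dir, "1_224p.mkv"))
for i, ann in enumerate(annotations):
    if not ann["gameTime"].startswith("1"):  # first half only for 1_224p.mkv
        continue
    cap.set(cv2.CAP_PROP_POS_MSEC, float(ann["position"]))
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(os.path.join(out_dir, f"{i}_{ann['label']}.png"), frame)
cap.release()
```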
Reference: [link]
Follow [here] to download images.
Run `preprocessing/comp_cars/extract_objects_comp_cars.py` to group vehicle images by make, model, and released year.
Output directory: `dataset/comp_cars/test_images`.
```
dataset
├── comp_cars
│   ├── data
│   │   ├── image
│   │   ├── label
│   │   ├── misc
│   │   ├── part
│   │   ├── train_test_split
│   ├── test_image
│   │   ├── make_id
│   │   │   ├── model_id
│   │   │   │   ├── released_year
```
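The grouping is done by `extract_objects_comp_cars.py`. Since CompCars already stores images as `data/image/<make_id>/<model_id>/<released_year>/`, a rough sketch (ignoring the train/test split handling the script presumably applies) could simply walk that hierarchy:

```python
import shutil
from pathlib import Path

# Hypothetical sketch, not the actual script: regroup CompCars images into
# test_image/<make_id>/<model_id>/<released_year>/ by walking the source layout.
src = Path("dataset/comp_cars/data/image")
dst = Path("dataset/comp_cars/test_image")  # directory name as shown in the tree above

for img in src.glob("*/*/*/*.jpg"):
    make_id, model_id, year = img.parts[-4], img.parts[-3], img.parts[-2]
    out_dir = dst / make_id / model_id / year
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(img, out_dir / img.name)
```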
Reference: [link]
Download the Labeled dataset from [here].
Follow [here] to convert the .mat file to images.
Download `list_test.txt` from [here].
Run `preprocessing/nyu_depth_v2/extract_objects_nyu.py` to group images by object (e.g., air conditioner).
Output directory: `dataset/nyu_depth_v2/obj_images`.
```
dataset
├── nyu_depth_v2
│   ├── image
│   ├── list_test.txt
│   ├── obj_images
│   │   ├── test
│   │   │   ├── air_conditioner
```
Reference: [link]
Download the VQAv2 data from [here].
Run `preprocessing/vqav2/extract_counting_imgs_vqav2.py` to group samples by counting questions.
Output directory: `dataset/vqav2/counting_images`.
```
dataset
├── vqav2
│   ├── images
│   │   ├── train2014
│   │   ├── val2014
│   ├── v2_mscoco_train2014_annotations.json
│   ├── v2_mscoco_val2014_annotations.json
│   ├── v2_OpenEnded_mscoco_train2014_questions.json
│   ├── v2_OpenEnded_mscoco_val2014_questions.json
│   ├── counting_images
│   │   ├── train2014
│   │   ├── val2014
```
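The sample selection is done by `extract_counting_imgs_vqav2.py`. As a rough sketch, assuming the standard VQAv2 annotation format where each annotation carries a `question_type` (e.g., "how many"), the counting samples could be collected like this:

```python
import json

# Hypothetical sketch, not the actual script: collect VQAv2 val counting samples
# by filtering annotations whose question_type starts with "how many".
with open("dataset/vqav2/v2_OpenEnded_mscoco_val2014_questions.json") as f:
    questions = {q["question_id"]: q["question"] for q in json.load(f)["questions"]}
with open("dataset/vqav2/v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

counting = [
    {"image_id": ann["image_id"],
     "question": questions[ann["question_id"]],
     "answer": ann["multiple_choice_answer"]}
    for ann in annotations
    if ann["question_type"].startswith("how many")
]
print(len(counting), "counting questions in val2014")
```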
Reference: [link]
Download Dev data from [here].
```
dataset
├── q-bench2
│   ├── q-bench2-a1-dev.jsonl
│   ├── llvisionqa_compare_dev
```
Reference: [link]
All annotated pairs are available under [here]. Concretely,
- MIT-States: `st_label`
- Fashionpedia: `fashion_label`
- VAW: `vaw_label`
- CUB-200-2011: `cub_label`
- Wildfish: `Wildfish_label`
- MagicBrush: `mb_label`
- Spot-the-diff: `spot_difference_label`
- CelebA: `celebA_label`
- FER-2013: `fer2013_label`
- SoccerNet: `soccernet_label`
- CompCars: `car_label`
- NYU-Depth V2: `depth_label`
- VQAv2: `vqav2_label`
- Q-bench2: `qbench2_label`
Each dataset has one annotated JSON file, which contains a list of dictionaries. Each dictionary represents the annotation for a pair of images.
CompCars, CelebA, CUB-200-2011, NYU-Depth V2, Fashionpedia, FER-2013, MIT-States, VAW, and VQAv2 have the following keys in the annotation:
- `image_1`: First image
- `image_2`: Second image
- `question`: Question about the relative comparison between the two images
- `answer`: Correct answer. "Right" indicates that the second image is correct and "Left" means that the first image is correct
Note: All questions in CompCars are identical, so its annotation does not have the `question` key. The question is: "Based on these images, which car is newer in terms of its model year or release year? The term 'newer' is related to the year each car was manufactured or released, not its current condition or usage."
Note: MIT-States and VAW have an additional key `type`, which can be 'Size', 'Color', 'Pattern', 'Texture', 'Shape', or 'State'. 'Size', 'Color', 'Pattern', 'Texture', and 'Shape' are common visual attributes.
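For example, a minimal evaluation loop over one of these Left/Right annotation files could look like the following (the file name and `my_model_predict` are placeholders, not part of the release):

```python
import json

# Minimal sketch: score a model's Left/Right predictions against one annotation file.
with open("cub_label.json") as f:  # hypothetical file name for the CUB annotations
    pairs = json.load(f)           # list of dicts: image_1, image_2, question, answer

def my_model_predict(image_1, image_2, question):
    """Placeholder for an MLLM call that returns 'Left' or 'Right'."""
    return "Left"

correct = 0
for pair in pairs:
    pred = my_model_predict(pair["image_1"], pair["image_2"], pair["question"])
    correct += int(pred == pair["answer"])  # answer is "Left" or "Right"

print(f"accuracy: {correct / len(pairs):.3f}")
```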
MagicBrush and Spot-the-diff consist of multiple-choice questions in which models must select one of the provided options.
- `image_1`: First image
- `image_2`: Second image
- `options`: Options related to the images
- `answer`: Correct answer
Question for MagicBrush and Spot-the-diff: "What is the most obvious difference between two images? Choose from the following options. If there is no obvious difference, choose None. Options: None,{pair['options']}. Please only return one of the options without any other words."
MagicBrush has additional keys:
- `image_1_caption`: Caption for the first image
- `image_2_caption`: Caption for the second image
- `CLIP_similarity`: CLIP similarity between the two images
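Purely as an illustration, the fixed template above can be filled in from a pair dictionary like this (the field values below are made up, and `options` is assumed to be a comma-separated string; the exact field format may differ):

```python
# Illustrative sketch: build the MagicBrush / Spot-the-diff prompt from one pair dict.
pair = {
    "image_1": "input.png",
    "image_2": "output.png",
    "options": "a red apple on the table,a green apple on the table",
    "answer": "a red apple on the table",
}

question = (
    "What is the most obvious difference between two images? "
    "Choose from the following options. If there is no obvious difference, choose None. "
    f"Options: None,{pair['options']}. "
    "Please only return one of the options without any other words."
)
print(question)
```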
Q-bench2 contains multiple-choice questions in which models must select one of the provided options. The authors of Q-bench2 have already combined the two images into a single image.
- `image`: Combined image
- `question`: Question about the relative comparison between the two images
- `options`: Options related to the images
- `answer`: Correct answer
SoccerNet has the following keys:
- `image_1`: First frame
- `image_2`: Second frame
- `answer`: Correct answer. "Right" indicates that the second frame is correct and "Left" means that the first frame is correct
- `CLIP_similarity`: CLIP similarity between the two frames
- `action`: Soccer action related to the two frames
- `match`: Soccer match related to the two frames
- `league`: Soccer league related to the two frames
Question: "These are two frames related to {pair['action']}
in a soccer match. Which frame happens first? Please only return one option from (Left, Right, None) without any other words. If these two frames are exactly the same, select None. Otherwise, choose Left if the first frame happens first and select Right if the second frame happens first."
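Analogously, an illustrative sketch of assembling the SoccerNet prompt from a pair's `action` field (field values made up):

```python
# Illustrative sketch: build the SoccerNet temporality prompt from one pair dict.
pair = {"image_1": "frame_a.png", "image_2": "frame_b.png",
        "action": "corner", "answer": "Left"}

question = (
    f"These are two frames related to {pair['action']} in a soccer match. "
    "Which frame happens first? Please only return one option from (Left, Right, None) "
    "without any other words. If these two frames are exactly the same, select None. "
    "Otherwise, choose Left if the first frame happens first and select Right "
    "if the second frame happens first."
)
print(question)
```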
If you find the code and data useful, please cite the following paper:
```bibtex
@article{kil2024compbench,
  title={CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs},
  author={Kil, Jihyung and Mai, Zheda and Lee, Justin and Wang, Zihe and Cheng, Kerrie and Wang, Lemeng and Liu, Ye and Chowdhury, Arpita and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2407.16837},
  year={2024}
}
```