CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Official dataset and code for CompBench.

Benchmark Summary

CompBench is a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench comprises around 40K image pairs collected from a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance.

Check our project page (https://compbench.github.io/) and paper for key contributions and findings.

Release Process

  • Dataset
    • Preparing images
    • Preparing question-answer pairs
  • Model evaluation

Preparing images

Images in CompBench are collected from fourteen publicly available datasets.

MIT-States

Download images from [here].

Store the images into dataset/transformed_states.

dataset
├── transformed_states
│ ├── release_dataset
│ │ ├── images
│ │ │ ├── deflated ball
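
No preprocessing script is listed for MIT-States, since its release already organizes images into "<adjective> <noun>" folders (e.g., deflated ball). A minimal sketch of grouping those folder names by noun to find contrasting states is below; splitting on the first space is an assumption, so verify against your copy.

```python
import os
from collections import defaultdict

# Sketch: MIT-States folders are named "<adjective> <noun>" (e.g. "deflated ball"),
# so nouns that appear with several contrasting adjectives can be found by
# grouping the folder names. Single-space splitting is an assumption.
img_root = "dataset/transformed_states/release_dataset/images"
by_noun = defaultdict(list)
for folder in os.listdir(img_root):
    adjective, _, noun = folder.partition(" ")
    by_noun[noun].append(adjective)

contrastable = {n: adjs for n, adjs in by_noun.items() if len(adjs) > 1}
print(len(contrastable), "nouns with more than one state")
```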

Reference: [link]

Fashionpedia

Download images from [here].

Download annotations from [here].

Store the images and the annotations into dataset/fashionpedia.

Run preprocessing/fashionpedia/extract_adj_obj.py to group images by object (e.g., dress) and adjective (e.g., argyle).

Output directory: dataset/fashionpedia/adj_obj_folder_pairs_images.

dataset
├── fashionpedia
│ ├── test
│ ├── instances_attributes_val2020.json
│ ├── adj_obj_folder_pairs_images
│ │ ├── val
│ │ │ ├── straight_pants-loose (fit)_pants
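
The grouping performed by preprocessing/fashionpedia/extract_adj_obj.py can be approximated as below. This is only a sketch: it assumes the standard COCO-style layout of instances_attributes_val2020.json (images, annotations with attribute_ids and category_id, categories, attributes) and stops at per-(attribute, object) groups rather than the paired folders the released script writes.

```python
import json
from collections import defaultdict

# Sketch only: group Fashionpedia garments by (attribute, category), e.g.
# ("straight", "pants") vs. ("loose (fit)", "pants"). Field names assume the
# standard COCO-style Fashionpedia annotation file; verify against your copy.
with open("dataset/fashionpedia/instances_attributes_val2020.json") as f:
    data = json.load(f)

categories = {c["id"]: c["name"] for c in data["categories"]}
attributes = {a["id"]: a["name"] for a in data["attributes"]}
images = {i["id"]: i["file_name"] for i in data["images"]}

groups = defaultdict(set)  # (attribute, object) -> image file names
for ann in data["annotations"]:
    for att_id in ann.get("attribute_ids", []):
        groups[(attributes[att_id], categories[ann["category_id"]])].add(
            images[ann["image_id"]])

print(len(groups), "attribute-object groups")
```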

Reference: [link]

VAW

Follow [here] to download Visual Genome (VG) part1 and part2 images.

Store its part1 images into dataset/vaw/images/VG_100K. Store its part2 images into dataset/vaw/images/VG_100K_2.

Download a Val annotation file (i.e., val.json) from [here].

Store the annotation file into dataset/vaw/data.

Run preprocessing/vaw/extract_imgs_vaw.py to group images by object (e.g., hair) and adjective (e.g., wet).

Output directory: dataset/vaw/att_obj_images.

dataset
├── vaw
│ ├── images
│ ├── data
│ ├── att_obj_images
│ │ ├── val
│ │ │ ├── wet hair
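
preprocessing/vaw/extract_imgs_vaw.py does the actual grouping; the sketch below only illustrates the idea, assuming the public VAW schema in val.json (a list of instances with image_id, object_name, and positive_attributes).

```python
import json
from collections import defaultdict

# Sketch: group VAW validation instances into "<attribute> <object>" buckets
# such as "wet hair". Field names follow the public VAW schema; verify them
# against your copy of val.json.
with open("dataset/vaw/data/val.json") as f:
    instances = json.load(f)

groups = defaultdict(list)  # "attribute object" -> Visual Genome image ids
for inst in instances:
    for att in inst.get("positive_attributes", []):
        groups[f"{att} {inst['object_name']}"].append(inst["image_id"])

# The images themselves sit in dataset/vaw/images/VG_100K and VG_100K_2,
# named "<image_id>.jpg".
print(len(groups), "attribute-object groups")
```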

Reference: [link]

CUB-200-2011

Download images and annotations from [here].

Store them into dataset/cub_200_2011.

Run preprocessing/cub_200_2011/extract_imgs.py to group images by bird species (e.g., Laysan_Albatross) and attribute (e.g., curved bill).

Output directory: dataset/cub_200_2011/att_cls_images.

dataset
├── cub_200_2011
│ ├── images
│ ├── parts
│ ├── segmentations
│ ├── attributes.txt
│ ├── att_cls_images
│ │ ├── test
│ │ │ ├── 1_2

Reference: [link]

Wildfish

Please contact the authors of Wildfish++ to download the paired images (i.e., Pair_images) and their annotations (i.e., fine_grained).

Store them into dataset/wildfish.

Run preprocessing/wildfish/extract_imgs_wf.py to group images by two similar fish species (e.g., Amphiprion_akindynos-Amphiprion_chrysopterus).

Output directory: dataset/wildfish/diff_images.

dataset
├── wildfish
│ ├── Pair_images
│ ├── fine_grained
│ ├── diff_images
│ │ ├── val
│ │ │ ├── Amphiprion_akindynos-Amphiprion_chrysopterus

Reference: [link]

MagicBrush

Download Dev images from [here].

Store them into dataset/magic_brush.

dataset
├── magic_brush
│ ├── dev
│ │ ├── images

Reference: [link]

Spot-the-diff

Download scenes from [here].

Download the test annotation file (i.e., test.json) from [here].

Store the scenes and the annotation into dataset/spot-the-diff.

Run preprocessing/spot-the-diff/extract_imgs_sd.py to generate pairs of similar scenes.

Output directory: dataset/spot-the-diff/pair_images.

dataset
├── spot-the-diff
│ ├── resized_images
│ ├── test.json
│ ├── pair_images
│ │ ├── test
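
preprocessing/spot-the-diff/extract_imgs_sd.py produces the pairs; a rough sketch of the pairing step is below. It assumes each entry in test.json carries an img_id and that the two scenes are stored as <img_id>.png and <img_id>_2.png under resized_images, which you should verify against your download.

```python
import json, os, shutil

# Sketch: copy the before/after scenes listed in test.json into pair_images/test.
# The "<img_id>.png" / "<img_id>_2.png" naming is an assumption; verify it.
root = "dataset/spot-the-diff"
with open(os.path.join(root, "test.json")) as f:
    entries = json.load(f)

out = os.path.join(root, "pair_images", "test")
os.makedirs(out, exist_ok=True)
for entry in entries:
    img_id = str(entry["img_id"])
    for name in (f"{img_id}.png", f"{img_id}_2.png"):
        shutil.copy(os.path.join(root, "resized_images", name),
                    os.path.join(out, name))
```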

Reference: [link]

CelebA

Download img_align_celeba.zip from [here].

Unzip the file and store it into dataset/celeba.

Download list_eval_partition.txt from [here].

Store it into dataset/celeba.

Download list_attr_celeba.txt from [here].

Store it into dataset/celeba.

Run preprocessing/celeba/extract_imgs.py to group images by adjectives (e.g., smiling).

Output directory: dataset/celeba/adj_images.

dataset
├── celeba
│ ├── img_align_celeba
│ ├── list_eval_partition.txt
│ ├── list_attr_celeba.txt
│ ├── adj_images
│ │ ├── test
│ │ │ ├── Smiling
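
The grouping done by preprocessing/celeba/extract_imgs.py can be approximated as follows, assuming the standard CelebA file formats (list_eval_partition.txt marks the test split with 2; list_attr_celeba.txt lists 40 binary attributes per image).

```python
import os
from collections import defaultdict

celeba = "dataset/celeba"

# list_eval_partition.txt: "<file> <0|1|2>", where 2 is the test split.
test_files = set()
with open(os.path.join(celeba, "list_eval_partition.txt")) as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2 and parts[1] == "2":
            test_files.add(parts[0])

# list_attr_celeba.txt: first line is the image count, second line the 40
# attribute names, then "<file>" followed by 40 values of 1 or -1.
groups = defaultdict(list)  # attribute (e.g. "Smiling") -> image files
with open(os.path.join(celeba, "list_attr_celeba.txt")) as f:
    f.readline()                  # image count
    attrs = f.readline().split()  # attribute names
    for line in f:
        parts = line.split()
        name, flags = parts[0], parts[1:]
        if name not in test_files:
            continue
        for attr, flag in zip(attrs, flags):
            if flag == "1":
                groups[attr].append(name)

print(len(groups["Smiling"]), "smiling test images")
```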

Reference: [link]

FER-2013

Download test images from [here].

Store them into dataset/fer-2013.

dataset
├── fer-2013
│ ├── test
│ │ ├── angry

Reference: [link]

SoccerNet

Use preprocessing/soccernet/download_soccernet.py to download the videos and their labels. Store them into dataset/soccernet.

Run preprocessing/soccernet/extract_temp_imgs.py to extract a pair of frames from the action (e.g., corner-kick).

Output directory: */*_frames_actions.

dataset
├── soccernet
│ ├── val
│ │ ├── england_epl
│ │ │ ├── 2014-2015
│ │ │ │ ├── 2015-04-11 - 19-30 Burnley 0 - 1 Arsenal
│ │ │ │ │ ├── 1_frames
│ │ │ │ │ ├── 1_frames_actions
│ │ │ │ │ ├── 2_frames
│ │ │ │ │ ├── 2_frames_actions
│ │ │ │ │ ├── Labels-v2.json
│ │ │ │ │ ├── 1_224p.mkv
│ │ │ │ │ ├── 2_224p.mkv
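
preprocessing/soccernet/extract_temp_imgs.py builds the *_frames_actions folders; a minimal sketch of grabbing one frame per annotated action with OpenCV is shown below. It assumes the Labels-v2.json annotations carry gameTime, label, and a millisecond position field, uses only the first-half video, and takes the match path from the tree above purely as an example.

```python
import json, os
import cv2  # pip install opencv-python

match_dir = ("dataset/soccernet/val/england_epl/2014-2015/"
             "2015-04-11 - 19-30 Burnley 0 - 1 Arsenal")

with open(os.path.join(match_dir, "Labels-v2.json")) as f:
    labels = json.load(f)

out_dir = os.path.join(match_dir, "1_frames_actions")
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(os.path.join(match_dir, "1_224p.mkv"))
for ann in labels["annotations"]:
    # "gameTime" looks like "1 - 12:34"; "position" is assumed to be the
    # timestamp in milliseconds within that half.
    if not ann["gameTime"].startswith("1"):
        continue  # only the first half's video in this sketch
    cap.set(cv2.CAP_PROP_POS_MSEC, float(ann["position"]))
    ok, frame = cap.read()
    if ok:
        name = f"{ann['label'].replace(' ', '_')}_{ann['position']}.jpg"
        cv2.imwrite(os.path.join(out_dir, name), frame)
cap.release()
```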

Reference: [link]

CompCars

Follow [here] to download images.

Run preprocessing/comp_cars/extract_objects_comp_cars.py to group vehicle images by the make, model, and released year.

Output directory: dataset/comp_cars/test_images.

dataset
├── comp_cars
│ ├── data
│ │ ├── image
│ │ ├── label
│ │ ├── misc
│ │ ├── part
│ │ ├── train_test_split
│ ├── test_image
│ │ ├── make_id
│ │ │ ├── model_id
│ │ │ │ ├── released_year

Reference: [link]

NYU-Depth V2

Download the labeled dataset from [here].

Follow [here] to convert the .mat file to images.

Download list_test.txt from [here].

Run preprocessing/nyu_depth_v2/extract_objects_nyu.py to group images by the object (e.g., air conditioner).

Output directory: dataset/nyu_depth_v2/obj_images.

dataset
├── nyu_depth_v2
│ ├── image
│ ├── list_test.txt
│ ├── obj_images
│ │ ├── test
│ │ │ ├── air_conditioner

Reference: [link]

VQAv2

Download VQAv2 data from [here].

Run preprocessing/vqav2/extract_counting_imgs_vqav2.py to group samples by the counting questions.

Output directory: dataset/vqav2/counting_images.

dataset
├── vqav2
│ ├── images
│ │ ├── train2014
│ │ ├── val2014
│ ├── v2_mscoco_train2014_annotations.json
│ ├── v2_mscoco_val2014_annotations.json
│ ├── v2_OpenEnded_mscoco_train2014_questions.json
│ ├── v2_OpenEnded_mscoco_val2014_questions.json
│ ├── counting_images
│ │ ├── train2014
│ │ ├── val2014
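
preprocessing/vqav2/extract_counting_imgs_vqav2.py groups samples by counting questions; the filtering idea looks roughly like this, assuming the standard VQAv2 JSON layout (questions and annotations keyed by question_id, with a question_type field in the annotations).

```python
import json

# Sketch: collect VQAv2 val samples whose annotated question type is a
# counting question ("how many ..."). Keys follow the standard VQAv2 files.
with open("dataset/vqav2/v2_OpenEnded_mscoco_val2014_questions.json") as f:
    questions = {q["question_id"]: q for q in json.load(f)["questions"]}
with open("dataset/vqav2/v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

counting = []
for ann in annotations:
    if ann["question_type"].startswith("how many"):
        q = questions[ann["question_id"]]
        counting.append({
            "image": f"COCO_val2014_{q['image_id']:012d}.jpg",
            "question": q["question"],
            "answer": ann["multiple_choice_answer"],
        })

print(len(counting), "counting questions in val2014")
```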

Reference: [link]

Q-Bench2

Download Dev data from [here].

dataset
├── q-bench2
│ ├── q-bench2-a1-dev.jsonl
│ ├── llvisionqa_compare_dev

Reference: [link]

Preparing question-answer pairs

All annotated pairs are available under [here]. Concretely,

  • MIT-States: st_label
  • Fashionpedia: fashion_label
  • VAW: vaw_label
  • CUB-200-2011: cub_label
  • Wildfish: Wildfish_label
  • MagicBrush: mb_label
  • Spot-the-diff: spot_difference_label
  • CelebA: celebA_label
  • FER-2013: fer2013_label
  • SoccerNet: soccernet_label
  • CompCars: car_label
  • NYU-Depth V2: depth_label
  • VQAv2: vqav2_label
  • Q-bench2: qbench2_label

Each dataset has one annotated JSON file, which contains a list of dictionaries. Each dictionary represents the annotation for a pair of images.

CompCars, CelebA, CUB-200-2011, NYU-Depth V2, Fashionpedia, FER-2013, MIT-States, VAW and VQAv2 have the following keys in the annotation:

  • image_1: First image
  • image_2: Second image
  • question: Question about the relative comparison between the two images
  • answer: Correct answer. "Right" indicates that the second image is correct and "Left" means that the first image is correct

Note: All questions in CompCars are the same, so its annotation does not have the question key. The question is: "Based on these images, which car is newer in terms of its model year or release year? The term 'newer' is related to the year each car was manufactured or released, not its current condition or usage."

Note: MIT-States and VAW have an additional key, type, which can be 'Size', 'Color', 'Pattern', 'Texture', 'Shape', or 'State'. 'Size', 'Color', 'Pattern', 'Texture', and 'Shape' are common visual attributes.
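
As an example, scoring a model's Left/Right predictions against one of these annotation files could look like the sketch below; the predict_left_or_right callable and the label path are placeholders for your own setup, not part of the released code.

```python
import json

def evaluate(label_path, predict_left_or_right):
    """Score a model on one CompBench annotation file with Left/Right answers.

    `predict_left_or_right(image_1, image_2, question)` is a placeholder for
    whatever MLLM call you use; it should return "Left" or "Right".
    """
    with open(label_path) as f:
        pairs = json.load(f)  # list of dicts with image_1, image_2, question, answer

    correct = 0
    for pair in pairs:
        question = pair.get("question", "")  # e.g. CompCars pairs omit this key
        pred = predict_left_or_right(pair["image_1"], pair["image_2"], question)
        correct += (pred == pair["answer"])
    return correct / len(pairs)
```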

MagicBrush and Spot-the-diff consist of multiple-choice questions where models need to select one of the provided options. Their annotations have the following keys:

  • image_1: First image
  • image_2: Second image
  • options: Options related to images
  • answer: Correct answer

Question for MagicBrush and Spot-the-diff: "What is the most obvious difference between two images? Choose from the following options. If there is no obvious difference, choose None. Options: None,{pair['options']}. Please only return one of the options without any other words."
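
Building that prompt from an annotated pair is a direct string format, e.g.:

```python
def build_difference_prompt(pair):
    # `pair` is one dict from the MagicBrush or Spot-the-diff annotation file.
    return (
        "What is the most obvious difference between two images? "
        "Choose from the following options. If there is no obvious difference, "
        f"choose None. Options: None,{pair['options']}. "
        "Please only return one of the options without any other words."
    )
```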

MagicBrush has additional keys:

  • image_1_caption: Caption for the first image
  • image_2_caption: Caption for the second image
  • CLIP_similarity: CLIP similarity between the two images

Q-bench2 contains multiple-choice questions where models need to select one of the provided options. The authors of Q-bench2 have already combined the two images into a single image. Its annotation has the following keys:

  • image: Combined image
  • question: Question about the relative comparison between the two images
  • options: Options related to images
  • answer: Correct answer

SoccerNet has the following keys:

  • image_1: First frame
  • image_2: Second frame
  • answer: Correct answer. "Right" indicates that the second frame is correct and "Left" means that the first frame is correct
  • CLIP_similarity: CLIP similarity between the two frames
  • action: Soccer action related to the two frames
  • match: Soccer match related to the two frames
  • league: Soccer League related to the two frames

Question: "These are two frames related to {pair['action']} in a soccer match. Which frame happens first? Please only return one option from (Left, Right, None) without any other words. If these two frames are exactly the same, select None. Otherwise, choose Left if the first frame happens first and select Right if the second frame happens first."

Citation

If you find the code and data useful, please cite the following paper:

@article{kil2024compbench,
  title={CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs},
  author={Kil, Jihyung and Mai, Zheda and Lee, Justin and Wang, Zihe and Cheng, Kerrie and Wang, Lemeng and Liu, Ye and Chowdhury, Arpita and Chao, Wei-Lun},
  journal={arXiv preprint arXiv:2407.16837},
  year={2024}
}
