wjpoom / SPEC

[CVPR 2024] The official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"

Home Page: https://arxiv.org/abs/2312.00081

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

📑 Paper  Data  📙 Notebook  ✒️ BibTex  🚀 Preview

Authors: Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

🔥 News

  • Apr. 14, 2024: We have released a preview of a more advanced version of the dataset. The full version is coming soon.
  • Apr. 13, 2024: We released the SPEC dataset and the evaluation code. Sorry for the delay ☺️.
  • Feb. 28, 2024: Our work has been accepted by CVPR 2024 🎉.

🚀 A more advanced version is coming!

We are building a new version with a larger data scale, more object categories, and higher-quality images and text. You can preview it at this website; the full version is coming soon.

πŸ” SPEC Benchmark

To evaluate how well vision-language models (VLMs) understand fine-grained concepts, we propose a new benchmark, SPEC. It consists of six distinct subsets distributed across four dimensions: Size, Position, Existence, and Count. Each test case consists of an image candidate set, whose images differ only in a certain visual concept, and a text candidate set, whose texts differ only in the corresponding language concept.
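To make the test-case structure concrete, the sketch below scores a single case from a matrix of image-text similarities. It is a hedged illustration, not the repository's evaluation code: the function names `argmax` and `score_test_case` are hypothetical, and it assumes that image i is paired with text i and that the model picks the highest-scoring candidate in each direction.

```python
from typing import List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def score_test_case(sim: List[List[float]]) -> Tuple[float, float]:
    """Score one test case from a square similarity matrix.

    sim[i][j] is the model's similarity between image i and text j.
    Assumption: image i is paired with text i (ground truth on the diagonal).
    Returns (image_to_text_accuracy, text_to_image_accuracy).
    """
    n = len(sim)
    if n == 0 or any(len(row) != n for row in sim):
        raise ValueError("expected a non-empty square matrix")
    # Image-to-text: for each image (row), pick the best-scoring text.
    i2t_hits = sum(argmax(sim[i]) == i for i in range(n))
    # Text-to-image: for each text (column), pick the best-scoring image.
    t2i_hits = sum(
        argmax([sim[i][j] for i in range(n)]) == j for j in range(n)
    )
    return i2t_hits / n, t2i_hits / n


if __name__ == "__main__":
    # Toy similarities for a case with three images and three texts.
    toy = [
        [0.9, 0.2, 0.1],
        [0.3, 0.4, 0.8],
        [0.1, 0.3, 0.7],
    ]
    print(score_test_case(toy))
```

Because the candidates in a case differ only in one concept, a model that ignores that concept should score close to chance under this kind of matching.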

🔧 Usage

install

git clone https://github.com/wjpoom/SPEC.git
cd SPEC/
pip install -e .

prepare data

  • Run the following code in a Python shell, replacing /path/to/save/data with the directory where you want to store the data.
import os
import zipfile

from huggingface_hub import hf_hub_download

data_root = '/path/to/save/data'  # replace with your own directory
zip_path = os.path.join(data_root, 'data.zip')

# Download data.zip from the Hugging Face dataset repo into data_root.
hf_hub_download(repo_id='wjpoom/SPEC', repo_type='dataset', filename='data.zip', local_dir=data_root)

# Unpack the archive next to it.
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_root)

# Remove the archive once it has been extracted.
os.remove(zip_path)
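After extraction it can be useful to check what actually landed on disk. The helper below is a small sketch that uses only the standard library. The name `count_files_by_top_level` is hypothetical, and it makes no assumption about the layout of data.zip, which is not described here; it simply counts files under each top-level entry of the data directory.

```python
import os
from collections import Counter
from typing import Dict


def count_files_by_top_level(data_root: str) -> Dict[str, int]:
    """Count files under each top-level entry of data_root.

    Files sitting directly in data_root are counted under the key ".".
    """
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(data_root):
        rel = os.path.relpath(dirpath, data_root)
        top = rel.split(os.sep)[0]  # "." for data_root itself
        counts[top] += len(filenames)
    return dict(counts)


if __name__ == "__main__":
    data_root = "/path/to/save/data"  # same placeholder as in the step above
    for name, n in sorted(count_files_by_top_level(data_root).items()):
        print(f"{name}: {n} files")
```

If the directory is missing or empty, the function returns an empty dictionary, which is a quick signal that the download or extraction did not complete.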

explore the dataset

  • We provide a 📓 notebook that lets you visually explore the test samples in the SPEC dataset.
  • Run this notebook either locally or online using Colab.

reproduce the results

  • In our paper, we evaluated four popular VLMs using our SPEC dataset, namely: CLIP, BLIP, FLAVA and CoCa.
  • To reproduce the results with these VLMs, you can run this script.
  • You can also reproduce with this local notebook or the online Colab notebook.

evaluate custom VLMs

  • If you want to evaluate your custom model on SPEC, you can follow the instructions in this document.

πŸ“ TODO

  • Release the newly built version of the dataset
  • Release the code of our data synthesis pipeline
  • Release the testing set of SPEC benchmark
  • Release the evaluation code of SPEC

πŸ‘ Acknowledgement

Part of this repository is built upon ARO. We thank its authors for the well-organized codebase.

βœ’οΈ Citation

If you use the code or data in this repo, or find our work helpful, please consider citing it:

@inproceedings{peng2024spec,
  title={Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding},
  author={Wujian Peng and Sicheng Xie and Zuyao You and Shiyi Lan and Zuxuan Wu},
  booktitle={CVPR},
  year={2024}
}
