VSR: Visual Spatial Reasoning

A probing benchmark for the spatial understanding of vision-language models.

1 Overview

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False). Below are a few examples.

The cat is behind the laptop. (True)
The cow is ahead of the person. (False)
The cake is at the edge of the dining table. (True)
The horse is left of the person. (False)
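The task format can be sketched in a few lines of Python. This is only an illustration of the true/false judgment described above: the `VSRExample` class, the image file names, and the `accuracy` helper are hypothetical and are not part of the released code or data format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class VSRExample:
    image: str    # hypothetical image file name
    caption: str  # statement about the spatial relation of two objects
    label: bool   # True if the caption correctly describes the image


def accuracy(examples: Iterable[VSRExample],
             predict: Callable[[VSRExample], bool]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)


# The four example captions above, paired with made-up file names.
examples = [
    VSRExample("cat.jpg", "The cat is behind the laptop.", True),
    VSRExample("cow.jpg", "The cow is ahead of the person.", False),
    VSRExample("cake.jpg", "The cake is at the edge of the dining table.", True),
    VSRExample("horse.jpg", "The horse is left of the person.", False),
]

# A trivial "always True" predictor gets half of these four right.
print(accuracy(examples, lambda ex: True))  # 0.5
```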

1.1 Why VSR?

Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are great, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations, which allows for accurate diagnosis and maximum interpretability.

1.2 What have we found?

Below are the baselines' by-relation performances on VSR (random split). More data != better performance. The relations are sorted by frequency from left to right. The VLMs' by-relation performance has little correlation with relation frequency, meaning that more training data does not necessarily lead to better performance.

Understanding object orientation is hard. After classifying spatial relations into meta-categories, we can clearly see that all models are at chance level for "orientation"-related relations (such as "facing", "facing away from", "parallel to", etc.).

For more findings and takeaways, including zero-shot split performance, check out our paper!

2 The VSR dataset: Splits, statistics, and meta-data

The VSR corpus, after validation, contains 10,119 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly divide all data points into train, development, and test sets. The zero-shot split makes sure that the train, development, and test sets have no overlap of concepts (i.e., if dog is in the test set, it is not used for training or development). Below are some basic statistics of the two splits.
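The no-overlap property of the zero-shot split can be checked with a short script. The sketch below is illustrative only: it assumes each data point can be reduced to a pair of concept names (subject, object), which is a hypothetical schema and not necessarily how the released files are organized.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical schema: one (subject concept, object concept) pair per data point.
Example = Tuple[str, str]


def concepts(examples: Iterable[Example]) -> Set[str]:
    """All concept names mentioned in a list of examples."""
    return {concept for pair in examples for concept in pair}


def concept_overlaps(
    splits: Dict[str, List[Example]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Concepts shared by each pair of splits (empty dict if there are none)."""
    names = sorted(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = concepts(splits[first]) & concepts(splits[second])
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Toy splits with made-up concepts: no concept appears in more than one split.
toy = {
    "train": [("cat", "laptop")],
    "dev": [("cow", "person")],
    "test": [("dog", "bench")],
}
assert concept_overlaps(toy) == {}

# Reusing "laptop" in the test set violates the zero-shot condition.
leaky = {**toy, "test": [("dog", "laptop")]}
print(concept_overlaps(leaky))  # {('test', 'train'): {'laptop'}}
```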

split	train	dev	test	total
random	7,083	1,012	2,024	10,119
zero-shot	5,440	259	731	6,430
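As a quick sanity check on the table above, the per-split totals and proportions can be recomputed from the counts. The counts are copied from the table; the `proportions` helper is just an illustrative name.

```python
from typing import Dict

# Counts copied from the table above.
splits = {
    "random": {"train": 7083, "dev": 1012, "test": 2024},
    "zero-shot": {"train": 5440, "dev": 259, "test": 731},
}


def proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each subset in percent, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for name, counts in splits.items():
    print(name, sum(counts.values()), proportions(counts))
# random 10119 {'train': 70.0, 'dev': 10.0, 'test': 20.0}
# zero-shot 6430 {'train': 84.6, 'dev': 4.0, 'test': 11.4}
```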

Check out data/ for more details.

3 Baselines: Performance

We test three baselines, all supported in Hugging Face: VisualBERT (Li et al., 2019), LXMERT (Tan and Bansal, 2019), and ViLT (Kim et al., 2021).

model	random split	zero-shot
human	95.4	95.4
VisualBERT	57.4	54.0
LXMERT	72.5	63.2
ViLT	71.0	62.4

4 Baselines: How to run?

Download images

See the readme in the data/ folder. Images should be saved under data/images/.

Environment

Depending on your system configuration and CUDA version, you might need two separate environments: one for feature extraction (i.e., the "Extract visual embeddings" section below) and one for all other experiments. You can install the feature extraction environment by running feature_extraction/feature_extraction_environment.sh (specifically, feature extraction requires detectron2==0.5, CUDA==11.1, and torch==1.8). The default configuration for everything else can be found in requirements.txt.
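Before running feature extraction, it can help to confirm that the active environment matches the versions listed above. The following is a minimal sketch using only the Python standard library; the `check_versions` helper is hypothetical and not part of this repository, and it does not check the CUDA version.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict

# Version requirements stated above for the feature extraction environment.
REQUIRED = {"detectron2": "0.5", "torch": "1.8"}


def check_versions(required: Dict[str, str]) -> Dict[str, str]:
    """Report, per package, whether the installed version matches the prefix."""
    report = {}
    for package, wanted in required.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            report[package] = "not installed"
            continue
        # Drop local build tags such as "+cu111" before comparing.
        base = installed.split("+")[0]
        if base == wanted or base.startswith(wanted + "."):
            report[package] = "ok"
        else:
            report[package] = f"found {installed}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(REQUIRED).items():
        print(f"{package}: {status}")
```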

Extract visual embeddings

For VisualBERT and LXMERT, we first need to extract visual embeddings using pre-trained object detectors. This can be done by running:

bash feature_extraction/lxmert/extract.sh

VisualBERT feature extraction is done similarly by replacing lxmert with visualbert. The features will be stored under data/features/ and automatically loaded when running the training and evaluation scripts of LXMERT and VisualBERT. The feature extraction code is adapted from the Hugging Face examples for VisualBERT and LXMERT.

Train

scripts/ contains some example bash scripts for training and evaluation. For example, the following script trains LXMERT on the random split:

bash scripts/lxmert_train.sh 0

where 0 denotes the device index. Settings such as the checkpoint save path can be modified in the script.

Evaluation

Similarly, evaluating the obtained LXMERT model can be done by running:

bash scripts/lxmert_eval.sh 0

Settings such as the checkpoint load path can be modified in the script.

In analysis_scripts/ you can check out how to print by-relation and by-meta-category accuracies.
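For orientation, the idea behind a by-relation or by-meta-category breakdown can be sketched as a grouped accuracy. This is not the code in analysis_scripts/: the record fields, the toy predictions, and the "other" fallback category are assumptions made for illustration. Only the "orientation" grouping of "facing", "facing away from", and "parallel to" comes from the findings above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping

# Relations named above as "orientation"-related; anything else falls back to
# a placeholder "other" category (hypothetical, for illustration only).
ORIENTATION = {"facing", "facing away from", "parallel to"}


def meta_category(relation: str) -> str:
    return "orientation" if relation in ORIENTATION else "other"


def grouped_accuracy(records: Iterable[Mapping],
                     key: Callable[[Mapping], str]) -> Dict[str, float]:
    """Accuracy per group, where key(record) names the record's group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        correct[group] += int(record["prediction"] == record["label"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records with made-up predictions (hypothetical field names).
records = [
    {"relation": "behind", "label": 1, "prediction": 1},
    {"relation": "behind", "label": 0, "prediction": 1},
    {"relation": "facing", "label": 1, "prediction": 0},
    {"relation": "parallel to", "label": 0, "prediction": 0},
]

by_relation = grouped_accuracy(records, lambda r: r["relation"])
by_meta = grouped_accuracy(records, lambda r: meta_category(r["relation"]))
print(by_relation)  # {'behind': 0.5, 'facing': 0.0, 'parallel to': 1.0}
print(by_meta)      # {'other': 0.5, 'orientation': 0.5}
```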

Citation

If you find VSR useful, please cite:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}

License

This project is licensed under the Apache-2.0 License.
