The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False). Below are a few examples.
The cat is behind the laptop. (True) | The cow is ahead of the person. (False) | The cake is at the edge of the dining table. (True) | The horse is left of the person. (False) |
---|---|---|---|
Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are great, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations, enabling accurate diagnosis and maximum interpretability.
Below is the baselines' by-relation performance on VSR (random split). More data != better performance. The relations are sorted by frequency from left to right. The VLMs' by-relation performance shows little correlation with relation frequency, meaning that more training data does not necessarily lead to better performance.
Understanding object orientation is hard. After classifying spatial relations into meta-categories, we can clearly see that all models are at chance level for "orientation"-related relations (such as "facing", "facing away from", "parallel to", etc.).
For more findings and takeaways, including zero-shot split performance, check out our paper!
The VSR corpus, after validation, contains 10,119 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly divide all data points into train, development, and test sets. The zero-shot split ensures that the train, development, and test sets share no concepts (i.e., if dog appears in the test set, it is not used for training or development). Below are some basic statistics of the two splits.
split | train | dev | test | total |
---|---|---|---|---|
random | 7,083 | 1,012 | 2,024 | 10,119 |
zero-shot | 5,440 | 259 | 731 | 6,430 |
Check out `data/` for more details.
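To make the idea of a concept-disjoint (zero-shot) split concrete, here is a minimal Python sketch. It is not the procedure used to build the released splits: the field names `subj` and `obj`, the partition ratios, and the function `zero_shot_split` are all hypothetical and chosen only for illustration. The sketch assigns each concept to exactly one partition and keeps an example only when both of its concepts land in the same partition.

```python
import random


def zero_shot_split(examples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Partition concepts (not examples) into train/dev/test.

    `examples` is a list of dicts with hypothetical keys "subj" and "obj".
    Returns the three example lists and the number of dropped examples.
    """
    # Collect every concept that appears as either object of a caption.
    concepts = sorted({c for ex in examples for c in (ex["subj"], ex["obj"])})
    rng = random.Random(seed)
    rng.shuffle(concepts)

    # Assign each concept to exactly one partition (ratios are assumptions).
    n_train = int(len(concepts) * ratios[0])
    n_dev = int(len(concepts) * ratios[1])
    partition = {}
    for i, concept in enumerate(concepts):
        if i < n_train:
            partition[concept] = "train"
        elif i < n_train + n_dev:
            partition[concept] = "dev"
        else:
            partition[concept] = "test"

    # Keep an example only if both of its concepts share a partition;
    # otherwise it would leak a concept across sets, so it is dropped.
    splits = {"train": [], "dev": [], "test": []}
    dropped = 0
    for ex in examples:
        p_subj, p_obj = partition[ex["subj"]], partition[ex["obj"]]
        if p_subj == p_obj:
            splits[p_subj].append(ex)
        else:
            dropped += 1
    return splits, dropped


# Toy usage with made-up records (not the real data format).
toy = [
    {"subj": "cat", "obj": "laptop"},
    {"subj": "cow", "obj": "person"},
    {"subj": "cake", "obj": "dining table"},
    {"subj": "horse", "obj": "person"},
]
toy_splits, toy_dropped = zero_shot_split(toy)
print({name: len(items) for name, items in toy_splits.items()}, "dropped:", toy_dropped)
```

Under a scheme like this, examples that pair concepts from different partitions are discarded, which would be consistent with the zero-shot total (6,430) being smaller than the full corpus (10,119).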
We test three baselines, all supported in Hugging Face. They are VisualBERT (Li et al., 2019), LXMERT (Tan and Bansal, 2019), and ViLT (Kim et al., 2021).
model | random split | zero-shot |
---|---|---|
human | 95.4 | 95.4 |
VisualBERT | 57.4 | 54.0 |
LXMERT | 72.5 | 63.2 |
ViLT | 71.0 | 62.4 |
See the README in the `data/` folder. Images should be saved under `data/images/`.
Depending on your system configuration and CUDA version, you might need two environments: one for feature extraction (i.e., the "Extract visual embeddings" section below) and one for all other experiments. You can install the feature extraction environment by running `feature_extraction/feature_extraction_environment.sh` (specifically, feature extraction requires detectron2==0.5, CUDA==11.1, and torch==1.8). The default configuration for everything else can be found in `requirements.txt`.
For VisualBERT and LXMERT, we first need to extract visual embeddings using pre-trained object detectors. This can be done by running:
bash feature_extraction/lxmert/extract.sh
VisualBERT feature extraction is done similarly by replacing `lxmert` with `visualbert`. The features will be stored under `data/features/` and automatically loaded when running the training and evaluation scripts for LXMERT and VisualBERT. The feature extraction code is adapted from the Hugging Face examples for VisualBERT and LXMERT.
`scripts/` contains some example bash scripts for training and evaluation. For example, the following script trains LXMERT on the random split:
bash scripts/lxmert_train.sh 0
where `0` denotes the device index. Configurations such as the checkpoint save path can be modified in the script.
Similarly, evaluating the obtained LXMERT model can be done by running:
bash scripts/lxmert_eval.sh 0
Configurations such as the checkpoint load path can be modified in the script.
In `analysis_scripts/`, you can check out how to print by-relation and by-meta-category accuracies.
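For readers who want a quick picture of what by-relation and by-meta-category accuracy mean, here is a minimal Python sketch. It is not the code in `analysis_scripts/`: the record keys (`relation`, `label`, `prediction`), the helper `grouped_accuracy`, and the toy predictions are hypothetical. Only the three "orientation" relations are taken from the text above; every other relation is lumped into a placeholder "other" group rather than guessing its meta-category.

```python
from collections import defaultdict

# Relations named as "orientation"-related in this README; all remaining
# relations fall into a placeholder "other" group (an assumption).
ORIENTATION_RELATIONS = {"facing", "facing away from", "parallel to"}


def meta_category(relation):
    """Map a relation to a coarse group for illustration only."""
    return "orientation" if relation in ORIENTATION_RELATIONS else "other"


def grouped_accuracy(records, key):
    """Return {group: accuracy}, grouping records by `key(record)`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = key(record)
        total[group] += 1
        correct[group] += int(record["label"] == record["prediction"])
    return {group: correct[group] / total[group] for group in total}


# Made-up predictions (1 = True, 0 = False); not real model output.
records = [
    {"relation": "behind", "label": 1, "prediction": 1},
    {"relation": "behind", "label": 0, "prediction": 1},
    {"relation": "facing", "label": 1, "prediction": 0},
    {"relation": "parallel to", "label": 0, "prediction": 0},
]

by_relation = grouped_accuracy(records, key=lambda r: r["relation"])
by_meta = grouped_accuracy(records, key=lambda r: meta_category(r["relation"]))

for name, acc in sorted(by_relation.items()):
    print(f"{name}: {acc:.3f}")
for name, acc in sorted(by_meta.items()):
    print(f"[{name}] {acc:.3f}")
```

Passing the grouping rule as a function keeps the accuracy computation identical for both views, so only the mapping needs to change when a finer set of meta-categories is used.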
If you find VSR useful, please cite:
@article{Liu2022VisualSR,
title={Visual Spatial Reasoning},
author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
journal={ArXiv},
year={2022},
volume={abs/2205.00363}
}
This project is licensed under the Apache-2.0 License.