
Predicting Human Scanpaths in Visual Question Answering

This code implements the prediction of human scanpaths (ordered sequences of fixations; see the sketch after this list) in three different tasks:

  • Visual Question Answering: predicting the scanpath while humans perform a general task, e.g., answering a question about an image, to reflect their attending and reasoning processes.
  • Free-viewing: predicting the scanpath while humans look at salient or important objects in a given image.
  • Visual search: predicting the scanpath while humans search for a given target object, reflecting goal-directed behavior.
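For orientation, a scanpath can be thought of as an ordered sequence of fixations, each with an image coordinate and a duration. The minimal sketch below uses illustrative field names and units only; the actual dataset formats are described in each task's folder.

# Illustrative scanpath representation: one fixation per entry.
# Field names ("x", "y", "duration_ms") and units are assumptions,
# not the format used by the AiR, OSIE, or COCO-Search18 loaders.
scanpath = [
    {"x": 312.0, "y": 208.5, "duration_ms": 220},  # first fixation
    {"x": 455.2, "y": 190.1, "duration_ms": 310},  # saccade to the next region
    {"x": 128.7, "y": 344.9, "duration_ms": 180},
]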

Reference

If you find the code useful in your research, please consider citing the paper.

@InProceedings{xianyu:2021:scanpath,
    author = {Xianyu Chen and Ming Jiang and Qi Zhao},
    title = {Predicting Human Scanpaths in Visual Question Answering},
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2021}
}

Disclaimer

For the ScanMatch evaluation metric, we adopt part of the GazeParser package. For the SED and STDE metrics described in Visual Attention Models, we adopt the implementations from VAME. We slightly modify the checkpoint implementation from updown-baseline to accommodate our pipeline.
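For readers unfamiliar with SED, the sketch below illustrates the general idea behind string edit distance for scanpaths: fixations are quantized into grid cells, and the two resulting label sequences are compared with Levenshtein distance. This is a conceptual illustration only, not the implementation adopted from VAME; the grid size, image dimensions, and normalization are assumptions.

# Conceptual SED sketch (not the VAME implementation).
def fixations_to_labels(fixations, img_w, img_h, grid=5):
    """Map each (x, y) fixation to a grid-cell index, giving a label sequence."""
    labels = []
    for x, y in fixations:
        col = min(int(x / img_w * grid), grid - 1)
        row = min(int(y / img_h * grid), grid - 1)
        labels.append(row * grid + col)
    return labels

def string_edit_distance(a, b):
    """Levenshtein distance between two label sequences (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[len(b)]

# Example: compare a predicted scanpath with a human one on a 640x480 image.
predicted = [(100, 80), (320, 240), (500, 400)]
human = [(110, 90), (300, 250), (620, 60)]
s1 = fixations_to_labels(predicted, 640, 480)
s2 = fixations_to_labels(human, 640, 480)
print(string_edit_distance(s1, s2) / max(len(s1), len(s2)))  # normalized SED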

Requirements

  • Python 3.7

  • PyTorch 1.6 (along with torchvision)

  • We also provide a conda environment file, sp_baseline.yml; you can directly run

$ conda env create -f sp_baseline.yml

to create the same environment in which we successfully ran our code.
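After creating it, activate the environment before running the code. The command below assumes the environment defined in sp_baseline.yml is named sp_baseline; if the name field in the file differs, use that name instead.

$ conda activate sp_baseline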

Tasks

We provide the corresponding code for the three tasks above, each on a different dataset:

  • Visual Question Answering (AiR dataset)

  • Free-viewing (OSIE dataset)

  • Visual search (COCO-Search18 dataset)

More details for each task are provided in its corresponding folder.
