GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

This code implements the prediction of visual scanpaths along with their corresponding natural language explanations in three different tasks (on three different datasets) with two different architectures:

Free-viewing: predicting the scanpath of an observer looking at salient or important objects in a given image. (OSIE)
Visual Question Answering: predicting the scanpath of a human performing a general task, e.g., visual question answering, to reflect their attending and reasoning processes. (AiR-D)
Visual search: predicting the scanpath during the search for a given target object, to reflect goal-directed behavior under target-present and target-absent conditions. (COCO-Search18 Target-Present and Target-Absent)

🔥 News

[2024/07] GazeXplain code and datasets initially released.

📣 Overview

We introduce GazeXplain, a novel scanpath explanation task to understand human visual attention. We provide ground-truth explanations on various eye-tracking datasets and develop a model architecture for predicting scanpaths and generating natural language explanations. This example reveals how observers strategically investigate a scene to find out whether the person is walking on the sidewalk. Fixations (circles) start centrally, locate a driving car, then shift to the sidewalk to find the person, and finally move down to confirm whether the person is walking. By annotating observers' scanpaths with detailed explanations, we enable a deeper understanding of the what and why behind fixations, providing insights into human decision-making and task performance.

🙇‍♂️ Disclaimer

For the ScanMatch evaluation metric, we adopt part of the GazeParser package. We adopt the implementation of SED and STDE from VAME as two of our evaluation metrics mentioned in Visual Attention Models. More specifically, we adopt the evaluation metrics provided in Scanpath and Gazeformer, respectively. We slightly modify the checkpoint implementation from updown-baseline to accommodate our pipeline.

✅ Requirements

Python 3.10
PyTorch 2.1.2 (along with torchvision)
We also provide the conda environment file environment.yml. You can directly run

$ conda env create -f environment.yml

to create the same environment in which we successfully ran our code.
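As a quick sanity check after creating the environment, the version requirements above can be verified with a small script. This is only a minimal sketch and not part of the repository: the helper names `version_tuple` and `check_environment` are hypothetical, and the check assumes an exact match on Python 3.10 and PyTorch 2.1.2 is desired.

```python
import re
import sys


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.1.2+cu121' into (2, 1, 2).

    Any local build suffix (for example '+cu121') is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def check_environment(torch_version: str, python_version=sys.version_info) -> list:
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    if tuple(python_version[:2]) != (3, 10):
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"expected Python 3.10, found {found}")
    if version_tuple(torch_version) != (2, 1, 2):
        problems.append(f"expected PyTorch 2.1.2, found {torch_version}")
    return problems


# Typical use inside the activated environment:
#   import torch
#   print(check_environment(torch.__version__) or "environment looks fine")
```

Passing the PyTorch version in as a string keeps the helper importable even on a machine where PyTorch is not installed.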

📑 Datasets

Our GazeXplain dataset is released! You can download the dataset from Link. This dataset contains the explanations of visual scanpaths in three different scanpath datasets (OSIE, AiR-D, COCO-Search18).

💻 Preprocess

To process the data, you can follow the instructions provided in Scanpath and Gazeformer. For handling the SS cluster, you can refer to Gazeformer and Target-absent-Human-Attention. More specifically, you can run the following scripts to process the data.

$ python ./src/preprocess/${dataset}/preprocess_fixations.py

$ python ./src/preprocess/${dataset}/feature_extractor.py
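To run both preprocessing steps for a dataset in one go, the two commands above can be generated programmatically. The following is a minimal sketch under stated assumptions: the helper `preprocess_commands` is hypothetical, and the value passed as `dataset` must match whatever folder name actually exists under src/preprocess in your checkout.

```python
import sys
from pathlib import PurePosixPath

# The two script names come from the commands listed above.
PREPROCESS_SCRIPTS = ("preprocess_fixations.py", "feature_extractor.py")


def preprocess_commands(dataset: str, python: str = sys.executable) -> list:
    """Build the argv lists for the two preprocessing steps of one dataset.

    `dataset` is the folder name under src/preprocess; it is not validated here.
    """
    script_dir = PurePosixPath("src/preprocess") / dataset
    return [[python, str(script_dir / name)] for name in PREPROCESS_SCRIPTS]


# Typical use from the repository root (runs fixation preprocessing first,
# then feature extraction, and stops at the first failure):
#   import subprocess
#   for command in preprocess_commands("<dataset>"):
#       subprocess.run(command, check=True)
```

Building the commands as lists rather than shell strings avoids quoting problems if a path contains spaces.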

We structure <dataset_root> as follows

🏃 Training your own network on ALL the datasets

We set all the corresponding hyper-parameters in opt.py.

The train_explanation_alignment.py script dumps checkpoints into the folder specified by --log_root (default = ./runs/). You can also set the other hyper-parameters in opt.py or define them in bash/train.sh.

--datasets Path to the dataset folder, e.g., <dataset_root>.
--epoch The total number of epochs.
--start_rl_epoch The epoch at which reinforcement learning starts.
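For readers who want to see how such options fit together, here is a minimal argparse sketch. It is not the repository's opt.py: only the flag names and the ./runs/ default for --log_root come from the text above. The numeric defaults, the assumption that --datasets takes a single path, and the helper `uses_reinforcement_learning` (including its inclusive comparison) are illustrative assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative subset of the training options described above."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    # Assumed to take a single path; the real option may differ.
    parser.add_argument("--datasets", type=str, required=True,
                        help="Path to the dataset folder, e.g. <dataset_root>")
    parser.add_argument("--log_root", type=str, default="./runs/",
                        help="Folder where checkpoints are written")
    # The two defaults below are hypothetical placeholders.
    parser.add_argument("--epoch", type=int, default=10,
                        help="Total number of epochs")
    parser.add_argument("--start_rl_epoch", type=int, default=5,
                        help="Epoch at which reinforcement learning starts")
    return parser


def uses_reinforcement_learning(epoch: int, start_rl_epoch: int) -> bool:
    """Assumed switch: RL is used from start_rl_epoch onward (inclusive)."""
    return epoch >= start_rl_epoch


# Typical use:
#   args = build_parser().parse_args()
#   for epoch in range(args.epoch):
#       rl_phase = uses_reinforcement_learning(epoch, args.start_rl_epoch)
```

Keeping the phase switch in a small function makes the schedule easy to test independently of the training loop.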

You can use the following command to train your own network. Afterwards, you can evaluate the performance of your trained model on the test split.

$ sh bash/train.sh

🚅 Evaluate on test split

For inference, we provide a pretrained model. You can directly run the following command to evaluate its performance on the test split.

$ sh bash/test.sh

✒️ Citation

If you use our code or data, please cite our paper:

@inproceedings{xianyu:2024:gazexplain,
    Author         = {Xianyu Chen and Ming Jiang and Qi Zhao},
    Title          = {GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths},
    booktitle      = {Proceedings of the European Conference on Computer Vision (ECCV)},
    Year           = {2024}
}
