meiyor / DeepGaze-Text-Embedding-Map

DeepGaze + Text-Embedding-Map project developed at Cardiff University - Christoph Teufel Lab

DeepGaze adding Text-Embedding-Map features

This repository includes the implementation of DeepGaze extended with Text-Embedding-Maps (TEM) (Barman et al., 2020; Yang et al., 2017) for robustly predicting human fixations/gaze.

We used the intersection between COCO and SALICON to perform our evaluations, taking the panoptic annotations/segmentations from COCO and the fixations from SALICON. SALICON stores the fixations for a COCO subset in train, val, and test folders.
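
If you want to check this intersection yourself, a minimal sketch along the lines below works, assuming the standard COCO panoptic JSON layout and SALICON file naming (the paths are placeholders):

import json
import os
# Hypothetical paths: point them at the unzipped COCO panoptic annotations
# and at one of the SALICON fixation folders.
PANOPTIC_JSON = "annotations/panoptic_train2017.json"
SALICON_FIX_DIR = "fixations/train"
# Image ids that have a COCO panoptic annotation
with open(PANOPTIC_JSON) as f:
    coco_ids = {ann["image_id"] for ann in json.load(f)["annotations"]}
# SALICON fixation files are named after the COCO images,
# e.g. COCO_train2014_000000123456.mat -> id 123456
salicon_ids = {int(os.path.splitext(name)[0].split("_")[-1])
               for name in os.listdir(SALICON_FIX_DIR)}
common = coco_ids & salicon_ids
print(f"{len(common)} images have both panoptic annotations and SALICON fixations")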

First, install all the dependencies by running the following commands. Please make sure pip is installed and up to date in your shell before running them:

pip install -r requirements.txt
pip install torch===1.6.0 torchvision===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/cocodataset/panopticapi.git
pip install -U mittens
# note: "random" is part of the Python standard library, so no separate install is needed

Subsequently, download the image data from the following links:

COCO images

Download the SALICON fixations from the LSUN challenge webpage here:

SALICON fixations

Please download all the train, val, and test fixations if you want to run the full DeepGaze+TEM experiments on the whole SALICON dataset.

If you don't want to create the fixation hdf5 files manually, you can use the fixation data from the hdf5 files included in the experiments_root/ folder.
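
If you just want to peek inside those hdf5 files before using them, a quick inspection with h5py is enough; the path below is a placeholder and the dataset names depend on how the files were written:

import h5py
# The path is a placeholder: point it at one of the files in experiments_root/.
def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
with h5py.File("experiments_root/fixations_train.hdf5", "r") as f:
    f.visititems(show)  # prints every dataset with its shape and dtype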

Also download the COCO panoptic segmentations from the link below and unzip them; use the train and val panoptic sets released in 2017 and 2020:

COCO download

Unzip the file and place the folders in the right locations; you will use them when running the code and can modify the paths in the code if you need to:

unzip COCO_subfolder_output.zip

First, generate the TEM. If you want to use pre-trained word embeddings, take into account the file file_annotations/sal_ground_truth_emb_SALICON_TEM_w.txt, which follows the structure of file_annotations/sal_ground_truth_emb_ADE20K_all_image_co_occur.txt. If you want to generate the embeddings from scratch, define your new training folder before running; refer to generate_objects_co_occur.py for the specific details:

python generate_TEM/generate_objects_co_occur.py

This will generate a new co-occurrence matrix in a file called sal_cooccur_mat_new.txt. To obtain the new embeddings, follow the instructions of the Mittens package and load sal_cooccur_mat_new.txt as a CSV. Don't use the default files; they are old files from another object set with fewer scenes. Please run generate_objects_co_occur.py from scratch to generate your custom co-occurrence matrix.
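
As a rough sketch of that Mittens step (the exact layout of sal_cooccur_mat_new.txt is an assumption here, and the output file name is only illustrative), fitting 300-dimensional GloVe-style embeddings could look like this:

import csv
import numpy as np
from mittens import GloVe  # use mittens.Mittens instead to retrofit pre-trained vectors
# Assumption: sal_cooccur_mat_new.txt is a CSV whose first column holds the
# object/scene labels and whose remaining columns hold the co-occurrence counts.
labels, rows = [], []
with open("sal_cooccur_mat_new.txt") as f:
    for row in csv.reader(f):
        labels.append(row[0])
        rows.append([float(x) for x in row[1:]])
cooccurrence = np.array(rows)
# 300-dimensional semantic space, matching the TEM dimensionality used below.
glove = GloVe(n=300, max_iter=1000)
embeddings = glove.fit(cooccurrence)  # shape: (len(labels), 300)
# Save one "label v1 v2 ... v300" line per object/scene as a plain-text embedding file.
with open("sal_new_embeddings.txt", "w") as f:
    for label, vec in zip(labels, embeddings):
        f.write(label + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")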

Once you have your embedding file, you can create your TEM images by assigning an output folder for them and running:

python generate_TEM/generate_TEM_train.py
python generate_TEM/generate_TEM_val.py

The previous calls will generate the TEM images in .tiff format, containing the 300 dimensions of the semantic space obtained with Mittens and encoding the difference between the semantic-space vectors of the annotated scene and of each object in the scene.
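
Conceptually, each TEM pairs every pixel with the 300-dimensional difference between the object vector and the scene vector. The sketch below illustrates that idea only; the actual generate_TEM_train.py handles the COCO panoptic format and file naming, and the function and argument names here are made up for illustration:

import numpy as np
def build_tem(segmentation, object_labels, scene_label, embeddings):
    """Build an H x W x 300 TEM from a panoptic segmentation.
    segmentation  : H x W array of segment ids
    object_labels : dict mapping segment id -> object/category name
    scene_label   : name of the annotated scene (e.g. "kitchen")
    embeddings    : dict mapping name -> 300-d numpy vector (e.g. from Mittens)
    """
    scene_vec = embeddings[scene_label]
    tem = np.zeros(segmentation.shape + (scene_vec.shape[0],), dtype=np.float32)
    for seg_id, label in object_labels.items():
        # each object's pixels get the object-minus-scene semantic vector
        tem[segmentation == seg_id] = embeddings[label] - scene_vec
    return tem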

If you want to do the same but create TEM images using, for instance, the Cosine, Euclidean, and Chebyshev distances between the annotated scene and the objects, run:

python generate_TEM/generate_TEM_train_dist.py
python generate_TEM/generate_TEM_val_dist.py
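
For the distance-based variants, the per-object values reduce to scalar distances between the object and scene vectors; scipy already provides the three distances mentioned above, as in this illustrative sketch:

import numpy as np
from scipy.spatial.distance import chebyshev, cosine, euclidean
def semantic_distances(object_vec, scene_vec):
    """Scalar distances between an object vector and the scene vector."""
    return {"cosine": cosine(object_vec, scene_vec),
            "euclidean": euclidean(object_vec, scene_vec),
            "chebyshev": chebyshev(object_vec, scene_vec)}
# Example with random 300-d vectors standing in for the Mittens embeddings.
rng = np.random.default_rng(0)
print(semantic_distances(rng.normal(size=300), rng.normal(size=300)))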

Now you must create the centerbias files for the stimuli and the TEM images. To do that, make sure the stimuli_train.hdf5, fixations_train.hdf5, stimuli_val.hdf5, fixations_val.hdf5, stimuli_TEM_train.hdf5, and stimuli_TEM_val.hdf5 files are located in the experiment_root/ folder. If you want to create your own stimuli and fixations hdf5 files, feel free to modify the create_stimuli.py and create_fixations.py files. Subsequently, you can run:

python create_centerbias.py
python create_centerbias_TEM.py
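
For reference, a center bias is essentially a smoothed log-density of all fixation positions over the image plane. The sketch below shows one way such a map could be computed; the hdf5 dataset keys and the output file name are assumptions, not necessarily what create_centerbias.py does:

import h5py
import numpy as np
from scipy.ndimage import gaussian_filter
# Assumption: the fixation hdf5 stores coordinates in "x" and "y" datasets;
# adjust the keys to whatever create_fixations.py actually wrote.
with h5py.File("experiment_root/fixations_train.hdf5", "r") as f:
    xs = np.asarray(f["x"]).astype(int)
    ys = np.asarray(f["y"]).astype(int)
height, width = 480, 640  # SALICON images are 640 x 480
counts = np.zeros((height, width))
np.add.at(counts, (np.clip(ys, 0, height - 1), np.clip(xs, 0, width - 1)), 1)
# Smooth the fixation counts, normalize to a density, and take the log.
density = gaussian_filter(counts, sigma=30)
density /= density.sum()
centerbias = np.log(density + 1e-12)
np.save("centerbias_train.npy", centerbias)  # hypothetical output name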

Now you are ready to run the training! Check the .yaml configuration file, in this case config_dgIII_TEM.yaml, before executing the training/test as follows:

python run_dgIII_evaluation_plus_TEM.py

While the training and test are running, a log.txt file will show you the current status of the learning; once run_dgIII_evaluation_plus_TEM.py is done executing, the file results_TEM.csv will show you the final Log-likelihood (LL), Information Gain (IG), Area Under the Curve (AUC), and Normalized Scanpath Saliency (NSS).

The LL evolution across the training epochs, including the TEM features, can be observed in the following figures. An AdaBound optimizer and a final dropout layer (before the Finalizer) must be added to the network to avoid overfitting. The full pipeline of our new semantic-based gaze-prediction approach, DeepGaze+TEM, is shown in the following figure:

The performance comparison between the DeepGaze (DG) baseline and our DeepGaze+TEM approach is shown below. Use the code in the matlab_metrics directory to compute the overall results from the .csv file obtained after training and testing with run_dgIII_evaluation_plus_TEM.py:

                IG              LL              AUC             NSS
DeepGaze        0.5414±0.5714   1.2296±0.5714   0.8317±0.0562   1.5268±0.7245
DeepGaze + TEM  0.5662±0.5816   1.2556±0.5816   0.8333±0.0563   1.5445±0.7661
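
If you prefer Python over the MATLAB scripts, a summary such as the table above can also be produced with pandas; the column names below are assumptions and should be matched to the actual header of results_TEM.csv:

import pandas as pd
# Column names are assumptions; match them to the actual header of results_TEM.csv.
results = pd.read_csv("results_TEM.csv")
for metric in ["IG", "LL", "AUC", "NSS"]:
    print(f"{metric}: {results[metric].mean():.4f} ± {results[metric].std():.4f}")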

A statistical comparison was performed using paired tests, evaluating the normality of each metric and correcting the p-values, which results in the effect plot below. All metrics show a significant improvement for all the paired tests, as shown in the following table. The tests marked with * are less affected by sample normality, for instance the signrank test.

            IG         LL         NSS        AUC
ttest       2.25E-12   2.25E-12   1.72E-5    5.41E-7
signtest*   5.56E-12   5.56E-12   1.33E-9    1.41E-5
signrank*   3.03E-14   3.03E-14   2.94E-9    7.88E-8
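
The same kind of paired comparison can be reproduced with scipy (paired t-test, sign test via a binomial test on the signs, and Wilcoxon signed-rank); the per-image metric arrays below are placeholders, not the values used for the table:

import numpy as np
from scipy import stats
def paired_tests(dg, dg_tem):
    """Paired tests on per-image metric values for DG vs DG+TEM."""
    diff = np.asarray(dg_tem) - np.asarray(dg)
    # Sign test: binomial test on the number of positive differences.
    sign_p = stats.binomtest(int((diff > 0).sum()), int((diff != 0).sum()), 0.5).pvalue
    return {"ttest": stats.ttest_rel(dg_tem, dg).pvalue,
            "signtest": sign_p,
            "signrank": stats.wilcoxon(dg_tem, dg).pvalue}
# Placeholder arrays; replace them with the per-image metric columns for each model.
rng = np.random.default_rng(0)
dg = rng.normal(1.2, 0.5, size=500)
print(paired_tests(dg, dg + rng.normal(0.03, 0.05, size=500)))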

A good example of these metrics when using the TEM features is shown in the next figure. The example shows the saliency-map pattern using a jet colormap, together with the fixation ground truth, the panoptic annotation image, and the results for the DG baseline and our proposed DG+TEM, with DG+TEM giving better results. The panoptic ground-truth image can be obtained from the COCO dataset or from the panoptic/semantic scene segmentation networks proposed in the remarkable paper by Zhou et al., 2017. With the DG+TEM approach, the contour of the person and the direction of the racket's movement are followed more closely by the system than with DG, since the person object does not co-occur very often in court scenes.

In this example, the DG+TEM approach follows the cake and the baby more closely because they have fewer co-occurrences in kitchen scenes, and it removes part of the saliency heatmap from the stove, the oven, and even the fridge on the right side in comparison with DG alone. We can state that DG+TEM works for particular cases in SALICON where one or more annotated objects introduce a semantic incoherence or a modulation of the bottom-up attention.
