ZhangXu0963 / VSL

Visual Semantic Loss (VSL)

The code for the ICME 2023 paper "Image-text Retrieval via Preserving Main Semantics of Vision" [pdf].

We propose a semantic alignment strategy, Visual Semantic Loss (VSL), for image-text retrieval, and verify its effectiveness on top of the two models proposed in SGRAF.
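The full formulation is given in the paper; purely as an illustration of the idea (not the repository's actual implementation), the sketch below pairs a standard triplet ranking term with a term that pushes each image's cross-modal similarity profile toward its intra-visual similarity profile. All names here (vsl_sketch, margin) are hypothetical.

    import torch
    import torch.nn.functional as F

    def vsl_sketch(img_emb, txt_emb, margin=0.2):
        """Illustrative sketch only -- NOT the paper's exact loss.

        Combines a hinge-based triplet ranking term with a
        distribution-matching term that encourages each image's
        image-text similarity profile to preserve its image-image
        (main visual semantic) similarity profile.
        """
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)

        cross = img_emb @ txt_emb.t()   # image-text cosine similarities
        intra = img_emb @ img_emb.t()   # image-image cosine similarities

        # Triplet ranking term: matched pairs should beat mismatched ones.
        pos = cross.diag().unsqueeze(1)
        violations = (margin + cross - pos).clamp(min=0)
        mask = 1.0 - torch.eye(cross.size(0), device=cross.device)
        ranking = (violations * mask).sum()

        # Preservation term: pull the cross-modal similarity distribution
        # toward the intra-visual one (KL divergence over the batch).
        preserve = F.kl_div(F.log_softmax(cross, dim=1),
                            F.softmax(intra, dim=1),
                            reduction="batchmean")
        return ranking + preserve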

Introduction

The framework of VSL:

The experimental results:

| Dataset | Method | I→T R@1 | I→T R@5 | I→T R@10 | T→I R@1 | T→I R@5 | T→I R@10 |
|---------|--------|---------|---------|----------|---------|---------|----------|
| MSCOCO 1K | SGR+VSL | 78.5 | 96.2 | 98.6 | 63.0 | 89.9 | 95.3 |
| MSCOCO 1K | SAF+VSL | 78.3 | 96.0 | 98.6 | 63.0 | 89.9 | 95.3 |
| MSCOCO 1K | SGRAF+VSL | 80.1 | 96.5 | 98.8 | 64.8 | 90.7 | 95.9 |
| MSCOCO 5K | SGR+VSL | 57.7 | 84.3 | 91.0 | 41.4 | 70.5 | 80.8 |
| MSCOCO 5K | SAF+VSL | 56.2 | 84.4 | 91.3 | 41.4 | 70.4 | 81.0 |
| MSCOCO 5K | SGRAF+VSL | 60.2 | 86.6 | 92.5 | 43.3 | 72.2 | 82.5 |
| Flickr30K | SGR+VSL | 75.7 | 93.5 | 96.5 | 56.5 | 80.9 | 85.9 |
| Flickr30K | SAF+VSL | 75.9 | 93.9 | 97.5 | 57.9 | 82.7 | 88.9 |
| Flickr30K | SGRAF+VSL | 79.5 | 95.3 | 97.9 | 60.2 | 84.3 | 89.4 |

Requirements

import nltk
nltk.download('punkt')  # downloads the punkt tokenizer models

Download data and vocab

We follow SCAN and SGRAF to obtain image features and vocabularies, which can be downloaded with:

wget https://iudata.blob.core.windows.net/scan/data.zip
wget https://iudata.blob.core.windows.net/scan/vocab.zip

Another download link is provided by SGRAF:

https://drive.google.com/drive/u/0/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC
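If you want a quick sanity check that the archives unpacked where the scripts expect them, the snippet below assumes SCAN's file naming (`<split>_ims.npy` for region features and `<split>_caps.txt` for captions); adjust the paths if your layout differs.

    import numpy as np

    # Assumed SCAN-style layout: ./data/coco_precomp/ holds
    # {train,dev,testall}_ims.npy and matching *_caps.txt files.
    ims = np.load("./data/coco_precomp/train_ims.npy")
    print("image features:", ims.shape)

    with open("./data/coco_precomp/train_caps.txt", encoding="utf-8") as f:
        caps = f.readlines()
    print("captions:", len(caps))  # MSCOCO pairs 5 captions with each image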

Pre-trained models and evaluation

Put the pretrained models into "./checkpoint".

1. Evaluation of the pre-trained SGR+VSL and SAF+VSL models.

Modify the model_path, data_path, and vocab_path in the eval_single.py file, for example:

    evalrank(model_path="./checkpoint/SGR+VSL_COCO.pth.tar", data_path='./data', split="testall", fold5=True)

Then run:

(For SGR+VSL and SAF+VSL) python eval_single.py

Note that fold5=True is only for evaluation on MSCOCO 1K (the average over five 1K folds), while fold5=False is for MSCOCO 5K and Flickr30K. Pretrained models and log files can be downloaded from:
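For context, the 5-fold protocol splits MSCOCO's 5000 test images into five disjoint 1K subsets, scores each, and averages the recalls; a minimal sketch of that averaging (the real logic lives inside evalrank):

    import numpy as np

    def five_fold_average(per_fold_recalls):
        """Average (R@1, R@5, R@10) over the five 1K folds of the
        MSCOCO test set. With fold5=False, all 5000 images (or the
        1000 Flickr30K test images) are scored in one pass instead."""
        return tuple(np.mean(per_fold_recalls, axis=0))

    # Hypothetical per-fold image-to-text recalls:
    folds = [(78.1, 96.0, 98.5), (78.9, 96.3, 98.7),
             (78.2, 96.1, 98.6), (78.6, 96.4, 98.5),
             (78.7, 96.2, 98.7)]
    print(five_fold_average(folds))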

2. Evaluation of the pre-trained SGRAF+VSL model.

Modify the sgr_model_path, saf_model_path, data_path, and vocab_path in the eval_overall.py file, for example:

    evalrank(sgr_model_path="./checkpoint/SGR+VSL_COCO.pth.tar", saf_model_path="./checkpoint/SAF+VSL_COCO.pth.tar", data_path='./data', split="testall", fold5=True)

Then run:

(For SGRAF+VSL) python eval_overall.py

As above, fold5=True is only for MSCOCO 1K and fold5=False for MSCOCO 5K and Flickr30K. Pretrained models and log files can be downloaded from:

Training new models

Modify the data_path, vocab_path, model_name, and logger_name in the opts.py file. Then run train.py:

For MSCOCO:

(For SGR+VSL) python train.py --data_name coco_precomp --batch_size 128 --num_epochs 25 --lr_update 10 --learning_rate 0.0003 --module_name SGR

(For SAF+VSL) python train.py --data_name coco_precomp --batch_size 128 --num_epochs 25 --lr_update 10 --learning_rate 0.0003 --module_name SAF

For Flickr30K:

(For SGR+VSL) python train.py --data_name f30k_precomp --batch_size 128 --num_epochs 40 --lr_update 25 --learning_rate 0.0003 --module_name SGR

(For SAF+VSL) python train.py --data_name f30k_precomp --batch_size 128 --num_epochs 30 --lr_update 15 --learning_rate 0.0003 --module_name SAF
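The --lr_update flag follows the step-decay convention used in SCAN/SGRAF-style training code: the learning rate drops by a factor of 10 every lr_update epochs. A sketch of that schedule (an assumption on our part; check train.py and opts.py for the exact rule):

    def adjust_learning_rate(optimizer, epoch, base_lr=3e-4, lr_update=10):
        """Assumed step decay: lr = base_lr * 0.1 ** (epoch // lr_update)."""
        lr = base_lr * (0.1 ** (epoch // lr_update))
        for group in optimizer.param_groups:
            group["lr"] = lr
        return lr

    # With --learning_rate 0.0003 --lr_update 10:
    #   epochs 0-9 -> 3e-4, epochs 10-19 -> 3e-5, epochs 20+ -> 3e-6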

Ablation

1. Ablation study on data diversity.

Modify --batch_size to 32, 64, or 128. The results on MSCOCO 1K are shown below.

2. Ablation study on semantic similarity within the visual and textual modalities.

Modify the code at lines 501-530 of model.py. The results on MSCOCO 5K are shown below.
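For reference, intra-modal semantic similarity here simply means pairwise similarity computed within a single modality's embeddings; a minimal illustration (hypothetical helper, not the code at those lines):

    import torch
    import torch.nn.functional as F

    def intra_modal_similarity(emb):
        """Pairwise cosine similarity within one modality
        (image-image or text-text). Illustrative only."""
        emb = F.normalize(emb, dim=1)
        return emb @ emb.t()

    img_sim = intra_modal_similarity(torch.randn(8, 1024))  # image-image
    txt_sim = intra_modal_similarity(torch.randn(8, 1024))  # text-text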

Reference

If Visual Semantic Loss (VSL) is useful for your work, please cite the following paper. Since ICME 2023 has published the paper, please cite this official version. : )

@inproceedings{10219570,
author={Zhang, Xu and Niu, Xinzheng and Fournier-Viger, Philippe and Dai, Xudong},
booktitle={2023 IEEE International Conference on Multimedia and Expo (ICME)},
title={Image-text Retrieval via Preserving Main Semantics of Vision},
year={2023},
pages={1967-1972},
doi={10.1109/ICME55011.2023.00337}
}
