
Enhanced Semantic Similarity Learning Framework for Image-Text Matching

License: MIT

Official PyTorch implementation of the paper Enhanced Semantic Similarity Learning Framework for Image-Text Matching.

If you use any resources from this repo, please cite the paper with the following BibTeX entry.

@article{zhang2023enhanced,
  author={Zhang, Kun and Hu, Bo and Zhang, Huatian and Li, Zhe and Mao, Zhendong},
  journal={IEEE Transactions on Circuits and Systems for Video Technology}, 
  title={Enhanced Semantic Similarity Learning Framework for Image-Text Matching}, 
  year={2024},
  volume={34},
  number={4},
  pages={2973-2988}
}

Our codebase is built on the implementation of GPO.

Motivation

Squares denote local dimension elements of a feature. Circles denote the measure-unit, i.e., the minimal basic component used to examine semantic similarity. (a) Existing methods typically default to a static mechanism that examines only single-dimensional cross-modal correspondence. (b) Our key idea is to dynamically capture and learn multi-dimensional enhanced correspondence: the number of dimensions constituting a measure-unit is extended from a single dimension to hierarchical multi-levels, enriching the granularity of information each unit examines and promoting more comprehensive semantic similarity learning.

Introduction

In this paper, in contrast to single-dimensional correspondence with its limited semantic expressive capability, we propose novel enhanced semantic similarity learning (ESL), which generalizes both the measure-units and their correspondences into a dynamic learnable framework that examines multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise intra-modal multi-dimensional aggregators with an iterative enhancing mechanism, which dynamically capture new measure-units integrated from hierarchical multi-dimensions, yielding diverse combinatorial expressive capabilities that provide richer and more discriminative information for similarity examination. We then devise inter-modal enhanced correspondence learning with sparse contribution degrees, which determines cross-modal semantic similarity comprehensively and efficiently. Extensive experiments verify its superiority in achieving state-of-the-art performance.
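For intuition only, here is a minimal PyTorch sketch of these two ideas, not the repo's implementation: adjacent dimensions are aggregated into multi-dimensional measure-units at several granularities, and per-unit correspondences are fused with learned contribution weights, with a simple top-k mask standing in for the paper's sparse contribution degrees. The class name, group sizes, and top-k scheme are illustrative assumptions.

```python
# Minimal sketch of ESL's two components, assuming pooled (batch, dim)
# features. Group sizes, names, and the top-k mask are assumptions,
# not the repo's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDimSimilarity(nn.Module):
    def __init__(self, dim=1024, group_sizes=(1, 2, 4, 8), topk=256):
        super().__init__()
        self.group_sizes = group_sizes
        self.topk = topk
        # one learnable contribution degree per measure-unit, per level
        self.contrib = nn.ParameterList(
            [nn.Parameter(torch.ones(dim // g)) for g in group_sizes]
        )

    def forward(self, img, txt):  # img, txt: (batch, dim), matched pairs
        sims = []
        for level, g in enumerate(self.group_sizes):
            # aggregate g adjacent dimensions into one multi-dimensional unit
            i = F.normalize(img.view(img.size(0), -1, g).mean(2), dim=1)
            t = F.normalize(txt.view(txt.size(0), -1, g).mean(2), dim=1)
            corr = i * t                           # unit-wise correspondence
            w = F.softmax(self.contrib[level], dim=0)
            if self.topk < w.numel():              # crude sparsity stand-in:
                mask = torch.zeros_like(w)         # keep only the top-k
                mask[w.topk(self.topk).indices] = 1.0  # contribution degrees
                w = w * mask
                w = w / w.sum()
            sims.append((corr * w).sum(dim=1))     # similarity at this level
        return torch.stack(sims).mean(dim=0)       # fuse levels into one score
```

In the actual model, the aggregators are iteratively enhanced and the sparse contribution degrees are learned; see the paper for the precise formulation.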

Image-text Matching Results

The following tables show partial results of image-to-text and text-to-image retrieval on the COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder. This branch provides our code and pre-trained models with BERT as the text backbone; some results are better than those reported in the paper. Note that the ensemble results in the paper may not be reproducible from the two best checkpoints provided, as the original checkpoints were lost because they were not saved in time. You can train the model several more times and combine any two checkpoints to find the best ensemble performance. For the CLIP-based code and pre-trained models, please check out the CLIP-based branch.

Results of 5-fold evaluation on the COCO 1K Test Split

| Model | Visual Backbone | Text Backbone | Image-to-Text R1/R5/R10 | Text-to-Image R1/R5/R10 | Rsum | Link |
|-------|-----------------|---------------|-------------------------|-------------------------|------|------|
| ESL-H | BUTD region | BERT-base | 82.5 / 97.4 / 99.0 | 66.2 / 91.9 / 96.7 | 533.5 | Here |
| ESL-A | BUTD region | BERT-base | 82.2 / 96.9 / 98.9 | 66.5 / 92.1 / 96.7 | 533.4 | Here |

Results on the COCO 5K Test Split

| Model | Visual Backbone | Text Backbone | Image-to-Text R1/R5/R10 | Text-to-Image R1/R5/R10 | Rsum | Link |
|-------|-----------------|---------------|-------------------------|-------------------------|------|------|
| ESL-H | BUTD region | BERT-base | 63.6 / 87.4 / 93.5 | 44.2 / 74.1 / 84.0 | 446.9 | Here |
| ESL-A | BUTD region | BERT-base | 63.0 / 87.6 / 93.3 | 44.5 / 74.4 / 84.1 | 447.0 | Here |

Results on Flickr30K Test Split

| Model | Visual Backbone | Text Backbone | Image-to-Text R1/R5/R10 | Text-to-Image R1/R5/R10 | Rsum | Link |
|-------|-----------------|---------------|-------------------------|-------------------------|------|------|
| ESL-H | BUTD region | BERT-base | 83.5 / 96.3 / 98.4 | 65.1 / 87.6 / 92.7 | 523.7 | Here |
| ESL-A | BUTD region | BERT-base | 84.3 / 96.3 / 98.0 | 64.1 / 87.4 / 92.2 | 522.4 | Here |

Preparation

Environment

We recommend the following dependencies.

  • Python 3.6
  • PyTorch 1.8.0
  • NumPy (>1.19.5)
  • TensorBoard
  • The full environment specification is provided in ESL.yaml; run conda env create -f ESL.yaml to create the corresponding environment (a quick version check follows this list).
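After activating the environment, you can sanity-check that the installed versions match the recommendations above; this snippet is just a convenience, not part of the repo.

```python
# Verify the environment matches the recommended dependency versions.
import sys
import numpy
import torch

print("python:", sys.version.split()[0])   # expect 3.6.x
print("torch: ", torch.__version__)        # expect 1.8.0
print("numpy: ", numpy.__version__)        # expect > 1.19.5
```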

Data

You can download the datasets from Baidu Cloud: the download links are Flickr30K and MSCOCO, and the extraction code is: USTC.

Training

sh train_region_f30k.sh
sh train_region_coco.sh

For the dimensional selective mask, we design both heuristic and adaptive strategies. Use the flag in vse.py (line 44)

heuristic_strategy = False

to control which strategy is selected: True selects the heuristic strategy, False the adaptive one.
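For intuition, the two strategies could look roughly like the sketch below. Only the flag name heuristic_strategy comes from vse.py; the mask mechanics are assumed simplifications.

```python
# Illustrative sketch of the two dimensional-selective-mask strategies.
# Only the flag name `heuristic_strategy` comes from vse.py; the rest is
# an assumed simplification, not the repo's actual logic.
import torch
import torch.nn as nn

def heuristic_mask(dim, keep_ratio=0.5):
    """Heuristic: a fixed, data-independent selection pattern."""
    mask = torch.zeros(dim)
    mask[: int(dim * keep_ratio)] = 1.0        # e.g. keep the first half
    return mask

class AdaptiveMask(nn.Module):
    """Adaptive: a learned gate scores each dimension per sample."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, feat):                   # feat: (batch, dim)
        return torch.sigmoid(self.gate(feat))  # soft mask in [0, 1]

heuristic_strategy = False                     # True -> heuristic, False -> adaptive
```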

Evaluation

Test on Flickr30K

python test.py

To perform 5-fold cross-validation on MSCOCO, pass fold5=True when evaluating a model trained with --data_name coco_precomp.

python testall.py
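For reference, the standard COCO 1K protocol splits the 5K test images into five folds of 1,000 and averages the recalls over folds. A simplified sketch is below; testall.py implements the real evaluation, and compute_recalls is a hypothetical helper.

```python
# Simplified sketch of 5-fold COCO 1K evaluation. Assumes one embedding per
# image (5000, d) and five captions per image (25000, d); `compute_recalls`
# is a hypothetical helper returning e.g. (R@1, R@5, R@10).
import numpy as np

def evaluate_fold5(img_embs, txt_embs, compute_recalls):
    fold_results = []
    for i in range(5):
        im = img_embs[i * 1000:(i + 1) * 1000]      # 1,000 images
        tx = txt_embs[i * 5000:(i + 1) * 5000]      # their 5,000 captions
        sims = im @ tx.T                            # similarity matrix
        fold_results.append(compute_recalls(sims))
    return np.mean(fold_results, axis=0)            # average over folds
```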

To ensemble models, specify model_path in test_stack.py and run

python test_stack.py
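Conceptually, ensembling two checkpoints amounts to averaging their similarity scores before computing recalls. The sketch below illustrates this under assumed helper names (load_and_score, compute_recalls); the actual logic is in test_stack.py.

```python
# Hypothetical two-checkpoint ensembling: average similarity matrices,
# then compute recalls. Helper names are assumptions; see test_stack.py
# for the actual implementation.
import numpy as np

def ensemble(model_paths, load_and_score, compute_recalls):
    sims = [load_and_score(p) for p in model_paths]  # each: (n_imgs, n_caps)
    return compute_recalls(np.mean(sims, axis=0))    # simple score averaging
```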
