AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval

Official PyTorch implementation of AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval

📖 Introduction

Text-guided image retrieval combines a reference image and text feedback into a multimodal query to search for the image that matches the user's intention. Recent approaches employ multi-level matching, multiple accesses, or multiple subnetworks to improve performance, regardless of the heavy storage and computation burden this brings at deployment. Moreover, these models rely on expert knowledge to handcraft image-text composing modules and perform inference with a static computational graph. This limits the representation capability and generalization ability of the networks when facing complex and varied combinations of reference image and text feedback. To break the shackles of the static network concept, we introduce a dynamic router mechanism that achieves data-dependent expert activation and flexible collaboration among multiple experts, exploring more implicit multimodal fusion patterns. Specifically, we construct our Adaptive Multi-expert Collaborative network (AMC) by using the proposed router to activate different experts with different levels of image-text interaction. Since the routers dynamically adjust expert activation for the current sample, AMC can adapt its fusion mode to different reference image and text combinations and generate dynamic computational graphs according to varied multimodal queries. Extensive experiments on two benchmark datasets demonstrate that, benefiting from the image-text composing representation produced by the adaptive multi-expert collaboration mechanism, AMC achieves better retrieval performance and zero-shot generalization ability than the state-of-the-art method while remaining lightweight and fast at retrieval.
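As a rough illustration of data-dependent expert activation, the sketch below lets a router compute softmax gates from a query feature and mixes the outputs of several experts accordingly. It is a toy under stated assumptions, not the AMC implementation: the names `softmax` and `route`, the linear router, and the three toy experts are all hypothetical, and plain Python lists stand in for tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(query, router_weights, experts):
    """Mix expert outputs with gates predicted from the query itself.

    query: list of floats (a hypothetical fused image-text feature)
    router_weights: one weight vector per expert (hypothetical linear router)
    experts: callables mapping a feature to a feature of the same size
    """
    # One routing score per expert: dot product of the query and that expert's router vector.
    scores = [sum(w * x for w, x in zip(row, query)) for row in router_weights]
    gates = softmax(scores)
    outputs = [expert(query) for expert in experts]
    # Gate-weighted sum of the expert outputs, computed per feature dimension.
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs))
             for i in range(len(query))]
    return gates, mixed

# Toy experts standing in for different levels of image-text interaction (assumption).
experts = [
    lambda f: list(f),               # identity: no extra interaction
    lambda f: [2.0 * v for v in f],  # scaling
    lambda f: [v + 1.0 for v in f],  # shifting
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

gates_a, out_a = route([3.0, 0.0], router_weights, experts)
gates_b, out_b = route([0.0, 3.0], router_weights, experts)
```

The two queries produce different gate vectors, so the same set of experts is combined differently for each sample, which is the sense in which the computation becomes data-dependent.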

🔥 Train

Training on the Fashion-IQ dataset

sh ./shell/IQ.sh

Training on the Shoes dataset

sh ./shell/shoes.sh

❄️ Evaluation

Evaluation on the Fashion-IQ and Shoes datasets

sh ./shell/eval_IQ.sh
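Retrieval benchmarks such as Fashion-IQ and Shoes are commonly reported with Recall@K. The snippet below is a minimal sketch of that metric, assuming each query has a single ground-truth gallery index and a row of similarity scores over the gallery. It is not taken from the evaluation scripts; `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarities, targets, k):
    """Fraction of queries whose target is among the k highest-scoring gallery items.

    similarities: list of rows, one per query, each a list of scores over the gallery
    targets: ground-truth gallery index for each query
    """
    hits = 0
    for row, target in zip(similarities, targets):
        # Gallery indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)

sims = [
    [0.9, 0.1, 0.3],  # query 0: target 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: target 1 is ranked second
]
targets = [0, 1]
r1 = recall_at_k(sims, targets, k=1)
r2 = recall_at_k(sims, targets, k=2)
```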

Evaluation of the ensemble model.

sh ./shell/eval_ensemble.sh

🏷️ Note: Before training or evaluating, change the default data and model paths to your own paths.

🔧 Setup and Environments

Python: 3.6
PyTorch: 1.7.1
GPU: RTX 3090
OS: Ubuntu 14.04.6 LTS

Install packages:

pip install -r requirements.txt

📁 Dataset

Download the Fashion-IQ dataset by following the instructions from XiaoxiaoGuo. Following XiaoxiaoGuo and CLVC-NET, we first resize the downloaded images with resize_images.py.

Because more and more download links to Fashion-IQ images are being taken down, you can also use the stored dataset version from the author of CoSMo.

The original download link for the Shoes dataset appears to be inaccessible. To facilitate follow-up studies in this field, we have uploaded a version of the Shoes dataset to Google Drive. Please be aware that this link is not permanent and may be taken down in the future. We do not own this dataset, so please remember to credit its original source.

📌 Pretrained Model Weight

The pretrained weights are stored in Google Drive. There are two sets of model weights, DCR_sim_0 and DCR_sim_1, which can be used together to evaluate the ensemble performance.
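One common way to ensemble two retrieval models is to average their query-to-gallery similarity scores before ranking. The sketch below shows that idea only; whether eval_ensemble.sh combines the two checkpoints in exactly this way is an assumption, and `ensemble_similarities`, `top1`, and the toy scores are hypothetical.

```python
def ensemble_similarities(sims_a, sims_b):
    """Element-wise mean of two query-by-gallery similarity matrices (lists of lists)."""
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(sims_a, sims_b)]

def top1(row):
    """Index of the highest-scoring gallery item for one query."""
    return max(range(len(row)), key=lambda j: row[j])

# Hypothetical scores from two checkpoints for one query over three gallery images.
sims_model_0 = [[0.6, 0.5, 0.1]]
sims_model_1 = [[0.2, 0.7, 0.1]]

fused = ensemble_similarities(sims_model_0, sims_model_1)
best = top1(fused[0])
```

In this toy case the first model alone ranks gallery item 0 highest, while the averaged scores rank item 1 highest, so the second model's stronger preference changes the final ranking.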

🌈 Model Architecture

⚖️ Main Results

Fashion-IQ dataset

Shoes dataset

📝 Citation

If this codebase is useful to you, please cite our work:

@article{zhu2023amc,
  title={AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval},
  author={Zhu, Hongguang and Wei, Yunchao and Zhao, Yao and Zhang, Chunjie and Huang, Shujuan},
  journal={ACM Transactions on Multimedia Computing, Communications and Applications},
  year={2023}
}

🐼 Contacts

If you have any questions, please feel free to contact me: zhuhongguang1103@gmail.com or hongguang@bjtu.edu.cn.

📚 Reference

Lee, Seungmin, Dongwan Kim, and Bohyung Han. "CoSMo: Content-style modulation for image retrieval with text feedback." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Wen, Haokun, et al. "Comprehensive linguistic-visual composition network for image retrieval." Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2021.
