- (Unofficial) PyTorch implementation of *Training Vision Transformers for Image Retrieval* (El-Nouby et al., 2021).
- I have not yet reproduced the exact results reported in the paper; in particular, in my runs the differential entropy regularization has little effect on the In-shop and SOP datasets.
```bash
# Python 3.7
pip install -r requirements.txt
```
```bash
# CUB-200-2011
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 2000 \
    --dataset cub200 \
    --data-path /data/CUB_200_2011 \
    --rank 1 2 4 8 \
    --lambda-reg 0.7
```
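The `--rank` flag sets the cutoffs K at which Recall@K is reported (the metric in the results table below). For reference, here is a minimal sketch of how Recall@K is typically computed from L2-normalized embeddings; `recall_at_k` and its arguments are illustrative names, not this repo's API:

```python
import torch

def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, ks=(1, 2, 4, 8)):
    """Fraction of queries whose K nearest gallery items include the query's class."""
    sims = query_emb @ gallery_emb.t()          # cosine similarities (unit-norm inputs)
    topk = sims.topk(max(ks), dim=1).indices    # [num_queries, max(ks)] gallery indices
    hits = gallery_labels[topk] == query_labels.unsqueeze(1)
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```

When query and gallery are the same image set (as in CUB-200 and SOP evaluation), fill the diagonal of `sims` with `-inf` first so that an image cannot retrieve itself.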
```bash
# Stanford Online Products
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 35000 \
    --dataset sop \
    --m 2 \
    --data-path /data/Stanford_Online_Products \
    --rank 1 10 100 1000 \
    --lambda-reg 0.7
```
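The `--m` flag presumably sets how many images of each class are drawn per batch, so that every image has at least one in-batch positive for the contrastive loss. A minimal sketch of such an m-per-class sampler, under that assumption (names are illustrative, not this repo's API):

```python
import random
from collections import defaultdict

def m_per_class_batches(labels, batch_size=64, m=2):
    """Yield batches of dataset indices containing m samples per sampled class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = [c for c, idxs in by_class.items() if len(idxs) >= m]
    while True:
        batch = []
        for c in random.sample(classes, batch_size // m):
            batch.extend(random.sample(by_class[c], m))
        yield batch
```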
```bash
# In-shop
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 35000 \
    --dataset inshop \
    --data-path /data/In-shop \
    --m 2 \
    --rank 1 10 20 30 \
    --memory-ratio 0.2 \
    --device cuda:2 \
    --encoder-momentum 0.999 \
    --lambda-reg 0.7
```
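The `--encoder-momentum` and `--memory-ratio` flags suggest a momentum (EMA) key encoder feeding a cross-batch feature memory, in the spirit of MoCo; the exact wiring inside this repo is not shown here. A minimal sketch of the EMA update (illustrative names):

```python
import copy
import torch
import torch.nn as nn

def make_key_encoder(encoder_q: nn.Module) -> nn.Module:
    """Frozen copy of the query encoder, updated only by the EMA step below."""
    encoder_k = copy.deepcopy(encoder_q)
    for p in encoder_k.parameters():
        p.requires_grad_(False)
    return encoder_k

@torch.no_grad()
def momentum_update(encoder_q: nn.Module, encoder_k: nn.Module, m: float = 0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q  (m is --encoder-momentum)."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)
```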
- IRT<sub>O</sub> – off-the-shelf extraction of features from a ViT backbone pre-trained on ImageNet;
- IRT<sub>L</sub> – fine-tuning the transformer with metric learning, in particular with a contrastive loss;
- IRT<sub>R</sub> – additionally regularizing the output feature space to encourage uniformity (see the sketch of the regularizer right after this list).
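The differential entropy regularizer behind IRT<sub>R</sub> is the KoLeo estimator of Sablayrolles et al.: it maximizes the log distance from each embedding to its nearest in-batch neighbor, which spreads features over the unit sphere. A minimal PyTorch sketch; how this repo weights it against the contrastive term via `--lambda-reg` is an assumption flagged in the comment:

```python
import torch
import torch.nn.functional as F

def koleo_regularizer(z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """-1/n * sum_i log ||z_i - z_nn(i)||, where z_nn(i) is the nearest
    in-batch neighbor of z_i on the unit sphere."""
    z = F.normalize(z, dim=-1)
    sims = z @ z.t()
    sims.fill_diagonal_(-2.0)  # exclude self-matches from the NN search
    # on the unit sphere: ||a - b||^2 = 2 - 2 * <a, b>
    nn_dist = (2.0 - 2.0 * sims.max(dim=1).values).clamp_min(eps).sqrt()
    return -nn_dist.log().mean()

# One possible combination (assumption, not verified against this repo):
# loss = (1 - lambda_reg) * contrastive_loss + lambda_reg * koleo_regularizer(z)
```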
| Method | Backbone | SOP R@1 | R@10 | R@100 | R@1000 | CUB-200 R@1 | R@2 | R@4 | R@8 | In-Shop R@1 | R@10 | R@20 | R@30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IRT<sub>O</sub> | DeiT-S | 53.12 | 68.96 | 81.60 | 94.09 | 58.68 | 71.30 | 80.96 | 88.18 | 31.28 | 57.03 | 64.20 | 68.28 |
| IRT<sub>L</sub> | DeiT-S | 83.56 | 93.29 | 97.23 | 99.03 | 73.68 | 82.58 | 88.77 | 92.71 | 93.09 | 98.28 | 98.74 | 99.02 |
| IRT<sub>R</sub> | DeiT-S | 82.67 | 92.73 | 96.69 | 98.80 | 73.73 | 82.91 | 89.30 | 93.35 | 90.47 | 97.97 | 98.61 | 98.92 |
| IRT<sub>R</sub> | DeiT-S† | 82.70 | 92.85 | 96.92 | 98.86 | 76.55 | 85.26 | 90.92 | 94.65 | 90.66 | 98.16 | 98.68 | 98.99 |

- †: model pre-trained with distillation from a convnet teacher trained on ImageNet-1k.
- El-Nouby, Alaaeldin, et al. "Training vision transformers for image retrieval." arXiv preprint arXiv:2102.05644 (2021).