weitong8591 / vop

Code for Breaking the Frame: Image Retrieval by Visual Overlap Prediction


Breaking the Frame: Image Retrieval by Visual Overlap Prediction

arxiv

Summary

The proposed method identifies visible image sections without requiring expensive feature detection and matching. It obtains patch-level embeddings from a Vision Transformer backbone and establishes patch-to-patch correspondences; a voting mechanism then aggregates these correspondences into overlap scores for candidate database images, providing a nuanced image retrieval metric in challenging scenarios.
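As a rough illustration of the idea (not the exact implementation in this repository; the embedding shapes, the similarity threshold, and the mutual-nearest-neighbour check are assumptions), the overlap score between a query and a database image can be computed by matching their patch embeddings and counting votes:

import numpy as np

def overlap_score(query_patches, db_patches, threshold=0.8):
    # query_patches: (Nq, D), db_patches: (Nd, D) L2-normalized patch embeddings
    sims = query_patches @ db_patches.T          # cosine similarities
    best_db = sims.argmax(axis=1)                # best db patch for each query patch
    best_q = sims.argmax(axis=0)                 # best query patch for each db patch
    votes = 0
    for q, d in enumerate(best_db):
        # mutual nearest neighbours above the threshold cast a vote
        if best_q[d] == q and sims[q, d] > threshold:
            votes += 1
    return votes / len(query_patches)            # normalized overlap score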


Installation

torch == 2.2.2
Python == 3.10.13
OmegaConf == 2.3.0
h5py == 3.11.0
tqdm == 4.66.2
faiss == 1.8.0
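
A minimal environment along these lines should work (the conda/pip split and the faiss build are assumptions; adjust the torch/CUDA build for your machine):

conda create -n vop python=3.10.13
conda activate vop
pip install torch==2.2.2 omegaconf==2.3.0 h5py==3.11.0 tqdm==4.66.2
pip install faiss-cpu==1.8.0  # or install faiss via conda for GPU support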

Here we visualize the patch matches on an example image pair found by the trained encoder, VOP. No preparation is needed: the example images and the model are downloaded automatically, so just try it. Feel free to change the image paths to play with your own data.

Evaluation

Step 1. Dump the image pairs and save the GT information (e.g., R, K), the pretrained DINOv2 [CLS] tokens, and the patch embeddings (e.g., 1024-dim for the large model).

Step 2. Load the trained encoders to build our own embeddings (e.g., 256-dim), run the retrieval process (CLS tokens for prefiltering, VOP for reranking), and save the retrieved image pair list (a minimal sketch of this stage is given after Step 3).

Step 3. Verify the retrieved image pairs by sending them to relative pose estimation, or to hloc for localization.
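
To make Step 2 concrete, here is a minimal sketch of the two-stage retrieval, reusing the overlap_score sketch from the Summary and assuming the embeddings have already been dumped; the array names, shapes, and the prefilter size are illustrative, not the repository's actual API:

import numpy as np
import faiss

def retrieve(query_cls, db_cls, query_patches, db_patches, pre_filter=100, k=40):
    # Stage 1: prefilter with global [CLS] tokens (inner product on normalized vectors)
    index = faiss.IndexFlatIP(db_cls.shape[1])
    index.add(db_cls.astype(np.float32))
    _, candidates = index.search(query_cls.astype(np.float32), pre_filter)

    # Stage 2: rerank the candidates by the patch-level overlap score
    results = []
    for qi, cand in enumerate(candidates):
        scores = [overlap_score(query_patches[qi], db_patches[ci]) for ci in cand]
        order = np.argsort(scores)[::-1][:k]
        results.append(cand[order])
    return results   # top-k database indices per query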

Below are the instructions for testing each dataset used in our paper, as well as for testing your own data.

💥 Important: before data dumping, create/update the original data directory for the specific dataset in dump_datasets/data_dirs.yaml, for example:

dataset_dirs:
  inloc: <src_path>
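
The file holds one entry per dataset you want to dump; the keys below other than inloc are assumptions matching the -ds names used in this README:

dataset_dirs:
  inloc: <src_path>
  megadepth: <src_path>
  eth3d: <src_path>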
[InLoc]
  1. Download the cutouts (db images) and format the data into database/cutouts/; download the query images into query/iphone7/.
  2. Dump the data and perform image retrieval to get the most overlapping image list (top-40 on InLoc).
python dump_data.py -ds inloc
python retrieve.py -ds inloc -k 40 -m 09 -v 3 -r 0.3 -pre 100 -cls 1
  3. Install and run hloc to localize the query images.
python inloc_localization.py --loc_pairs outputs/inloc/09/cls_100/top40_overlap_pairs_w_auc.txt -m 09 -ds inloc
  4. Submit the resulting poses to the long-term visual localization benchmark.
[MegaDepth]
  1. Download the data from glue-factory: images, scene_info.

  2. Dump the data and perform image retrieval to get the most overlapping image list.

python dump_data.py -ds megadepth
python register.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth
  3. Run RANSAC on those pairs to estimate relative poses.
python relative_pose.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth
  4. Optional tests: recall@1, 5, 10.
python recall.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth

Note: use -v 4 -r 0.2 for recall@10; -v 0 -r 0.01 for recall@1.

[ETH3D]
  1. Download the ETH3D data (5.6 GB).
  2. Dump the data and perform image retrieval to get the most overlapping image list.
python dump_data.py -ds eth3d
python register.py -k 5 -m 09 -v 3 -r 0.3 -pre 20 -cls -ds eth3d
  3. Run RANSAC on those pairs to estimate relative poses.
python relative_pose.py -k 5 -m 09 -v 3 -r 0.3 -pre 20 -cls -ds eth3d
[Your own data]
  1. Specify the directory of your data in data_dirs.yaml, and add a dump script for it, alongside the existing dataset scripts, that loads the images, the scene information (K, pose, etc.), and the query and database image lists if needed (a skeleton is sketched after this list).

  2. Run retrieve.py to retrieve the queries when there is a query/database split; use register.py when each image in the pool is retrieved against all the others.

  3. Run relative_pose.py for relative pose estimation, or inloc_localization.py to localize the queries with the retrieved db images.
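
As a rough starting point for step 1, a custom dump script might follow the shape below; the function name, the h5py layout, and the stored fields are illustrative assumptions, not the exact interface expected by this repository:

import h5py
import numpy as np

def dump_my_dataset(image_list, out_path):
    # image_list: [(image_name, K, R, t), ...] gathered from your own scene files
    with h5py.File(out_path, "w") as f:
        for name, K, R, t in image_list:
            g = f.create_group(name)
            # ground-truth information used later for relative pose evaluation
            g.create_dataset("K", data=np.asarray(K))
            g.create_dataset("R", data=np.asarray(R))
            g.create_dataset("t", data=np.asarray(t))
            # patch embeddings / [CLS] tokens would be added here as well,
            # e.g. from the pretrained DINOv2 backbone (see Step 1 above)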

Training

  1. Download the depth maps of MegaDepth to build the training supervision from here.

  2. Customize the configs and start training.

python -m gluefactory.train 09 --conf train_configs/09.yaml

The training is based on glue-factory; here we provide details of the configurations we focus on.

data:
    # choose the data augmentation type: 'flip', 'dark', 'lightglue'
    photometric: {
            "name": "flip",
            "p": 0.95,
            # 'difficulty': 1.0,  # currently unused
        }

model:
    matcher:
        name: overlap_predictor # our model
        add_voting_head: true # whether to train with the contrastive loss on the patch-level negative/positive matches
        add_cls_tokens: false # whether to train the global embeddings
        attentions: false # whether to use the attentions for supervision
        input_dim: 1024 # the dimension of the pretrained DINO features

train:
  dropout_prob: 0.5    # dropout probability

Notes

[Useful configs]
--radius, radius for the radius-based kNN search.
--cls, default=False, action True, whether to use CLS tokens as a prefilter.
--pre_filter, default=20, the number of db images prefiltered for reranking.
--weighted, default=True, action True, whether to use TF-IDF weights for the voting scores.
--vote, voting method.
--k, top-k retrievals.
--overwrite, default=False, action True, overwrite the dumped data, retrieved image list, relative poses, etc.
--num_workers, default=8, change it to fit your machine.
[Acknowledgement]

glue-factory, the long-term visual localization benchmark, and pre-commit.

[Contact] Contact me at weitongln@gmail.com or weitong@fel.cvut.cz
