Optimal Transport Aggregation for Visual Place Recognition

Sergio Izquierdo, Javier Civera

Code and models for Optimal Transport Aggregation for Visual Place Recognition (DINOv2 SALAD).

Summary

We introduce DINOv2 SALAD, a Visual Place Recognition model that achieves state-of-the-art results on common benchmarks. It makes two main contributions:

A fine-tuned DINOv2 encoder that provides richer and more powerful features.
A new aggregation technique based on optimal transport to create a global descriptor. This aggregation extends NetVLAD to consider feature-to-cluster relations as well as cluster-to-feature relations. It also includes a dustbin to discard uninformative features.
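
The aggregation idea can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the function name `sinkhorn_with_dustbin`, the marginals (one unit of mass per feature, one per cluster, the remainder for the dustbin), and the iteration count are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sinkhorn_with_dustbin(scores, dustbin_score, num_iters=20):
    """Soft-assign n features to m clusters plus one dustbin column.

    scores: n x m list of feature-to-cluster scores (assumes n > m).
    Returns an n x (m + 1) assignment whose rows each sum to 1.
    """
    n, m = len(scores), len(scores[0])
    # Append the dustbin as an extra column with a shared score.
    aug = [row + [dustbin_score] for row in scores]
    # Assumed marginals: mass 1 per feature, mass 1 per cluster, and the
    # remaining n - m units for the dustbin (so both sides total n).
    log_a = [0.0] * n
    log_b = [0.0] * m + [math.log(n - m)]
    u = [0.0] * n
    v = [0.0] * (m + 1)
    for _ in range(num_iters):
        # Column update: match the cluster and dustbin marginals.
        v = [log_b[j] - logsumexp([aug[i][j] + u[i] for i in range(n)])
             for j in range(m + 1)]
        # Row update: match the feature marginals.
        u = [log_a[i] - logsumexp([aug[i][j] + v[j] for j in range(m + 1)])
             for i in range(n)]
    return [[math.exp(aug[i][j] + u[i] + v[j]) for j in range(m + 1)]
            for i in range(n)]


# Toy scores for 4 features and 2 clusters (made-up numbers).
scores = [[2.0, 0.1], [0.2, 1.5], [0.0, 0.1], [1.0, 1.1]]
plan = sinkhorn_with_dustbin(scores, dustbin_score=0.5)
assignment = [row[:-1] for row in plan]  # drop the dustbin column
```

Dropping the last column leaves the feature-to-cluster weights; whatever mass a feature sends to the dustbin is simply not used when building the descriptor.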

For more details, see the paper on arXiv.

Setup

The code has been tested on PyTorch 2.1.0 with CUDA 12.1 and xFormers. Create a ready-to-run environment with:

conda env create -f environment.yml

To quickly test and use our model, you can use Torch Hub:

import torch
model = torch.hub.load("serizba/salad", "dinov2_salad")
model.eval()
model.cuda()
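
DINOv2's ViT backbones split images into 14 × 14 pixel patches, so input sides are expected to be multiples of 14; the evaluation command in this README uses 322 = 23 × 14. A small helper like the one below can snap an arbitrary side length to the nearest valid value. The name `snap_to_patch_multiple` is hypothetical and not part of this repository, and the README does not state how other resolutions affect accuracy.

```python
def snap_to_patch_multiple(size, patch=14):
    """Round a side length to the nearest positive multiple of `patch`."""
    return max(patch, round(size / patch) * patch)


# Example side lengths (arbitrary choices for illustration).
side_a = snap_to_patch_multiple(320)  # 322
side_b = snap_to_patch_multiple(224)  # 224, already a multiple of 14
side_c = snap_to_patch_multiple(500)  # 504
```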

Dataset

For training, download the GSV-Cities dataset. For evaluation, download the desired datasets (MSLS, NordLand, SPED, or Pittsburgh).

Train

Training is done on GSV-Cities for 4 complete epochs and takes around 30 minutes on an NVIDIA RTX 3090. To train DINOv2 SALAD, run:

python3 main.py

After training, logs and checkpoints should be in the logs directory.

Evaluation

You can download a pretrained DINOv2 SALAD model from here. To evaluate it, run:

python3 eval.py --ckpt_path 'weights/dino_salad.ckpt' --image_size 322 322 --batch_size 256 --val_datasets MSLS Nordland

Dataset	R@1	R@5	R@10
MSLS Challenge	75.0	88.8	91.3
MSLS Val	92.2	96.4	97.0
NordLand	76.0	89.2	92.0
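
R@K in the table is recall at K. Under the usual Visual Place Recognition convention, a query counts as correct if at least one of its top-K retrieved database images is a true match. A minimal sketch of that metric follows; the function name `recall_at_k` and the toy data are illustrative assumptions, not code from this repository.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Percentage of queries with at least one true match in the top k.

    ranked_predictions: per query, database indices sorted best-first.
    ground_truth: per query, the set of database indices that are true matches.
    """
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, ground_truth)
        if any(p in truth for p in preds[:k])
    )
    return 100.0 * hits / len(ranked_predictions)


# Toy example with three queries (made-up indices).
ranked = [[4, 7, 1], [2, 9, 3], [5, 0, 8]]
truth = [{4}, {3}, {6}]
r1 = recall_at_k(ranked, truth, k=1)  # only the first query hits at rank 1
r3 = recall_at_k(ranked, truth, k=3)  # the second query hits at rank 3
```

Recall can only stay the same or grow as K increases, which is why each dataset's numbers rise from R@1 to R@10.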

Acknowledgements

This code is based on the amazing work of:

Cite

Here is the BibTeX entry to cite our paper:

@InProceedings{Izquierdo_CVPR_2024_SALAD,
    author    = {Izquierdo, Sergio and Civera, Javier},
    title     = {Optimal Transport Aggregation for Visual Place Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
}


GNU General Public License v3.0
