Patch-level Representation Learning for Self-supervised Vision Transformers (SelfPatch)

PyTorch implementation of "Patch-level Representation Learning for Self-supervised Vision Transformers" (accepted as an oral presentation at CVPR 2022).

Requirements

torch==1.7.0
torchvision==0.8.1

Pretraining on ImageNet

python -m torch.distributed.launch --nproc_per_node=8 main_selfpatch.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir --local_crops_number 8 --patch_size 16 --batch_size_per_gpu 128 --out_dim_selfpatch 4096 --k_num 4
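As a quick sanity check on the command above, the sketch below computes the effective global batch size implied by `--nproc_per_node` and `--batch_size_per_gpu`, and the number of patch tokens a ViT produces for a given patch size. The 224-pixel crop size is an assumption for illustration (it is not stated in this README), and both helper names are hypothetical.

```python
def effective_batch_size(nproc_per_node: int, batch_size_per_gpu: int, nodes: int = 1) -> int:
    """Total number of images per optimization step across all GPUs."""
    return nodes * nproc_per_node * batch_size_per_gpu


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


# Values taken from the pretraining command: 8 GPUs, 128 images per GPU.
print(effective_batch_size(nproc_per_node=8, batch_size_per_gpu=128))  # 1024

# Assumed 224x224 crop (not stated in this README) with --patch_size 16.
print(num_patch_tokens(image_size=224, patch_size=16))  # 196
```

If you train on fewer GPUs, the effective batch size shrinks proportionally, which may call for adjusting the learning rate.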

Pretrained weights on ImageNet

You can download the weights of the models pretrained on ImageNet. All models use the ViT-S/16 architecture. For the detection and segmentation downstream tasks, see SelfPatch/detection and SelfPatch/segmentation.

backbone	arch	checkpoint
DINO	ViT-S/16	download (pretrained model from VISSL)
DINO + SelfPatch	ViT-S/16	download
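Checkpoints saved from distributed training often store parameter names with wrapper prefixes, which can prevent them from matching a plain backbone. The sketch below shows one way to normalize such keys before loading. The prefixes `module.` and `backbone.` and the example key names are assumptions, not something this README confirms about the released checkpoints, and `strip_prefixes` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def strip_prefixes(
    state_dict: Dict[str, Any],
    prefixes: Iterable[str] = ("module.", "backbone."),  # assumed prefixes
) -> Dict[str, Any]:
    """Return a copy of state_dict with the given leading prefixes removed.

    Prefixes are stripped in the order given, so "module.backbone.x"
    becomes "x" with the default arguments.
    """
    cleaned = {}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Toy example with placeholder values standing in for tensors.
raw = {"module.backbone.cls_token": 1, "pos_embed": 2}
print(strip_prefixes(raw))  # {'cls_token': 1, 'pos_embed': 2}
```

With PyTorch installed, the cleaned dictionary could then be passed to a model's `load_state_dict`, after reading the file with `torch.load(path, map_location="cpu")`.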

Evaluating video object segmentation on the DAVIS 2017 dataset

Step 1. Prepare DAVIS 2017 data

cd $HOME
git clone https://github.com/davisvideochallenge/davis-2017
cd davis-2017
./data/get_davis.sh
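Before moving on, it can help to confirm that the download produced the directory layout the later steps expect. The sketch below checks for the usual DAVIS 2017 subdirectories. The subdirectory names are an assumption based on the standard DAVIS release, and `missing_davis_dirs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import List

# Assumed layout of the standard DAVIS 2017 release (not stated in this README).
EXPECTED_SUBDIRS = ("JPEGImages/480p", "Annotations/480p", "ImageSets/2017")


def missing_davis_dirs(davis_root: str) -> List[str]:
    """Return the expected subdirectories that are absent under davis_root."""
    root = Path(davis_root)
    return [rel for rel in EXPECTED_SUBDIRS if not (root / rel).is_dir()]


# Example usage (the path is a placeholder):
# missing = missing_davis_dirs("/path/to/davis-2017/DAVIS")
# if missing:
#     print("Missing directories:", missing)
```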

Step 2. Run video object segmentation

python eval_video_segmentation.py --data_path /path/to/davis-2017/DAVIS/ --output_dir /path/to/saving_dir --pretrained_weights /path/to/model_dir --arch vit_small --patch_size 16

Step 3. Evaluate the obtained segmentation

git clone https://github.com/davisvideochallenge/davis2017-evaluation $HOME/davis2017-evaluation
python /path/to/davis2017-evaluation/evaluation_method.py --task semi-supervised --davis_path /path/to/davis-2017/DAVIS --results_path /path/to/saving_dir
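The DAVIS benchmark reports region similarity J (the Jaccard index, i.e. intersection over union of the predicted and ground-truth masks), contour accuracy F, and their average J&F. The sketch below illustrates J and the J&F average on toy binary masks in plain Python. It is a simplified illustration, not the evaluation toolkit's implementation, and treating two empty masks as a perfect match is an assumption about the convention used.

```python
from typing import Sequence


def jaccard(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f_mean(j_mean: float, f_mean: float) -> float:
    """Average of the mean region similarity and mean contour accuracy."""
    return (j_mean + f_mean) / 2


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(jaccard(pred, gt))  # 0.5 (1 overlapping pixel, 2 pixels in the union)
print(j_and_f_mean(0.5, 1.0))  # 0.75
```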

Video object segmentation examples on the DAVIS 2017 dataset

Video (left), DINO (middle) and our SelfPatch (right)

Acknowledgement

Our code base is built partly upon the following packages: DINO, mmdetection, mmsegmentation, and XCiT.

Citation

If you use this code for your research, please cite our paper.

@InProceedings{Yun_2022_CVPR,
    author    = {Yun, Sukmin and Lee, Hankook and Kim, Jaehyung and Shin, Jinwoo},
    title     = {Patch-Level Representation Learning for Self-Supervised Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {8354-8363}
}
