(👉 Under construction! For now, please check command.txt for the training commands. The current version still contains some redundancies, and the commands/instructions are not yet ready for a formal release. I will update them gradually, so please stay tuned!)
Our arXiv version is now available. Please check it out! 🔥🔥🔥
This repository contains the official PyTorch implementation for E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning. Our work builds on Visual Prompt Tuning (VPT), and we thank its authors for their great work.
As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into the self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure to systematically prune low-importance prompts while preserving model performance, which largely enhances the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks with considerably low parameter usage (e.g., 0.32% of model parameters on VTAB-1k). We anticipate that this work will inspire further exploration within the pretrain-then-finetune paradigm for large-scale models.
Figure 1: Overview of our E2VPT framework. Under the pretrain-then-finetune paradigm, only the prompts in the transformer's input and backbone are updated during the fine-tuning process, while all other components remain frozen. We further introduce pruning at two levels of granularity (i.e., token-wise and segment-wise) in (d) to eliminate unfavorable input prompts during rewinding.
See env_setup.sh
Note that you need to copy the file provided in the `timm_added` folder into `timm/models` (i.e., `anaconda3/envs/[envs-name]/lib/python3.7/site-packages/timm/models`) and register it in `__init__.py` by adding `from .vision_transformer_changeVK import *`.
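A minimal sketch of this step is shown below. It assumes the file shipped in timm_added is named vision_transformer_changeVK.py (matching the import above) and that your environment lives under the default Anaconda path; adjust the paths to your setup.

```bash
# Sketch only: [envs-name], the Python version, and the file name are placeholders/assumptions.
TIMM_MODELS=~/anaconda3/envs/[envs-name]/lib/python3.7/site-packages/timm/models

# Copy the modified ViT implementation into timm and register it in __init__.py.
cp timm_added/vision_transformer_changeVK.py "$TIMM_MODELS/"
echo "from .vision_transformer_changeVK import *" >> "$TIMM_MODELS/__init__.py"
```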
- E^2VPT related (an example command combining these options is shown after this list):
  - MODEL.P_VK.NUM_TOKENS: prompt length for the key-value prompts in self-attention
  - MODEL.P_VK.NUM_TOKENS_P: prompt length for the visual (input) prompts (similar to VPT, but with pruning and rewinding)
- Fine-tuning method specification ("P_VK" is the default method for E^2VPT):
  - MODEL.TRANSFER_TYPE
- Vision backbones:
  - DATA.FEATURE: specify which representation to use
  - MODEL.TYPE: the general backbone type, e.g., "vit" or "swin"
  - MODEL.MODEL_ROOT: folder with pre-trained model checkpoints
- Optimization related:
  - SOLVER.BASE_LR: learning rate for the experiment
  - SOLVER.WEIGHT_DECAY: weight decay value for the experiment
  - DATA.BATCH_SIZE: batch size for the experiment
- Dataset related:
  - DATA.NAME: dataset name
  - DATA.DATAPATH: where you put the datasets
  - DATA.NUMBER_CLASSES: number of classes
- Others:
  - OUTPUT_DIR: output directory for the final model and logs
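For illustration, a single run might be launched as follows. This is a hypothetical invocation: the entry point, config file, and all values below are placeholders based on the VPT-style setup, so please refer to command.txt for the verified commands.

```bash
# Hypothetical example (all values are placeholders); see command.txt for the exact commands.
python train.py \
    --config-file configs/prompt/cub.yaml \
    MODEL.TYPE "vit" \
    MODEL.TRANSFER_TYPE "P_VK" \
    MODEL.P_VK.NUM_TOKENS "10" \
    MODEL.P_VK.NUM_TOKENS_P "10" \
    DATA.FEATURE "sup_vitb16_imagenet21k" \
    MODEL.MODEL_ROOT "ckpts/" \
    DATA.NAME "CUB" \
    DATA.DATAPATH "data/" \
    DATA.NUMBER_CLASSES "200" \
    DATA.BATCH_SIZE "64" \
    SOLVER.BASE_LR "0.1" \
    SOLVER.WEIGHT_DECAY "0.01" \
    OUTPUT_DIR "output/"
```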
Since preparing all of the datasets takes time, I am considering releasing a compiled version of FGVC and VTAB-1k later on. For now, please follow the instructions in VPT for more details; we strictly follow the same dataset setup as VPT.
Download and place the pre-trained Transformer-based backbones under MODEL.MODEL_ROOT. Note that you also need to rename the downloaded ViT-B/16 checkpoint from `ViT-B_16.npz` to `imagenet21k_ViT-B_16.npz`.
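For example, assuming MODEL.MODEL_ROOT points to a folder named ckpts/ (the folder name is only illustrative):

```bash
# "ckpts/" stands in for whatever MODEL.MODEL_ROOT points to in your config.
mkdir -p ckpts
mv ViT-B_16.npz ckpts/imagenet21k_ViT-B_16.npz
# Optionally compare the leading characters of the digest with the md5sum column below.
md5sum ckpts/imagenet21k_ViT-B_16.npz
```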
See Table 9 in the Appendix for more details about pre-trained backbones.
| Pre-trained Backbone | Pre-trained Objective | Link | md5sum |
| --- | --- | --- | --- |
| ViT-B/16 | Supervised | link | d9715d |
| ViT-B/16 | MoCo v3 | link | 8f39ce |
| ViT-B/16 | MAE | link | 8cad7c |
| Swin-B | Supervised | link | bf9cc1 |
We will release the hyperparameters for all experiments in the paper soon. Stay tuned!
If you find our work helpful in your research, please cite it as:
@inproceedings{cheng2023e2vpt,
  title={E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning},
  author={Han, Cheng and Wang, Qifan and Cui, Yiming and Cao, Zhiwen and Wang, Wenguan and Qi, Siyuan and Liu, Dongfang},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2023}
}
The majority of VPT is licensed under the CC-BY-NC 4.0 license (see LICENSE for details). Portions of the project are available under separate license terms: google-research/task_adaptation and huggingface/transformers are licensed under the Apache 2.0 license; Swin-Transformer, ConvNeXt and ViT-pytorch are licensed under the MIT license; and MoCo-v3 and MAE are licensed under the Attribution-NonCommercial 4.0 International license.