(👉 Under construction! For now, please check command.txt for the training commands. The current version still contains some redundancies, and the commands/instructions are not yet ready for a formal release. I will update them gradually, so please stay tuned!)
Our arXiv version is now available. Please check it out! 🔥🔥🔥
This repository contains the official PyTorch implementation for E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning. Our work builds on Visual Prompt Tuning (VPT), and we thank its authors for their great work.
As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into the self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure to systematically prune low-importance prompts while preserving model performance, which largely enhances the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks with considerably low parameter usage (e.g., 0.32% of model parameters on VTAB-1k). We anticipate that this work will inspire further exploration within the pretrain-then-finetune paradigm for large-scale models.
Figure 1: Overview of our E2VPT framework. Under the pretrain-then-finetune paradigm, only the prompts in the transformer's input and backbone are updated during the fine-tuning process, while all other components remain frozen. We further introduce pruning at two levels of granularity (i.e., token-wise and segment-wise) in (d) to eliminate unfavorable input prompts during rewinding.
See env_setup.sh
Note that you need to copy the file provided in the `timm_added` folder into `timm/models` (i.e., `anaconda3/envs/[envs-name]/lib/python3.7/site-packages/timm/models`) and register it in `__init__.py` by adding `from .vision_transformer_changeVK import *`.
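A minimal sketch of this step is shown below. It assumes the file shipped in timm_added is named vision_transformer_changeVK.py (matching the import above) and that your environment lives under the default Anaconda path; adjust the paths to your setup.

```bash
# Sketch only: [envs-name], the Python version, and the file name are placeholders/assumptions.
TIMM_MODELS=~/anaconda3/envs/[envs-name]/lib/python3.7/site-packages/timm/models

# Copy the modified ViT implementation into timm and register it in __init__.py.
cp timm_added/vision_transformer_changeVK.py "$TIMM_MODELS/"
echo "from .vision_transformer_changeVK import *" >> "$TIMM_MODELS/__init__.py"
```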
- E^2VPT related (an example command combining these options is shown after this list):
  - MODEL.P_VK.NUM_TOKENS: prompt length for the key-value prompts in self-attention
  - MODEL.P_VK.NUM_TOKENS_P: prompt length for the visual (input) prompts (similar to VPT, but with pruning and rewinding)
- Fine-tuning method specification ("P_VK" is the default method for E^2VPT):
  - MODEL.TRANSFER_TYPE
- Vision backbones:
  - DATA.FEATURE: specify which representation to use
  - MODEL.TYPE: the general backbone type, e.g., "vit" or "swin"
  - MODEL.MODEL_ROOT: folder with pre-trained model checkpoints
- Optimization related:
  - SOLVER.BASE_LR: learning rate for the experiment
  - SOLVER.WEIGHT_DECAY: weight decay value for the experiment
  - DATA.BATCH_SIZE: batch size for the experiment
- Dataset related:
  - DATA.NAME: dataset name
  - DATA.DATAPATH: where you put the datasets
  - DATA.NUMBER_CLASSES: number of classes
- Others:
  - OUTPUT_DIR: output directory for the final model and logs
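For illustration, a single run might be launched as follows. This is a hypothetical invocation: the entry point, config file, and all values below are placeholders based on the VPT-style setup, so please refer to command.txt for the verified commands.

```bash
# Hypothetical example (all values are placeholders); see command.txt for the exact commands.
python train.py \
    --config-file configs/prompt/cub.yaml \
    MODEL.TYPE "vit" \
    MODEL.TRANSFER_TYPE "P_VK" \
    MODEL.P_VK.NUM_TOKENS "10" \
    MODEL.P_VK.NUM_TOKENS_P "10" \
    DATA.FEATURE "sup_vitb16_imagenet21k" \
    MODEL.MODEL_ROOT "ckpts/" \
    DATA.NAME "CUB" \
    DATA.DATAPATH "data/" \
    DATA.NUMBER_CLASSES "200" \
    DATA.BATCH_SIZE "64" \
    SOLVER.BASE_LR "0.1" \
    SOLVER.WEIGHT_DECAY "0.01" \
    OUTPUT_DIR "output/"
```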
Since preparing all of the datasets takes time, I am considering releasing a compiled version of FGVC and VTAB-1k later on. For now, please follow the instructions in VPT for more details; we strictly follow the same dataset setup as VPT.
Download and place the pre-trained Transformer-based backbones under MODEL.MODEL_ROOT. Note that you also need to rename the downloaded ViT-B/16 checkpoint from `ViT-B_16.npz` to `imagenet21k_ViT-B_16.npz`.
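For example, assuming MODEL.MODEL_ROOT points to a folder named ckpts/ (the folder name is only illustrative):

```bash
# "ckpts/" stands in for whatever MODEL.MODEL_ROOT points to in your config.
mkdir -p ckpts
mv ViT-B_16.npz ckpts/imagenet21k_ViT-B_16.npz
# Optionally compare the leading characters of the digest with the md5sum column below.
md5sum ckpts/imagenet21k_ViT-B_16.npz
```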
See Table 9 in the Appendix for more details about pre-trained backbones.
| Pre-trained Backbone | Pre-trained Objective | Link | md5sum |
| --- | --- | --- | --- |
| ViT-B/16 | Supervised | link | d9715d |
| ViT-B/16 | MoCo v3 | link | 8f39ce |
| ViT-B/16 | MAE | link | 8cad7c |
| Swin-B | Supervised | link | bf9cc1 |
We will release the hyperparameters for all experiments in the paper soon. Stay tuned!
If you find our work helpful in your research, please cite it as:
@inproceedings{cheng2023e2vpt,
  title={E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning},
  author={Han, Cheng and Wang, Qifan and Cui, Yiming and Cao, Zhiwen and Wang, Wenguan and Qi, Siyuan and Liu, Dongfang},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2023}
}
The majority of VPT is licensed under the CC-BY-NC 4.0 license (see LICENSE for details). Portions of the project are available under separate license terms: google-research/task_adaptation and huggingface/transformers are licensed under the Apache 2.0 license; Swin-Transformer, ConvNeXt and ViT-pytorch are licensed under the MIT license; and MoCo-v3 and MAE are licensed under the Attribution-NonCommercial 4.0 International license.