🔥🔥🔥 A paper list of recent works on token compression for ViT and VLM.
- AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity . [AVG-LLaVA;Github]
- Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs . [TRIM]
- TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings . [TG-LLaVA]
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding . [mPLUG-DocOwl2;Github]
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval . [TempMe;Video;Github]
- Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information . [Recoverable Compression]
- HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments . [HiRED;Github]
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding . [Token-level;Github]
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models . [HiRes-LLaVA]
- TokenPacker: Efficient Visual Projector for Multimodal LLM . [TokenPacker;Github]
- VoCo-LLaMA: Towards Vision Compression with Large Language Models . [VoCo-LLaMA;Github]
- DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models . [DeCo;Github]
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites . [InternVL;Pixel-Shuffle;Github] (pixel-shuffle sketch after the list)
- CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference . [CATP]
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models . [LLaVA-PruMerge;Github]
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference . [FastV;ECCV 2024;Github] (token-pruning sketch after the list)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model . [LDP-v2;Github]
- Honeybee: Locality-enhanced Projector for Multimodal LLM . [C-Abstractor;CVPR 2024;Github]
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models . [LLaMA-VID;ECCV 2024;Github]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond . [Resampler;Github]
- CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers . [CrossGET; ICML 2024;Github]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models . [Q-Former;Github] (query-resampler sketch after the list)
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning . [Token Compensator;ToCom;Github]
- Dynamic and Compressive Adaptation of Transformers From Images to Videos . [InTI]
- LookupViT: Compressing visual information to a limited number of tokens . [LookupViT;DeepMind]
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation . [PYRA;ECCV 2024;Github]
- PPT: Token Pruning and Pooling for Efficient Vision Transformers . [PPT;Github]
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers . [DiffRate;ICCV 2023;Github]
- Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers . [TPS;CVPR 2023;Github]
- Token Merging: Your ViT But Faster . [ToMe;Token Merging;ICLR 2023] (merging sketch after the list)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention . [Adaptive Sparse ViT]
- EViT: Expediting Vision Transformers via Token Reorganizations . [EViT;ICLR 2022;Github]
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space . [ViT-Slim;CVPR 2022;Github]
- A-ViT: Adaptive Tokens for Efficient Vision Transformer . [A-ViT]
- ATS: Adaptive Token Sampling For Efficient Vision Transformers . [ATS;ECCV 2022;Github]
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer . [Evo-ViT;AAAI 2022;Github]
- Patch Slimming for Efficient Vision Transformers . [Patch Slimming]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification . [DynamicViT;NeurIPS 2021;Github]
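
For reference, here is a minimal PyTorch sketch of pixel-shuffle style token reduction as used in the InternVL entry above: each 2×2 neighborhood of patch tokens is folded into the channel dimension, cutting the number of visual tokens by 4× before the projector. This only illustrates the general idea; the function name and the `grid`/`ratio` arguments are placeholders, not the authors' API.

```python
import torch

def pixel_shuffle_compress(vis_tokens: torch.Tensor, grid: int, ratio: int = 2) -> torch.Tensor:
    """Fold each ratio x ratio neighborhood of patch tokens into the channel dim.

    vis_tokens: (B, grid*grid, C) patch tokens from the vision encoder.
    returns:    (B, (grid//ratio)**2, C*ratio*ratio), typically fed to an MLP projector.
    """
    b, n, c = vis_tokens.shape
    assert n == grid * grid and grid % ratio == 0
    x = vis_tokens.view(b, grid, grid, c)
    # split each spatial axis into (grid//ratio, ratio) and move the ratio dims next to channels
    x = x.view(b, grid // ratio, ratio, grid // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // ratio) ** 2, c * ratio * ratio)
    return x

# e.g. 1024 ViT tokens (32x32 grid) -> 256 tokens with 4x wider channels:
# pixel_shuffle_compress(torch.randn(2, 1024, 1024), grid=32).shape == (2, 256, 4096)
```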
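FastV-style acceleration drops visual tokens that receive little attention after an early decoder layer. The sketch below shows the general recipe under simplifying assumptions (a single contiguous image-token span, attention weights available from the previous layer); it is not the official FastV implementation, and the function and argument names are made up for illustration.

```python
import torch

def prune_visual_tokens(hidden: torch.Tensor, attn: torch.Tensor,
                        vis_start: int, vis_end: int, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the most-attended visual tokens for the remaining decoder layers.

    hidden: (B, L, D)    hidden states entering the next layer.
    attn:   (B, H, L, L) attention weights of the current layer.
    Visual tokens occupy positions [vis_start, vis_end); all other tokens are kept.
    """
    b, l, d = hidden.shape
    # average attention each visual token receives, over heads and query positions
    score = attn.mean(dim=1)[:, :, vis_start:vis_end].mean(dim=1)             # (B, N_vis)
    n_keep = max(1, int(keep_ratio * (vis_end - vis_start)))
    keep = score.topk(n_keep, dim=-1).indices.sort(dim=-1).values + vis_start  # keep original order
    idx = torch.cat([
        torch.arange(vis_start, device=hidden.device).expand(b, -1),           # tokens before the image
        keep,
        torch.arange(vis_end, l, device=hidden.device).expand(b, -1),          # tokens after the image
    ], dim=1)
    return hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```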
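Several entries above (Q-Former in BLIP-2, the Resampler in Qwen-VL) compress a variable number of visual tokens into a small, fixed set of learnable queries via cross-attention. The module below is a rough single-layer sketch of that idea, closest in spirit to a Perceiver-style resampler; the real Q-Former is a multi-layer BERT-style model, and all names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryResampler(nn.Module):
    """Cross-attend a fixed set of learnable queries over visual tokens,
    producing a constant number of output tokens regardless of input length."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q, self.ln_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, dim) -> (B, num_queries, dim), independent of N
        q = self.ln_q(self.queries).unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        kv = self.ln_kv(vis_tokens)
        out, _ = self.attn(q, kv, kv)
        return out
```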
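The core step of ToMe-style token merging is bipartite soft matching: split the tokens into two sets, match each token in one set with its most similar counterpart in the other, and merge the r most similar pairs. The sketch below matches on the token features directly and averages merged pairs; the actual ToMe implementation matches on attention keys, preserves token order, and tracks token sizes for proportional attention.

```python
import torch
import torch.nn.functional as F

def bipartite_soft_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Reduce N tokens to N - r by merging the r most similar (A, B) pairs.

    x: (B, N, C) token features; returns (B, N - r, C).
    """
    b, n, c = x.shape
    a, bt = x[:, ::2], x[:, 1::2]                        # alternate split into sets A and B
    assert 0 <= r <= a.shape[1]
    sim = F.normalize(a, dim=-1) @ F.normalize(bt, dim=-1).transpose(-1, -2)  # cosine sim (B, Na, Nb)
    best_val, best_idx = sim.max(dim=-1)                 # each A token's closest B token
    order = best_val.argsort(dim=-1, descending=True)    # most similar A tokens first
    merge_ids, keep_ids = order[:, :r], order[:, r:]

    kept_a = a.gather(1, keep_ids.unsqueeze(-1).expand(-1, -1, c))
    src = a.gather(1, merge_ids.unsqueeze(-1).expand(-1, -1, c))
    dst = best_idx.gather(1, merge_ids).unsqueeze(-1).expand(-1, -1, c)
    bt = bt.scatter_reduce(1, dst, src, reduce="mean", include_self=True)  # average A into matched B
    return torch.cat([kept_a, bt], dim=1)
```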