Ahmed-ballah / Transformer-in-Computer-Vision


Transformer-in-Vision

A paper list of some recent Transformer-based CV works. If you find some ignored papers, please open issues or pull requests.

The list has grown long and hard to navigate; a reorganized version will be released as soon as possible.

**Last updated: 2024/01/17**

Survey

  • (arXiv 2024.01) Transformer for Object Re-Identification: A Survey. [Paper]

  • (arXiv 2023.12) A Comprehensive Study of Vision Transformers in Image Classification Tasks. [Paper]

  • (arXiv 2023.12) A Recent Survey of Vision Transformers for Medical Image Segmentation. [Paper]

  • (arXiv 2023.11) Explainability of Vision Transformers: A Comprehensive Review and New Perspectives. [Paper]

  • (arXiv 2023.10) Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability. [Paper]

  • (arXiv 2023.10) Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey. [Paper], [Awesome]

  • (arXiv 2023.09) Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art. [Paper]

  • (arXiv 2023.09) A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking. [Paper]

  • (arXiv 2023.07) A Survey of Techniques for Optimizing Transformer Inference. [Paper]

  • (arXiv 2023.07) Transformers in Reinforcement Learning: A Survey. [Paper]

  • (arXiv 2023.07) Vision Language Transformers: A Survey. [Paper]

  • (arXiv 2023.06) 2D Object Detection with Transformers: A Review. [Paper], [Awesome]

  • (arXiv 2023.05) Vision Transformers for Mobile Applications: A Short Survey. [Paper]

  • (arXiv 2023.05) A survey of the Vision Transformers and its CNN-Transformer based Variants. [Paper]

  • (arXiv 2023.05) Semantic Segmentation using Vision Transformers: A survey. [Paper]

  • (arXiv 2023.04) Transformer-based models and hardware acceleration analysis in autonomous driving: A survey. [Paper]

  • (arXiv 2023.04) Transformer-Based Visual Segmentation: A Survey. [Paper], [Awesome]

  • (arXiv 2023.02) Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey. [Paper]

  • (arXiv 2023.02) A Survey on Efficient Training of Transformers. [Paper]

  • (arXiv 2023.01) Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review. [Paper], [Awesome]

  • (arXiv 2022.11) Vision Transformers in Medical Imaging: A Review. [Paper]

  • (arXiv 2022.11) A Comprehensive Survey of Transformers for Computer Vision. [Paper]

  • (arXiv 2022.09) A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective. [Paper]

  • (arXiv 2022.09) Vision Transformers for Action Recognition: A Survey. [Paper]

  • (arXiv 2022.09) Transformers in Remote Sensing: A Survey. [Paper], [Awesome]

  • (arXiv 2022.08) Medical image analysis based on transformer: A Review. [Paper]

  • (arXiv 2022.08) 3D Vision with Transformers: A Survey. [Paper], [Awesome]

  • (arXiv 2022.05) Multimodal Learning with Transformers: A Survey. [Paper]

  • (arXiv 2022.05) Transformers in 3D Point Clouds: A Survey. [Paper]

  • (arXiv 2022.03) Vision Transformers in Medical Computer Vision - A Contemplative Retrospection. [Paper]

  • (arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review. [Paper]

  • (arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work. [Paper]

  • (arXiv 2022.02) Transformers in Medical Image Analysis: A Review. [Paper]

  • (arXiv 2022.01) Transformers in Medical Imaging: A Survey. [Paper], [Awesome]

  • (arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks. [Paper]

  • (arXiv 2022.01) Video Transformers: A Survey. [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers. [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training. [Paper]

  • (arXiv 2021.03) Multi-modal Motion Prediction with Stacked Transformers. [Paper], [Code]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision. [Paper]

  • (arXiv 2021.01) Transformers in Vision: A Survey. [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey. [Paper]

Recent Papers

Action

  • (CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
  • (arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
  • (arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
  • (arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]
  • (arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]
  • (arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]
  • (arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]
  • (arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]
  • (arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]
  • (arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]
  • (arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]
  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper], [Code]
  • (arXiv 2021.10) Lightweight Transformer in Federated Setting for Human Activity Recognition, [Paper]
  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
  • (arXiv 2021.11) Evaluating Transformers for Lightweight Action Recognition, [Paper]
  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
  • (arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
  • (arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]
  • (arXiv 2022.03) Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition, [Paper]
  • (arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]
  • (arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]
  • (arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]
  • (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]
  • (arXiv 2022.03) Multi-label Transformer for Action Unit Detection, [Paper]
  • (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]
  • (arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]
  • (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper], [Code]
  • (arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper], [Code]
  • (arXiv 2022.05) Cross-modal Representation Learning for Zero-shot Action Recognition, [Paper], [Code]
  • (arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos, [Paper], [Code]
  • (arXiv 2022.05) Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling, [Paper]
  • (arXiv 2022.05) Cross-Enhancement Transformer for Action Segmentation, [Paper]
  • (arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation, [Paper]
  • (arXiv 2022.05) Future Transformer for Long-term Action Anticipation, [Paper], [Code]
  • (arXiv 2022.06) One-stage Action Detection Transformer, [Paper]
  • (arXiv 2022.06) Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition, [Paper]
  • (arXiv 2022.07) Hunting Group Clues with Transformers for Social Group Activity Recognition, [Paper]
  • (arXiv 2022.07) Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning, [Paper], [Code]
  • (arXiv 2022.07) Entry-Flipped Transformer for Inference and Prediction of Participant Behavior, [Paper], [Code]
  • (arXiv 2022.07) Action Quality Assessment with Temporal Parsing Transformer, [Paper]
  • (arXiv 2022.07) HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers, [Paper]
  • (arXiv 2022.07) An Efficient Spatio-Temporal Pyramid Transformer for Action Detection, [Paper]
  • (arXiv 2022.07) Action Quality Assessment using Transformers, [Paper]
  • (arXiv 2022.07) Unsupervised Domain Adaptation for Video Transformers in Action Recognition, [Paper], [Code]
  • (arXiv 2022.07) Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition, [Paper], [Code]
  • (arXiv 2022.08) Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition, [Paper]
  • (arXiv 2022.08) ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos, [Paper], [Code]
  • (arXiv 2022.08) Adaptive Perception Transformer for Temporal Action Localization, [Paper], [Code]
  • (arXiv 2022.08) A Circular Window-based Cascade Transformer for Online Action Detection, [Paper]
  • (arXiv 2022.09) Self-Supervised Multimodal Fusion Transformer for Passive Activity Recognition, [Paper]
  • (arXiv 2022.09) TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation, [Paper]
  • (arXiv 2022.09) Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos, [Paper]
  • (arXiv 2022.09) Lightweight Transformers for Human Activity Recognition on Mobile Devices, [Paper]
  • (arXiv 2022.09) Multi-dataset Training of Transformers for Robust Action Recognition, [Paper], [Code]
  • (arXiv 2022.10) Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition, [Paper], [Code]
  • (arXiv 2022.10) STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition, [Paper]
  • (arXiv 2022.10) Transformer-based Action recognition in hand-object interacting scenarios, [Paper]
  • (arXiv 2022.10) Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation, [Paper]
  • (arXiv 2022.10) Holistic Interaction Transformer Network for Action Detection, [Paper], [Code]
  • (arXiv 2022.10) GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction, [Paper]
  • (arXiv 2022.10) Hypergraph Transformer for Skeleton-based Action Recognition, [Paper]
  • (arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for Action Recognition, [Paper], [Code]
  • (arXiv 2022.11) Interaction Visual Transformer for Egocentric Action Anticipation, [Paper], [Code]
  • (arXiv 2023.02) Transformers in Action Recognition: A Review on Temporal Modeling, [Paper]
  • (arXiv 2023.02) Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [Paper], [Code]
  • (arXiv 2023.02) Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition, [Paper]
  • (arXiv 2023.02) Temporal Segment Transformer for Action Segmentation, [Paper]
  • (arXiv 2023.03) EgoViT: Pyramid Video Transformer for Egocentric Action Recognition, [Paper]
  • (arXiv 2023.03) Vision Transformer for Action Units Detection, [Paper]
  • (arXiv 2023.03) Group Activity Recognition using Self-supervised Approach of Spatiotemporal Transformers, [Paper]
  • (arXiv 2023.03) 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition, [Paper]
  • (arXiv 2023.04) STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition, [Paper], [Code]
  • (arXiv 2023.04) End-to-End Spatio-Temporal Action Localisation with Video Transformers, [Paper]
  • (arXiv 2023.05) Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition, [Paper]
  • (arXiv 2023.05) Multi-View Multi-Scale Driver Action Recognition with Vision Transformer, [Paper], [Code]
  • (arXiv 2023.05) Enhancing Transformer Backbone for Egocentric Video Action Segmentation, [Paper], [Code]
  • (arXiv 2023.05) A Multi-Modal Transformer Network for Action Detection, [Paper]
  • (arXiv 2023.06) Optimizing ViViT Training: Time and Memory Reduction for Action Recognition, [Paper]
  • (arXiv 2023.06) SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network, [Paper]
  • (arXiv 2023.07) Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition, [Paper], [Code]
  • (arXiv 2023.07) VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation, [Paper]
  • (arXiv 2023.07) Multimodal Distillation for Egocentric Action Recognition, [Paper]
  • (arXiv 2023.07) Human Action Recognition in Still Images Using ConViT, [Paper]
  • (arXiv 2023.07) MSQNet: Actor-agnostic Action Recognition with Multi-modal Query, [Paper], [Code]
  • (arXiv 2023.07) Event-based Vision for Early Prediction of Manipulation Actions, [Paper]
  • (arXiv 2023.08) PAT: Position-Aware Transformer for Dense Multi-Label Action Detection, [Paper]
  • (arXiv 2023.08) Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning, [Paper]
  • (arXiv 2023.08) MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers, [Paper]
  • (arXiv 2023.08) Memory-and-Anticipation Transformer for Online Action Understanding, [Paper], [Code]
  • (arXiv 2023.08) Self-Feedback DETR for Temporal Action Detection, [Paper], [Code]
  • (arXiv 2023.08) EventTransAct: A video transformer-based framework for Event-camera based action recognition, [Paper], [Code]
  • (arXiv 2023.08) Topology-aware MLP for Skeleton-based Action Recognition, [Paper], [Code]
  • (arXiv 2023.08) Prompt-enhanced Hierarchical Transformer Elevating Cardiopulmonary Resuscitation Instruction via Temporal Action Segmentation, [Paper]
  • (arXiv 2023.09) COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers, [Paper], [Code]
  • (arXiv 2023.09) Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition, [Paper]
  • (arXiv 2023.09) SkeleTR: Towards Skeleton-based Action Recognition in the Wild, [Paper]
  • (arXiv 2023.09) Egocentric RGB+Depth Action Recognition in Industry-Like Settings, [Paper]
  • (arXiv 2023.10) POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization, [Paper]
  • (arXiv 2023.11) Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition, [Paper]
  • (arXiv 2023.11) Act-VIT: A Representationally Robust Attention Architecture for Skeleton Based Action Recognition Using Vision Transformer, [Paper]
  • (arXiv 2023.11) SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Human Action Segmentation, [Paper], [Code]
  • (arXiv 2023.11) GeoDeformer: Geometric Deformable Transformer for Action Recognition, [Paper]
  • (arXiv 2023.12) REACT: Recognize Every Action Everywhere All At Once, [Paper]
  • (arXiv 2023.12) Adapting Short-Term Transformers for Action Detection in Untrimmed Videos, [Paper]
  • (arXiv 2023.12) STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition, [Paper], [Code]
  • (arXiv 2024.01) Multi-view Distillation based on Multi-modal Fusion for Few-shot Action Recognition, [Paper], [Code]

Active Learning

  • (arXiv 2022.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

Adversarial Attacks

  • (arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [Paper], [Code]
  • (arXiv 2022.06) Backdoor Attacks on Vision Transformers, [Paper], [Code]
  • (arXiv 2022.06) Defending Backdoor Attacks on Vision Transformer via Patch Processing, [Paper]
  • (arXiv 2022.07) Towards Efficient Adversarial Training on Vision Transformers, [Paper]
  • (arXiv 2022.08) Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem, [Paper], [Code]
  • (arXiv 2022.08) Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks, [Paper]
  • (arXiv 2023.01) Inference Time Evidences of Adversarial Attacks for Forensic on Transformers, [Paper]
  • (arXiv 2023.03) Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization, [Paper]
  • (arXiv 2023.05) On enhancing the robustness of Vision Transformers: Defensive Diffusion, [Paper], [Code]
  • (arXiv 2023.06) Pre-trained transformer for adversarial purification, [Paper]
  • (arXiv 2023.07) Random Position Adversarial Patch for Vision Transformers, [Paper]
  • (arXiv 2023.07) Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models, [Paper]
  • (arXiv 2023.09) Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks, [Paper]
  • (arXiv 2023.09) RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias, [Paper]
  • (arXiv 2023.10) Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models, [Paper]
  • (arXiv 2023.10) ConViViT -- A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition, [Paper]
  • (arXiv 2023.10) Blacksmith: Fast Adversarial Training of Vision Transformers via a Mixture of Single-step and Multi-step Methods, [Paper]
  • (arXiv 2023.11) DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training, [Paper]
  • (arXiv 2023.11) Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches, [Paper]
  • (arXiv 2023.12) MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness, [Paper], [Code]
  • (arXiv 2024.01) FullLoRA-AT: Efficiently Boosting the Robustness of Pretrained Vision Transformers, [Paper]

Anomaly Detection

  • (arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]
  • (arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]
  • (arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]
  • (arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [Paper], [Code]
  • (arXiv 2022.06) Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection, [Paper]
  • (arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for Anomaly Detection and Localization, [Paper], [Code]
  • (arXiv 2022.08) ADTR: Anomaly Detection Transformer with Feature Reconstruction, [Paper]
  • (arXiv 2022.09) Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection, [Paper], [Code]
  • (arXiv 2022.09) Anomaly Detection in Aerial Videos with Transformers, [Paper], [Code]
  • (arXiv 2022.10) Masked Transformer for image Anomaly Localization, [Paper]
  • (arXiv 2022.11) Generalizable Industrial Visual Anomaly Detection with Self-Induction Vision Transformer, [Paper]
  • (arXiv 2023.03) Incremental Self-Supervised Learning Based on Transformer for Anomaly Detection and Localization, [Paper]
  • (arXiv 2023.03) Unsupervised Anomaly Detection with Local-Sensitive VQVAE and Global-Sensitive Transformers, [Paper]
  • (arXiv 2023.03) Visual Anomaly Detection via Dual-Attention Transformer and Discriminative Flow, [Paper]
  • (arXiv 2023.05) Multiresolution Feature Guidance Based Transformer for Anomaly Detection, [Paper]
  • (arXiv 2023.06) Efficient Anomaly Detection with Budget Annotation Using Semi-Supervised Residual Transformer, [Paper], [Code]
  • (arXiv 2023.07) SelFormaly: Towards Task-Agnostic Unified Anomaly Detection, [Paper]
  • (arXiv 2023.08) Patch-wise Auto-Encoder for Visual Anomaly Detection, [Paper]
  • (arXiv 2023.09) Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation, [Paper]
  • (arXiv 2023.10) Hierarchical Vector Quantized Transformer for Multi-class Unsupervised Anomaly Detection, [Paper], [Code]
  • (arXiv 2023.12) Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning, [Paper]
  • (arXiv 2023.12) Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection, [Paper], [Code]

Assessment

  • (arXiv 2021.01) Transformer for Image Quality Assessment, [Paper], [Code]
  • (arXiv 2021.04) Perceptual Image Quality Assessment with Transformers, [Paper], [Code]
  • (arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]
  • (arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper], [Code]
  • (arXiv 2021.10) VTAMIQ: Transformers for Attention Modulated Image Quality Assessment, [Paper]
  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]
  • (arXiv 2022.03) Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment, [Paper]
  • (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]
  • (arXiv 2022.05) SwinIQA: Learned Swin Distance for Compressed Image Quality Assessment, [Paper]
  • (arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [Paper]
  • (arXiv 2022.08) DAHiTrA: Damage Assessment Using a Novel Hierarchical Transformer Architecture, [Paper]
  • (arXiv 2022.10) DCVQE: A Hierarchical Transformer for Video Quality Assessment, [Paper]
  • (arXiv 2023.03) ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers, [Paper], [Code]
  • (arXiv 2023.03) MRET: Multi-resolution Transformer for Video Quality Assessment, [Paper]
  • (arXiv 2023.05) Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token, [Paper], [Code]
  • (arXiv 2023.08) Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment, [Paper]
  • (arXiv 2023.12) Activating Frequency and ViT for 3D Point Cloud Quality Assessment without Reference, [Paper], [Code]
  • (arXiv 2024.01) Video Quality Assessment Based on Swin TransformerV2 and Coarse to Fine Strategy, [Paper], [Code]

Augmentation

  • (arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers, [Paper], [Code]
  • (arXiv 2022.12) SMMix: Self-Motivated Image Mixing for Vision Transformers, [Paper], [Code]

Audio

  • (arXiv 2022.11) ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation, [Paper]
  • (arXiv 2023.03) Multiscale Audio Spectrogram Transformer for Efficient Audio Classification, [Paper]
  • (arXiv 2023.03) ModEFormer: Modality-Preserving Embedding for Audio-Video Synchronization using Transformers, [Paper]
  • (arXiv 2023.05) Transformer-based Sequence Labeling for Audio Classification based on MFCCs, [Paper]
  • (arXiv 2023.07) AVSegFormer: Audio-Visual Segmentation with Transformer, [Paper], [Code]
  • (arXiv 2023.11) Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing, [Paper]
  • (arXiv 2023.12) Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling, [Paper]
  • (arXiv 2024.01) Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification, [Paper]

Bird's-Eye-View

  • (arXiv 2022.03) BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]
  • (arXiv 2022.05) ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation, [Paper], [Code]
  • (arXiv 2022.06) PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images, [Paper]
  • (arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [Paper], [Code]
  • (arXiv 2022.06) PolarFormer: Multi-camera 3D Object Detection with Polar Transformer, [Paper], [Code]
  • (arXiv 2022.07) CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers, [Paper]
  • (arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View, [Paper]
  • (arXiv 2022.09) A Dual-Cycled Cross-View Transformer Network for Unified Road Layout Estimation and 3D Object Detection in the Bird's-Eye-View, [Paper]
  • (arXiv 2022.09) BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection, [Paper]
  • (arXiv 2023.02) DA-BEV: Depth Aware BEV Transformer for 3D Object Detection, [Paper]
  • (arXiv 2023.03) TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving, [Paper], [Code]
  • (arXiv 2023.04) VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.04) FedBEVT: Federated Learning Bird's Eye View Perception Transformer in Road Traffic Systems, [Paper]
  • (arXiv 2023.04) A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation, [Paper]
  • (arXiv 2023.06) OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection, [Paper]
  • (arXiv 2023.06) An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection, [Paper]
  • (arXiv 2023.07) HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird’s Eye View, [Paper]
  • (arXiv 2023.08) UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, [Paper], [Code]
  • (arXiv 2023.09) FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection, [Paper]
  • (arXiv 2023.10) Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing, [Paper]
  • (arXiv 2023.12) Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach, [Paper]
  • (arXiv 2023.12) BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection, [Paper]
  • (arXiv 2023.12) COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction, [Paper]
  • (arXiv 2023.12) Learned Fusion: 3D Object Detection using Calibration-Free Transformer Feature Fusion, [Paper]
  • (arXiv 2023.12) Diffusion-Based Particle-DETR for BEV Perception, [Paper]
  • (arXiv 2023.12) Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers, [Paper]
  • (arXiv 2024.01) WidthFormer: Toward Efficient Transformer-based BEV View Transformation, [Paper], [Code]

Captioning

  • (arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
  • (arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
  • (arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]
  • (arXiv 2021.06) Semi-Autoregressive Transformer for Image Captioning, [Paper], [Code]
  • (arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]
  • (arXiv 2021.08) Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning, [Paper], [Code]
  • (arXiv 2021.09) Bornon: Bengali Image Captioning with Transformer-based Deep learning approach, [Paper]
  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper], [Code]
  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
  • (arXiv 2021.10) Geometry Attention Transformer with Position-aware LSTMs for Image Captioning, [Paper]
  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
  • (arXiv 2021.11) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
  • (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
  • (arXiv 2022.02) Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation, [Paper], [Code]
  • (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]
  • (arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]
  • (arXiv 2022.03) Quantifying Societal Bias Amplification in Image Captioning, [Paper]
  • (arXiv 2022.04) Image Captioning In the Transformer Age, [Paper]
  • (arXiv 2022.05) Dual-Level Decoupled Transformer for Video Captioning, [Paper]
  • (arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning, [Paper], [Code]
  • (arXiv 2022.06) Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching, [Paper], [Code]
  • (arXiv 2022.07) ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning, [Paper], [Code]
  • (arXiv 2022.07) GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features, [Paper], [Code]
  • (arXiv 2022.07) Retrieval-Augmented Transformer for Image Captioning, [Paper]
  • (arXiv 2022.09) vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM, [Paper], [Code]
  • (arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA -- Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning, [Paper]
  • (arXiv 2022.11) VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning, [Paper], [Code]
  • (arXiv 2022.11) GRiT: A Generative Region-to-text Transformer for Object Understanding, [Paper], [Code]
  • (arXiv 2023.01) End-to-End 3D Dense Captioning with Vote2Cap-DETR, [Paper], [Code]
  • (arXiv 2023.02) ADAPT: Action-aware Driving Caption Transformer, [Paper], [Code]
  • (arXiv 2023.02) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps, [Paper]
  • (arXiv 2023.03) Neighborhood Contrastive Transformer for Change Captioning, [Paper], [Code]
  • (arXiv 2023.03) Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning, [Paper]
  • (arXiv 2023.03) Text with Knowledge Graph Augmented Transformer for Video Captioning, [Paper]
  • (arXiv 2023.05) Transforming Visual Scene Graphs to Image Captions, [Paper]
  • (arXiv 2023.07) Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning, [Paper]
  • (arXiv 2023.08) RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension, [Paper], [Code]
  • (arXiv 2023.08) Enhancing image captioning with depth information using a Transformer-based framework, [Paper]
  • (arXiv 2023.09) Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning, [Paper], [Code]
  • (arXiv 2023.09) Collaborative Three-Stream Transformers for Video Captioning, [Paper], [Code]
  • (arXiv 2023.09) Accurate and Fast Compressed Video Captioning, [Paper], [Code]

Change Detection

  • (arXiv 2022.01) A Transformer-Based Siamese Network for Change Detection, [Paper], [Code]
  • (arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection, [Paper]
  • (arXiv 2023.08) UCDFormer: Unsupervised Change Detection Using a Transformer-driven Image Translation, [Paper], [Code]
  • (arXiv 2023.09) Changes-Aware Transformer: Learning Generalized Changes Representation, [Paper]
  • (arXiv 2023.10) Transformer-based Multimodal Change Detection with Multitask Consistency Constraints, [Paper]
  • (arXiv 2023.10) TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images, [Paper], [Code]
  • (arXiv 2023.11) MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations, [Paper], [Code]
  • (arXiv 2023.12) Adapting Vision Transformer for Efficient Change Detection, [Paper]

Classification (Backbone)

  • (ICLR'21) Modeling Long-Range Interactions without Attention, [Paper], [Code]
  • (ECCV'20) Feature Pyramid Transformer, [Paper], [Code]
  • (ICLR'21) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
  • (arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
  • (arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper], [Code]
  • (arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
  • (arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.02) Conditional Positional Encodings for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]
  • (arXiv 2021.03) Transformer in Transformer, [Paper], [Code]
  • (arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]
  • (arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]
  • (arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]
  • (arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]
  • (arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]
  • (arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]
  • (arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]
  • (arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]
  • (arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.03) Going deeper with Image Transformers, [Paper]
  • (arXiv 2021.04) LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, [Paper]
  • (arXiv 2021.04) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]
  • (arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]
  • (arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]
  • (arXiv 2021.04) Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]
  • (arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]
  • (arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]
  • (arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]
  • (arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]
  • (arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time, [Paper], [Code]
  • (arXiv 2021.05) Rethinking the Design Principles of Robust Vision Transformer, [Paper], [Code]
  • (arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]
  • (arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper], [Code]
  • (arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]
  • (arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Aggregating Nested Transformers, [Paper]
  • (arXiv 2021.05) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]
  • (arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]
  • (arXiv 2021.06) Container: Context Aggregation Network, [Paper]
  • (arXiv 2021.06) TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification, [Paper]
  • (arXiv 2021.06) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]
  • (arXiv 2021.06) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]
  • (arXiv 2021.06) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]
  • (arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper]
  • (arXiv 2021.06) FoveaTer: Foveated Transformer for Image Classification, [Paper]
  • (arXiv 2021.06) An Attention Free Transformer, [Paper]
  • (arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]
  • (arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]
  • (arXiv 2021.06) Scaling Vision Transformers, [Paper]
  • (arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]
  • (arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]
  • (arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]
  • (arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]
  • (arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]
  • (arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper]
  • (arXiv 2021.06) Reveal of Vision Transformers Robustness against Adversarial Attacks, [Paper]
  • (arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]
  • (arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper]
  • (arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]
  • (arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper], [Code]
  • (arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code1], [Code2]
  • (arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]
  • (arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos, [Paper]
  • (arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper], [Code]
  • (arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.06) IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Project]
  • (arXiv 2021.06) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]
  • (arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]
  • (arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]
  • (arXiv 2021.07) Augmented Shortcuts for Vision Transformers, [Paper]
  • (arXiv 2021.07) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]
  • (arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]
  • (arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]
  • (arXiv 2021.07) Cross-view Geo-localization with Evolving Transformer, [Paper]
  • (arXiv 2021.07) What Makes for Hierarchical Vision Transformer, [Paper]
  • (arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]
  • (arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]
  • (arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]
  • (arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]
  • (arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]
  • (arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]
  • (arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]
  • (arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]
  • (arXiv 2021.07) A Comparison of Deep Learning Classification Methods on Small-scale Image Data set: from Convolutional Neural Networks to Visual Transformers, [Paper]
  • (arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
  • (arXiv 2021.08) CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention, [Paper], [Code]
  • (arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]
  • (arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]
  • (arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]
  • (arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable, [Paper]
  • (arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]
  • (arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks, [Paper]
  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
  • (arXiv 2021.08) Scaled ReLU Matters for Training Vision Transformers, [Paper]
  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
  • (arXiv 2021.09) DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers, [Paper], [Code]
  • (arXiv 2021.09) Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, [Paper]
  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
  • (arXiv 2021.10) MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, [Paper]
  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
  • (arXiv 2021.10) Token Pooling in Visual Transformers, [Paper]
  • (arXiv 2021.10) NViT: Vision Transformer Compression and Parameter Redistribution, [Paper]
  • (arXiv 2021.10) Adversarial Token Attacks on Vision Transformers, [Paper]
  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
  • (arXiv 2021.11) Can Vision Transformers Perform Convolution, [Paper]
  • (arXiv 2021.11) Sliced Recursive Transformer, [Paper], [Code]
  • (arXiv 2021.11) Hybrid BYOL-ViT: Efficient approach to deal with small Datasets, [Paper]
  • (arXiv 2021.11) Are Transformers More Robust Than CNNs, [Paper], [Code]
  • (arXiv 2021.11) iBOT: Image BERT Pre-Training with Online Tokenizer, [Paper]
  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
  • (arXiv 2021.11) Are Vision Transformers Robust to Patch Perturbations, [Paper]
  • (arXiv 2021.11) Discrete Representations Strengthen Vision Transformer Robustness, [Paper]
  • (arXiv 2021.11) Zero-Shot Certified Defense against Adversarial Patches with Vision Transformers, [Paper]
  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
  • (arXiv 2021.11) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
  • (arXiv 2021.12) Dynamic Token Normalization Improves Vision Transformer, [Paper], [Code]
  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
  • (arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation, [Paper]
  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
  • (arXiv 2021.12) SimViT: Exploring a Simple Vision Transformer with sliding windows, [Paper], [Code]
  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
  • (arXiv 2021.12) Augmenting Convolutional networks with attention-based aggregation, [Paper]
  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
  • (arXiv 2021.12) Stochastic Layers in Vision Transformers, [Paper]
  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
  • (arXiv 2022.01) QuadTree Attention for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
  • (arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
  • (arXiv 2022.01) Patches Are All You Need?, [Paper], [Code]
  • (arXiv 2022.01) Convolutional Xformers for Vision, [Paper], [Code]
  • (arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
  • (arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
  • (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
  • (arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
  • (arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
  • (arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
  • (arXiv 2022.02) How Do Vision Transformers Work, [Paper], [Code]
  • (arXiv 2022.02) Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations, [Paper], [Code]
  • (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]
  • (arXiv 2022.02) Learning to Merge Tokens in Vision Transformers, [Paper]
  • (arXiv 2022.02) Auto-scaling Vision Transformers without Training, [Paper], [Code]
  • (arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]
  • (arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]
  • (arXiv 2022.03) BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning, [Paper]
  • (arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]
  • (arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]
  • (arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]
  • (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]
  • (arXiv 2022.03) EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers, [Paper]
  • (arXiv 2022.03) WaveMix: Resource-efficient Token Mixing for Images, [Paper], [Code]
  • (arXiv 2022.03) Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice, [Paper], [Code]
  • (arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]
  • (arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]
  • (arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]
  • (arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]
  • (arXiv 2022.03) Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations, [Paper], [Code]
  • (arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations, [Paper], [Code]
  • (arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]
  • (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]
  • (arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]
  • (arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]
  • (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]
  • (arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]
  • (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]
  • (arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper], [Code]
  • (arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.03) CaCo: Both Positive and Negative Samples are Directly Learnable via Cooperative-adversarial Contrastive Learning, [Paper], [Code]
  • (arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]
  • (arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]
  • (arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]
  • (arXiv 2022.03) MaxViT: Multi-Axis Vision Transformer, [Paper]
  • (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]
  • (arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]
  • (arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]
  • (arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]
  • (arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]
  • (arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper]
  • (arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]
  • (arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]
  • (arXiv 2022.04) ResT V2: Simpler, Faster and Stronger, [Paper], [Code]
  • (arXiv 2022.04) VSA: Learning Varied-Size Window Attention in Vision Transformers, [Paper], [Code]
  • (arXiv 2022.04) OCFormer: One-Class Transformer Network for Image Classification, [Paper]
  • (arXiv 2022.04) Adaptive Split-Fusion Transformer, [Paper], [Code]
  • (arXiv 2022.04) Understanding The Robustness in Vision Transformers, [Paper], [Code]
  • (arXiv 2022.05) Better plain ViT baselines for ImageNet-1k, [Paper], [Code]
  • (arXiv 2022.05) EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers, [Paper], [Code]
  • (arXiv 2022.05) ConvMAE: Masked Convolution Meets Masked Autoencoders, [Paper], [Code]
  • (arXiv 2022.05) Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers, [Paper]
  • (arXiv 2022.05) TRT-ViT: TensorRT-oriented Vision Transformer, [Paper]
  • (arXiv 2022.05) Super Vision Transformer, [Paper], [Code]
  • (arXiv 2022.05) Deeper vs Wider: A Revisit of Transformer Configuration, [Paper]
  • (arXiv 2022.05) Vision Transformers in 2022: An Update on Tiny ImageNet, [Paper], [Code]
  • (arXiv 2022.05) Privacy-Preserving Image Classification Using Vision Transformer, [Paper]
  • (arXiv 2022.05) Inception Transformer, [Paper], [Code]
  • (arXiv 2022.05) MoCoViT: Mobile Convolutional Vision Transformer, [Paper], [Code]
  • (arXiv 2022.05) Breaking the Chain of Gradient Leakage in Vision Transformers, [Paper], [Code]
  • (arXiv 2022.05) Hierarchical Vision Transformer for Masked Image Modeling, [Paper], [Code]
  • (arXiv 2022.05) Fast Vision Transformers with HiLo Attention, [Paper], [Code]
  • (arXiv 2022.05) AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, [Paper], [Code]
  • (arXiv 2022.05) X-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
  • (arXiv 2022.05) Architecture-Agnostic Masked Image Modeling – From ViT back to CNN, [Paper], [Code]
  • (arXiv 2022.05) HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling, [Paper]
  • (arXiv 2022.05) EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition, [Paper], [Code]
  • (arXiv 2022.06) EfficientFormer: Vision Transformers at MobileNet Speed, [Paper], [Code]
  • (arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves Robustness, [Paper], [Code]
  • (arXiv 2022.06) Separable Self-attention for Mobile Vision Transformers, [Paper], [Code]
  • (arXiv 2022.06) Spatial Entropy Regularization for Vision Transformers, [Paper]
  • (arXiv 2022.06) Peripheral Vision Transformer, [Paper], [Code]
  • (arXiv 2022.06) SP-ViT: Learning 2D Spatial Priors for Vision Transformers, [Paper]
  • (arXiv 2022.06) FIT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification, [Paper], [Code]
  • (arXiv 2022.06) SimA: Simple Softmax-free Attention for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.06) Vicinity Vision Transformer, [Paper], [Code]
  • (arXiv 2022.06) EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications, [Paper], [Code]
  • (arXiv 2022.06) Global Context Vision Transformers, [Paper], [Code]
  • (arXiv 2022.06) EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm, [Paper], [Code]
  • (arXiv 2022.06) A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers, [Paper]
  • (arXiv 2022.06) Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment, [Paper], [Code]
  • (arXiv 2022.06) Continual Learning with Transformers for Image Classification, [Paper]
  • (arXiv 2022.07) Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning, [Paper]
  • (arXiv 2022.07) Rethinking Query-Key Pairwise Interactions in Vision Transformers, [Paper]
  • (arXiv 2022.07) Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks, [Paper], [Code]
  • (arXiv 2022.07) Softmax-free Linear Transformers, [Paper], [Code]
  • (arXiv 2022.07) MaiT: Leverage Attention Masks for More Efficient Image Transformers, [Paper]
  • (arXiv 2022.07) Dual Vision Transformer, [Paper], [Code]
  • (arXiv 2022.07) Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning, [Paper], [Code]
  • (arXiv 2022.07) Horizontal and Vertical Attention in Transformers, [Paper]
  • (arXiv 2022.07) LightViT: Towards Light-Weight Convolution-Free Vision Transformers, [Paper], [Code]
  • (arXiv 2022.07) Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios, [Paper]
  • (arXiv 2022.07) Image and Model Transformation with Secret Key for Vision Transformer, [Paper]
  • (arXiv 2022.07) Convolutional Bypasses Are Better Vision Transformer Adapters, [Paper]
  • (arXiv 2022.07) Lightweight Vision Transformer with Cross Feature Attention, [Paper]
  • (arXiv 2022.07) Multi-manifold Attention for Vision Transformers, [Paper]
  • (arXiv 2022.07) TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers, [Paper], [Code]
  • (arXiv 2022.07) Locality Guidance for Improving Vision Transformers on Tiny Datasets, [Paper], [Code]
  • (arXiv 2022.07) TinyViT: Fast Pretraining Distillation for Small Vision Transformers, [Paper], [Code]
  • (arXiv 2022.07) Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer, [Paper], [Code]
  • (arXiv 2022.07) An Impartial Take to the CNN vs Transformer Robustness Contest, [Paper]
  • (arXiv 2022.07) Pro-tuning: Unified Prompt Tuning for Vision Tasks, [Paper]
  • (arXiv 2022.08) Semi-supervised Vision Transformers at Scale, [Paper]
  • (arXiv 2022.08) BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers, [Paper], [Code]
  • (arXiv 2022.08) Accelerating Vision Transformer Training via a Patch Sampling Schedule, [Paper], [Code]
  • (arXiv 2022.08) ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition, [Paper], [Code]
  • (arXiv 2022.08) FocusFormer: Focusing on What We Need via Architecture Sampler, [Paper]
  • (arXiv 2022.08) gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window, [Paper]
  • (arXiv 2022.08) Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling, [Paper]
  • (arXiv 2022.08) ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers, [Paper]
  • (arXiv 2022.09) MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition, [Paper]
  • (arXiv 2022.09) A Light Recipe to Train Robust Vision Transformers, [Paper], [Code]
  • (arXiv 2022.09) ConvFormer: Closing the Gap Between CNN and Vision Transformers, [Paper], [Code]
  • (arXiv 2022.09) Axially Expanded Windows for Local-Global Interaction in Vision Transformers, [Paper]
  • (arXiv 2022.09) Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, [Paper]
  • (arXiv 2022.09) Effective Vision Transformer Training: A Data-Centric Perspective, [Paper]
  • (arXiv 2022.09) Dilated Neighborhood Attention Transformer, [Paper], [Code]
  • (arXiv 2022.10) MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features, [Paper], [Code]
  • (arXiv 2022.10) Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs, [Paper]
  • (arXiv 2022.10) Strong Gravitational Lensing Parameter Estimation with Vision Transformer, [Paper], [Code]
  • (arXiv 2022.10) Token-Label Alignment for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.10) Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets, [Paper], [Code]
  • (arXiv 2022.10) Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers, [Paper], [Code]
  • (arXiv 2022.10) Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers, [Paper]
  • (arXiv 2022.10) Curved Representation Space of Vision Transformers, [Paper]
  • (arXiv 2022.10) How to Train Vision Transformer on Small-scale Datasets, [Paper], [Code]
  • (arXiv 2022.10) Vision Transformer Visualization: What Neurons Tell and How Neurons Behave, [Paper], [Code]
  • (arXiv 2022.10) When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture, [Paper], [Code]
  • (arXiv 2022.10) Vision Transformers provably learn spatial structure, [Paper]
  • (arXiv 2022.10) Scratching Visual Transformer's Back with Uniform Attention, [Paper]
  • (arXiv 2022.10) Token Merging: Your ViT But Faster, [Paper], [Code]
  • (arXiv 2022.10) Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets, [Paper], [Code]
  • (arXiv 2022.10) MetaFormer Baselines for Vision, [Paper]
  • (arXiv 2022.10) Learning Explicit Object-Centric Representations with Vision Transformers, [Paper]
  • (arXiv 2022.10) Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets, [Paper], [Code]
  • (arXiv 2022.10) Grafting Vision Transformers, [Paper]
  • (arXiv 2022.10) Differentially Private CutMix for Split Learning with Vision Transformer, [Paper]
  • (arXiv 2022.10) ViT-LSLA: Vision Transformer with Light Self-Limited-Attention, [Paper]
  • (arXiv 2022.11) Rethinking Hierarchies in Pre-trained Plain Vision Transformer, [Paper], [Code]
  • (arXiv 2022.11) The Lottery Ticket Hypothesis for Vision Transformers, [Paper]
  • (arXiv 2022.11) ViT-CX: Causal Explanation of Vision Transformers, [Paper]
  • (arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention, [Paper]
  • (arXiv 2022.11) Training a Vision Transformer from scratch in less than 24 hours with 1 GPU, [Paper], [Code]
  • (arXiv 2022.11) Demystify Transformers & Convolutions in Modern Image Deep Networks, [Paper], [Code]
  • (arXiv 2022.11) Token Transformer: Can class token help window-based transformer build better long-range interactions, [Paper]
  • (arXiv 2022.11) CabViT: Cross Attention among Blocks for Vision Transformer, [Paper], [Code]
  • (arXiv 2022.11) BiViT: Extremely Compressed Binary Vision Transformer, [Paper], [Code]
  • (arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers, [Paper]
  • (arXiv 2022.11) Vision Transformer with Super Token Sampling, [Paper], [Code]
  • (arXiv 2022.11) Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [Paper]
  • (arXiv 2022.11) Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition, [Paper]
  • (arXiv 2022.11) TranViT: An Integrated Vision Transformer Framework for Discrete Transit Travel Time Range Prediction, [Paper]
  • (arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free Continual Learning of Vision Transformers, [Paper], [Code]
  • (arXiv 2022.11) Data Augmentation Vision Transformer for Fine-grained Image Classification, [Paper]
  • (arXiv 2022.11) Integrally Pre-Trained Transformer Pyramid Networks, [Paper], [Code]
  • (arXiv 2022.11) Explanation on Pretraining Bias of Finetuned Vision Transformer, [Paper]
  • (arXiv 2022.11) Adaptive Attention Link-based Regularization for Vision Transformers, [Paper]
  • (arXiv 2022.11) Semantic-Aware Local-Global Vision Transformer, [Paper]
  • (arXiv 2022.11) Pattern Attention Transformer with Doughnut Kernel, [Paper]
  • (arXiv 2022.11) ResFormer: Scaling ViTs with Multi-Resolution Training, [Paper]
  • (arXiv 2022.12) Teaching Matters: Investigating the Role of Supervision in Vision Transformers, [Paper], [Code]
  • (arXiv 2022.12) Group Generalized Mean Pooling for Vision Transformer, [Paper]
  • (arXiv 2022.12) OAMixer: Object-aware Mixing Layer for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.12) What do Vision Transformers Learn? A Visual Exploration, [Paper]
  • (arXiv 2022.12) GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation, [Paper], [Code]
  • (arXiv 2022.12) FlexiViT: One Model for All Patch Sizes, [Paper], [Code]
  • (arXiv 2022.12) Rethinking Vision Transformers for MobileNet Size and Speed, [Paper], [Code]
  • (arXiv 2022.12) Rethinking Cooking State Recognition with Vision Transformers, [Paper]
  • (arXiv 2022.12) What Makes for Good Tokenizers in Vision Transformer, [Paper]
  • (arXiv 2022.12) Local Learning on Transformers via Feature Reconstruction, [Paper]
  • (arXiv 2022.12) Exploring Transformer Backbones for Image Diffusion Models, [Paper]
  • (arXiv 2023.01) TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models, [Paper], [Code]
  • (arXiv 2023.01) Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers, [Paper]
  • (arXiv 2023.01) Skip-Attention: Improving Vision Transformers by Paying Less Attention, [Paper]
  • (arXiv 2023.01) Dynamic Grained Encoder for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.01) Image Memorability Prediction with Vision Transformers, [Paper]
  • (arXiv 2023.01) Holistically Explainable Vision Transformers, [Paper]
  • (arXiv 2023.02) DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2023.02) KDEformer: Accelerating Transformers via Kernel Density Estimation, [Paper], [Code]
  • (arXiv 2023.02) Reversible Vision Transformers, [Paper], [Code]
  • (arXiv 2023.02) TFormer: A Transmission-Friendly ViT Model for IoT Devices, [Paper]
  • (arXiv 2023.02) Efficiency 360: Efficient Vision Transformers, [Paper]
  • (arXiv 2023.02) ViTA: A Vision Transformer Inference Accelerator for Edge Applications, [Paper]
  • (arXiv 2023.02) CertViT: Certified Robustness of Pre-Trained Vision Transformers, [Paper], [Code]
  • (arXiv 2023.03) Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves, [Paper], [Code]
  • (arXiv 2023.03) Data-Efficient Training of CNNs and Transformers with Coresets: A Stability Perspective, [Paper], [Code]
  • (arXiv 2023.03) A Fast Training-Free Compression Framework for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.03) FFT-based Dynamic Token Mixer for Vision, [Paper], [Code]
  • (arXiv 2023.03) Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models, [Paper], [Code]
  • (arXiv 2023.03) X-Pruner: eXplainable Pruning for Vision Transformers, [Paper]
  • (arXiv 2023.03) CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention, [Paper], [Code]
  • (arXiv 2023.03) Stabilizing Transformer Training by Preventing Attention Entropy Collapse, [Paper]
  • (arXiv 2023.03) Making Vision Transformers Efficient from A Token Sparsification View, [Paper]
  • (arXiv 2023.03) BiFormer: Vision Transformer with Bi-Level Routing Attention, [Paper], [Code]
  • (arXiv 2023.03) ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices, [Paper]
  • (arXiv 2023.03) Robustifying Token Attention for Vision Transformers, [Paper]
  • (arXiv 2023.03) FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization, [Paper]
  • (arXiv 2023.03) Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers, [Paper]
  • (arXiv 2023.03) How Does Attention Work in Vision Transformers? A Visual Analytics Attempt, [Paper]
  • (arXiv 2023.03) SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications, [Paper], [Code]
  • (arXiv 2023.03) Vision Transformer with Quadrangle Attention, [Paper], [Code]
  • (arXiv 2023.04) LaCViT: A Label-aware Contrastive Training Framework for Vision Transformers, [Paper]
  • (arXiv 2023.04) Rethinking Local Perception in Lightweight Vision Transformer, [Paper]
  • (arXiv 2023.04) Vision Transformers with Mixed-Resolution Tokenization, [Paper], [Code]
  • (arXiv 2023.04) Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention, [Paper], [Code]
  • (arXiv 2023.04) PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift, [Paper], [Code]
  • (arXiv 2023.04) SparseFormer: Sparse Visual Recognition via Limited Latent Tokens, [Paper], [Code]
  • (arXiv 2023.04) ViT-Calibrator: Decision Stream Calibration for Vision Transformer, [Paper]
  • (arXiv 2023.04) Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention, [Paper], [Code]
  • (arXiv 2023.04) Life Regression based Patch Slimming for Vision Transformers, [Paper]
  • (arXiv 2023.04) RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer, [Paper], [Code]
  • (arXiv 2023.04) SpectFormer: Frequency and Attention is what you need in a Vision Transformer, [Paper], [Code]
  • (arXiv 2023.04) VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking, [Paper], [Code]
  • (arXiv 2023.04) RSIR Transformer: Hierarchical Vision Transformer using Random Sampling Windows and Important Region Windows, [Paper]
  • (arXiv 2023.04) Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention and Residual Connection in Kernel Space, [Paper], [Code]
  • (arXiv 2023.04) LipsFormer: Introducing Lipschitz Continuity to Vision Transformers, [Paper], [Code]
  • (arXiv 2023.04) Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers, [Paper], [Code]
  • (arXiv 2023.04) MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer, [Paper], [Code]
  • (arXiv 2023.04) Vision Conformer: Incorporating Convolutions into Vision Transformer Layers, [Paper]
  • (arXiv 2023.05) Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT, [Paper]
  • (arXiv 2023.05) MMViT: Multiscale Multiview Vision Transformers, [Paper]
  • (arXiv 2023.05) AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows, [Paper]
  • (arXiv 2023.05) Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields, [Paper]
  • (arXiv 2023.05) EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention, [Paper], [Code]
  • (arXiv 2023.05) GSB: Group Superposition Binarization for Vision Transformer with Limited Training Samples, [Paper]
  • (arXiv 2023.05) Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation, [Paper]
  • (arXiv 2023.05) CageViT: Convolutional Activation Guided Efficient Vision Transformer, [Paper]
  • (arXiv 2023.05) Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design, [Paper]
  • (arXiv 2023.05) Predicting Token Impact Towards Efficient Vision Transformer, [Paper]
  • (arXiv 2023.05) Dual Path Transformer with Partition Attention, [Paper]
  • (arXiv 2023.05) BinaryViT: Towards Efficient and Accurate Binary Vision Transformers, [Paper]
  • (arXiv 2023.05) Making Vision Transformers Truly Shift-Equivariant, [Paper]
  • (arXiv 2023.05) Concept-Centric Transformers: Concept Transformers with Object-Centric Concept Learning for Interpretability, [Paper], [Code]
  • (arXiv 2023.05) DiffRate : Differentiable Compression Rate for Efficient Vision Transformers, [Paper], [Code]
  • (arXiv 2023.06) Lightweight Vision Transformer with Bidirectional Interaction, [Paper], [Code]
  • (arXiv 2023.06) Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles, [Paper], [Code]
  • (arXiv 2023.06) Bytes Are All You Need: Transformers Operating Directly On File Bytes, [Paper], [Code]
  • (arXiv 2023.06) Muti-Scale And Token Mergence: Make Your ViT More Efficient, [Paper]
  • (arXiv 2023.06) FasterViT: Fast Vision Transformers with Hierarchical Attention, [Paper], [Code]
  • (arXiv 2023.06) E(2)-Equivariant Vision Transformer, [Paper], [Code]
  • (arXiv 2023.06) 2-D SSM: A General Spatial Layer for Visual Transformers, [Paper], [Code]
  • (arXiv 2023.06) Mitigating Transformer Overconfidence via Lipschitz Regularization, [Paper], [Code]
  • (arXiv 2023.06) Reviving Shift Equivariance in Vision Transformers, [Paper]
  • (arXiv 2023.06) Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training, [Paper], [Code]
  • (arXiv 2023.06) Fast Training of Diffusion Models with Masked Transformers, [Paper], [Code]
  • (arXiv 2023.06) RaViTT: Random Vision Transformer Tokens, [Paper]
  • (arXiv 2023.06) Vision Transformer with Attention Map Hallucination and FFN Compaction, [Paper]
  • (arXiv 2023.06) Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing, [Paper]
  • (arXiv 2023.06) Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window, [Paper]
  • (arXiv 2023.06) BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models, [Paper]
  • (arXiv 2023.06) Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing, [Paper], [Code]
  • (arXiv 2023.07) Stitched ViTs are Flexible Vision Backbones, [Paper], [Code]
  • (arXiv 2023.07) MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.07) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
  • (arXiv 2023.07) Art Authentication with Vision Transformers, [Paper]
  • (arXiv 2023.07) Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, [Paper]
  • (arXiv 2023.07) What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation, [Paper]
  • (arXiv 2023.07) Scale-Aware Modulation Meet Transformer, [Paper], [Code]
  • (arXiv 2023.07) RepViT: Revisiting Mobile CNN From ViT Perspective, [Paper], [Code]
  • (arXiv 2023.07) R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut, [Paper]
  • (arXiv 2023.07) Learned Thresholds Token Merging and Pruning for Vision Transformers, [Paper]
  • (arXiv 2023.07) Sparse then Prune: Toward Efficient Vision Transformers, [Paper], [Code]
  • (arXiv 2023.07) Sparse Double Descent in Vision Transformers: real or phantom threat?, [Paper]
  • (arXiv 2023.07) Adaptive Frequency Filters As Efficient Global Token Mixers, [Paper]
  • (arXiv 2023.07) E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning, [Paper], [Code]
  • (arXiv 2023.07) Pre-training Vision Transformers with Very Limited Synthesized Images, [Paper], [Code]
  • (arXiv 2023.08) LGViT: Dynamic Early Exiting for Accelerating Vision Transformer, [Paper]
  • (arXiv 2023.08) Performance Evaluation of Swin Vision Transformer Model using Gradient Accumulation Optimization Technique, [Paper]
  • (arXiv 2023.08) FLatten Transformer: Vision Transformer using Focused Linear Attention, [Paper], [Code]
  • (arXiv 2023.08) A Multidimensional Analysis of Social Biases in Vision Transformers, [Paper]
  • (arXiv 2023.08) Which Tokens to Use? Investigating Token Reduction in Vision Transformers, [Paper], [Code]
  • (arXiv 2023.08) DiT: Efficient Vision Transformers with Dynamic Token Routing, [Paper], [Code]
  • (arXiv 2023.08) Revisiting Vision Transformer from the View of Path Ensemble, [Paper]
  • (arXiv 2023.08) Patch Is Not All You Need, [Paper]
  • (arXiv 2023.08) ConcatPlexer: Additional Dim1 Batching for Faster ViTs, [Paper], [Code]
  • (arXiv 2023.08) SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation, [Paper], [Code]
  • (arXiv 2023.08) SG-Former: Self-guided Transformer with Evolving Token Reallocation, [Paper], [Code]
  • (arXiv 2023.08) Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers, [Paper]
  • (arXiv 2023.08) Learning Diverse Features in Vision Transformers for Improved Generalization, [Paper]
  • (arXiv 2023.09) DAT++: Spatially Dynamic Vision Transformer with Deformable Attention, [Paper], [Code]
  • (arXiv 2023.09) ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer, [Paper]
  • (arXiv 2023.09) Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts, [Paper]
  • (arXiv 2023.09) CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution, [Paper], [Code]
  • (arXiv 2023.09) SparseSwin: Swin Transformer with Sparse Transformer Block, [Paper], [Code]
  • (arXiv 2023.09) DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices, [Paper]
  • (arXiv 2023.09) Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit, [Paper], [Code]
  • (arXiv 2023.09) Interpretability-Aware Vision Transformer, [Paper]
  • (arXiv 2023.09) Replacing softmax with ReLU in Vision Transformers, [Paper]
  • (arXiv 2023.09) Interpret Vision Transformers as ConvNets with Dynamic Convolutions, [Paper]
  • (arXiv 2023.09) RMT: Retentive Networks Meet Vision Transformers, [Paper]
  • (arXiv 2023.09) DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion, [Paper]
  • (arXiv 2023.09) Associative Transformer Is A Sparse Representation Learner, [Paper]
  • (arXiv 2023.09) Masked Image Residual Learning for Scaling Deeper Vision Transformers, [Paper]
  • (arXiv 2023.09) Efficient Low-rank Backpropagation for Vision Transformer Adaptation, [Paper]
  • (arXiv 2023.09) Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words, [Paper]
  • (arXiv 2023.09) Vision Transformers Need Registers, [Paper]
  • (arXiv 2023.10) PPT: Token Pruning and Pooling for Efficient Vision Transformers, [Paper]
  • (arXiv 2023.10) Selective Feature Adapter for Dense Vision Transformers, [Paper]
  • (arXiv 2023.10) GET: Group Event Transformer for Event-Based Vision, [Paper], [Code]
  • (arXiv 2023.10) ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer, [Paper]
  • (arXiv 2023.10) SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers, [Paper], [Code]
  • (arXiv 2023.10) TiC: Exploring Vision Transformer in Convolution, [Paper], [Code]
  • (arXiv 2023.10) Sub-token ViT Embedding via Stochastic Resonance Transformers, [Paper]
  • (arXiv 2023.10) No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling, [Paper]
  • (arXiv 2023.10) Plug n' Play: Channel Shuffle Module for Enhancing Tiny Vision Transformers, [Paper]
  • (arXiv 2023.10) Hierarchical Side-Tuning for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.10) EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention, [Paper], [Code]
  • (arXiv 2023.10) Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing, [Paper], [Code]
  • (arXiv 2023.10) Accelerating Vision Transformers Based on Heterogeneous Attention Patterns, [Paper]
  • (arXiv 2023.10) MatFormer: Nested Transformer for Elastic Inference, [Paper]
  • (arXiv 2023.10) Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems, [Paper]
  • (arXiv 2023.10) ConvNets Match Vision Transformers at Scale, [Paper]
  • (arXiv 2023.10) MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory, [Paper], [Code]
  • (arXiv 2023.10) Analyzing Vision Transformers for Image Classification in Class Embedding Space, [Paper]
  • (arXiv 2023.11) Improving Robustness for Vision Transformer with a Simple Dynamic Scanning Augmentation, [Paper]
  • (arXiv 2023.11) Scattering Vision Transformer: Spectral Mixing Matters, [Paper], [Project]
  • (arXiv 2023.11) GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values, [Paper]
  • (arXiv 2023.11) Mini but Mighty: Finetuning ViTs with Mini Adapters, [Paper], [Code]
  • (arXiv 2023.11) A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis, [Paper], [Code]
  • (arXiv 2023.11) SBCFormer: Lightweight Network Capable of Full-size ImageNet Classification at 1 FPS on Single Board Computers, [Paper], [Code]
  • (arXiv 2023.11) FMViT: A multiple-frequency mixing Vision Transformer, [Paper], [Code]
  • (arXiv 2023.11) Cross-Axis Transformer with 2D Rotary Embeddings, [Paper]
  • (arXiv 2023.11) Aggregate, Decompose, and Fine-Tune: A Simple Yet Effective Factor-Tuning Method for Vision Transformer, [Paper], [Code]
  • (arXiv 2023.11) Advancing Vision Transformers with Group-Mix Attention, [Paper], [Code]
  • (arXiv 2023.11) Token Recycling for Efficient Sequential Inference with Vision Transformers, [Paper]
  • (arXiv 2023.11) TransNeXt: Robust Foveal Visual Perception for Vision Transformers, [Paper]
  • (arXiv 2023.11) Stochastic Vision Transformers with Wasserstein Distance-Aware Attention, [Paper]
  • (arXiv 2023.11) Improving Faithfulness for Vision Transformers, [Paper]
  • (arXiv 2023.11) SCHEME: Scalable Channel Mixer for Vision Transformers, [Paper]
  • (arXiv 2023.12) MABViT -- Modified Attention Block Enhances Vision Transformers, [Paper]
  • (arXiv 2023.12) Class-Discriminative Attention Maps for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.12) Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost, [Paper], [Code]
  • (arXiv 2023.12) Weight Subcloning: Direct Initialization of Transformers Using Larger Pretrained Ones, [Paper]
  • (arXiv 2023.12) Cached Transformers: Improving Transformers with Differentiable Memory Cache, [Paper]
  • (arXiv 2023.12) Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers, [Paper]
  • (arXiv 2023.12) Merging Vision Transformers from Different Tasks and Domains, [Paper]
  • (arXiv 2023.12) Universal Pyramid Adversarial Training for Improved ViT Performance, [Paper]
  • (arXiv 2024.01) Token Propagation Controller for Efficient Vision Transformer, [Paper]

Clustering

  • (arXiv 2022.06) Vision Transformer for Contrastive Clustering, [Paper]
  • (arXiv 2023.04) Fairness in Visual Clustering: A Novel Transformer Clustering Approach, [Paper]
  • (arXiv 2023.06) Dynamic Clustering Transformer Network for Point Cloud Segmentation, [Paper]

Completion

  • (arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]
  • (arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]
  • (arXiv 2023.03) FishDreamer: Towards Fisheye Semantic Completion via Unified Image Outpainting and Segmentation, [Paper], [Code]
  • (arXiv 2023.04) Contour Completion by Transformers and Its Application to Vector Font Data, [Paper]
  • (arXiv 2023.07) CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion, [Paper]
  • (arXiv 2023.10) Distance-based Weighted Transformer Network for Image Completion, [Paper]
  • (arXiv 2024.01) CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers, [Paper], [Code]

Compression

  • (arXiv 2021.10) Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization, [Paper]
  • (arXiv 2021.11) Transformer-based Image Compression, [Paper]
  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper], [Code]
  • (arXiv 2021.12) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
  • (arXiv 2022.01) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
  • (arXiv 2022.02) Entroformer: A Transformer-based Entropy Model for Learned Image Compression, [Paper], [Code]
  • (arXiv 2022.03) Unified Visual Transformer Compression, [Paper], [Code]
  • (arXiv 2022.03) Transformer Compressed Sensing via Global Image Tokens, [Paper], [supplementary]
  • (arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]
  • (arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [Paper]
  • (arXiv 2022.04) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [Paper]
  • (arXiv 2022.06) VCT: A Video Compression Transformer, [Paper], [Code]
  • (arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible Compressive Learning, [Paper], [Code]
  • (arXiv 2022.08) Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation, [Paper], [Code]
  • (arXiv 2022.08) Unified Normalization for Accelerating and Stabilizing Transformers, [Paper], [Code]
  • (arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for Image Compressive Sensing, [Paper]
  • (arXiv 2022.09) Attacking Compressed Vision Transformers, [Paper]
  • (arXiv 2023.01) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer, [Paper]
  • (arXiv 2023.03) SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage, [Paper], [Code]
  • (arXiv 2023.03) Learned Image Compression with Mixed Transformer-CNN Architectures, [Paper], [Code]
  • (arXiv 2023.04) Optimization-Inspired Cross-Attention Transformer for Compressive Sensing, [Paper], [Code]
  • (arXiv 2023.05) ROI-based Deep Image Compression with Swin Transformers, [Paper]
  • (arXiv 2023.05) Transformer-based Variable-rate Image Compression with Region-of-interest Control, [Paper]
  • (arXiv 2023.06) Efficient Contextformer: Spatio-Channel Window Attention for Fast Context Modeling in Learned Image Compression, [Paper]
  • (arXiv 2023.07) AICT: An Adaptive Image Compression Transformer, [Paper]
  • (arXiv 2023.07) JPEG Quantized Coefficient Recovery via DCT Domain Spatial-Frequential Transformer, [Paper]
  • (arXiv 2023.09) Compressing Vision Transformers for Low-Resource Visual Learning, [Paper]
  • (arXiv 2023.09) CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs, [Paper]
  • (arXiv 2023.10) USDC: Unified Static and Dynamic Compression for Visual Transformer, [Paper]
  • (arXiv 2023.10) Frequency-Aware Transformer for Learned Image Compression, [Paper]
  • (arXiv 2023.11) White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is, [Paper], [Code]
  • (arXiv 2023.11) Corner-to-Center Long-range Context Model for Efficient Learned Image Compression, [Paper]
  • (arXiv 2023.12) Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks, [Paper], [Code]
  • (arXiv 2024.01) UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer, [Paper]

Cross-view

  • (arXiv 2022.03) Mutual Generative Transformer Learning for Cross-view Geo-localization, [Paper]
  • (arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]

Crowd

  • (arXiv 2021.04) TransCrowd: Weakly-Supervised Crowd Counting with Transformer, [Paper], [Code]
  • (arXiv 2021.05) Boosting Crowd Counting with Transformers, [Paper], [Code]
  • (arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]
  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper], [Code]
  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
  • (arXiv 2022.01) Scene-Adaptive Attention Network for Crowd Counting, [Paper]
  • (arXiv 2022.03) An End-to-End Transformer Model for Crowd Localization, [Paper]
  • (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]
  • (arXiv 2022.06) Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation, [Paper]
  • (arXiv 2022.08) CounTR: Transformer-based Generalised Visual Counting, [Paper]
  • (arXiv 2023.01) RGB-T Multi-Modal Crowd Counting Based on Transformer, [Paper], [Code]
  • (arXiv 2023.03) InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views, [Paper]
  • (arXiv 2023.05) Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection, [Paper]
  • (arXiv 2023.10) Query-adaptive DETR for Crowded Pedestrian Detection, [Paper]
  • (arXiv 2023.12) Regressor-Segmenter Mutual Prompt Learning for Crowd Counting, [Paper]
  • (arXiv 2024.01) Gramformer: Learning Crowd Counting via Graph-Modulated Transformer, [Paper], [Code]

Deblurring

  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
  • (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]
  • (arXiv 2022.04) VDTR: Video Deblurring with Transformer, [Paper], [Code]
  • (arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer, [Paper]
  • (arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring, [Paper], [Code]
  • (arXiv 2023.03) Image Deblurring by Exploring In-depth Properties of Transformer, [Paper]
  • (arXiv 2023.09) Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring, [Paper], [Code]

Depth

  • (arXiv 2020.11) Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
  • (arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]
  • (arXiv 2021.09) Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning, [Paper]
  • (arXiv 2022.02) GLPanoDepth: Global-to-Local Panoramic Depth Estimation, [Paper]
  • (arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
  • (arXiv 2022.03) OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion, [Paper]
  • (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]
  • (arXiv 2022.03) DepthGAN: GAN-based Depth Generation of Indoor Scenes from Semantic Layouts, [Paper]
  • (arXiv 2022.03) DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation, [Paper], [Code]
  • (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]
  • (arXiv 2022.04) SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation, [Paper], [Project]
  • (arXiv 2022.04) Multi-Frame Self-Supervised Depth with Transformers, [Paper], [Project]
  • (arXiv 2022.05) SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation, [Paper]
  • (arXiv 2022.05) Depth Estimation with Simplified Transformer, [Paper]
  • (arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers, [Paper]
  • (arXiv 2022.06) SparseFormer: Attention-based Depth Completion Network, [Paper]
  • (arXiv 2022.06) Forecasting of depth and ego-motion with transformers and self-supervision, [Paper]
  • (arXiv 2022.07) Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion, [Paper], [Code]
  • (arXiv 2022.08) MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer, [Paper], [Code]
  • (arXiv 2022.09) TODE-Trans: Transparent Object Depth Estimation with Transformer, [Paper], [Code]
  • (arXiv 2022.10) Context-Enhanced Stereo Transformer, [Paper], [Code]
  • (arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation, [Paper]
  • (arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation, [Paper], [Code]
  • (arXiv 2022.12) Event-based Monocular Dense Depth Estimation with Recurrent Transformers, [Paper]
  • (arXiv 2022.12) ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation, [Paper]
  • (arXiv 2023.01) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes, [Paper]
  • (arXiv 2023.01) SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [Paper]
  • (arXiv 2023.02) URCDC-Depth: Uncertainty Rectified Cross-Distillation with CutFlip for Monocular Depth Estimation, [Paper], [Code]
  • (arXiv 2023.03) STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model, [Paper], [Code]
  • (arXiv 2023.03) DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation, [Paper]
  • (arXiv 2023.03) DEHRFormer: Real-time Transformer for Depth Estimation and Haze Removal from Varicolored Haze Scenes, [Paper]
  • (arXiv 2023.03) Channel-Aware Distillation Transformer for Depth Estimation on Nano Drones, [Paper]
  • (arXiv 2023.04) EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation, [Paper]
  • (arXiv 2023.04) CompletionFormer: Depth Completion with Convolutions and Vision Transformers, [Paper], [Code]
  • (arXiv 2023.08) Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN, [Paper]
  • (arXiv 2023.08) Semi-Supervised Semantic Depth Estimation using Symbiotic Transformer and NearFarMix Augmentation, [Paper]
  • (arXiv 2023.09) SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation, [Paper], [Code]
  • (arXiv 2023.10) GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation, [Paper]
  • (arXiv 2023.10) FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus, [Paper]
  • (arXiv 2023.10) Metrically Scaled Monocular Depth Estimation through Sparse Priors for Underwater Robots, [Paper], [Code]
  • (arXiv 2023.12) Transformers in Unsupervised Structure-from-Motion, [Paper], [Code]

Deepfake Detection

  • (arXiv 2021.02) Deepfake Video Detection Using Convolutional Vision Transformer, [Paper]
  • (arXiv 2021.04) Deepfake Detection Scheme Based on Vision Transformer and Distillation, [Paper]
  • (arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]
  • (arXiv 2021.07) Combining EfficientNet and Vision Transformers for Video Deepfake Detection, [Paper]
  • (arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]
  • (arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper], [Code]
  • (arXiv 2022.06) Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection, [Paper]
  • (arXiv 2022.07) Deepfake Video Detection with Spatiotemporal Dropout Transformer, [Paper]
  • (arXiv 2022.07) Hybrid Transformer Network for Deepfake Detection, [Paper]
  • (arXiv 2022.09) Deep Convolutional Pooling Transformer for Deepfake Detection, [Paper]
  • (arXiv 2023.04) Deepfake Detection with Deep Learning: Convolutional Neural Networks versus Transformers, [Paper]
  • (arXiv 2023.07) Deepfake Video Detection Using Generative Convolutional Vision Transformer, [Paper], [Code]
  • (arXiv 2023.07) Self-Supervised Graph Transformer for Deepfake Detection, [Paper]
  • (arXiv 2023.09) DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention, [Paper]

Diffusion

  • (arXiv 2022.12) Scalable Diffusion Models with Transformers, [Paper], [Code]
  • (arXiv 2023.03) Masked Diffusion Transformer is a Strong Image Synthesizer, [Paper], [Code]
  • (arXiv 2023.04) ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis, [Paper]
  • (arXiv 2023.06) DFormer: Diffusion-guided Transformer for Universal Image Segmentation, [Paper], [Code]
  • (arXiv 2023.08) Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers, [Paper]
  • (arXiv 2023.09) Large-Vocabulary 3D Diffusion Model with Transformer, [Paper], [Project]
  • (arXiv 2023.09) Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models, [Paper], [Project]
  • (arXiv 2023.12) DiffiT: Diffusion Vision Transformers for Image Generation, [Paper]
  • (arXiv 2023.12) DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers, [Paper]

Dehazing

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]
  • (arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper]
  • (arXiv 2022.10) Semi-UFormer: Semi-supervised Uncertainty-aware Transformer for Image Dehazing, [Paper]
  • (arXiv 2023.03) SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency, [Paper]
  • (arXiv 2023.04) A Data-Centric Solution to NonHomogeneous Dehazing via Vision Transformer, [Paper], [Code]
  • (arXiv 2023.05) NightHazeFormer: Single Nighttime Haze Removal Using Prior Query Transformer, [Paper]
  • (arXiv 2023.08) MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing, [Paper], [Code]
  • (arXiv 2023.12) DHFormer: A Vision Transformer-Based Attention Module for Image Dehazing, [Paper]
  • (arXiv 2024.01) WaveletFormerNet: A Transformer-based Wavelet Network for Real-world Non-homogeneous and Dense Fog Removal, [Paper]

Deraining

  • (arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [Paper]
  • (arXiv 2022.07) Magic ELF: Image Deraining Meets Association Learning and Transformer, [Paper], [Code]
  • (arXiv 2023.03) Learning A Sparse Transformer Network for Effective Image Deraining, [Paper], [Code]
  • (arXiv 2023.08) Learning Image Deraining Transformer Network with Dynamic Dual Self-Attention, [Paper]
  • (arXiv 2023.08) Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks, [Paper]
  • (arXiv 2024.01) NightRain: Nighttime Video Deraining via Adaptive-Rain-Removal and Adaptive-Correction, [Paper]

Denoising

  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
  • (arXiv 2022.03) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper], [Code]
  • (arXiv 2022.03) Practical Blind Denoising via Swin-Conv-UNet and Data Synthesis, [Paper], [Code]
  • (arXiv 2022.05) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [Paper]
  • (arXiv 2022.05) Dense residual Transformer for image denoising, [Paper]
  • (arXiv 2022.07) DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer, [Paper]
  • (arXiv 2022.11) Spatial-Spectral Transformer for Hyperspectral Image Denoising, [Paper], [Code]
  • (arXiv 2023.03) Xformer: Hybrid X-Shaped Transformer for Image Denoising, [Paper]
  • (arXiv 2023.03) Hybrid Spectral Denoising Transformer with Learnable Query, [Paper]
  • (arXiv 2023.04) Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising, [Paper], [Code]
  • (arXiv 2023.04) Exploration of Lightweight Single Image Denoising with Transformers and Truly Fair Training, [Paper], [Code]
  • (arXiv 2023.04) Self-Supervised Image Denoising for Real-World Images with Context-aware Transformer, [Paper]
  • (arXiv 2023.04) DDT: Dual-branch Deformable Transformer for Image Denoising, [Paper], [Code]
  • (arXiv 2023.04) EWT: Efficient Wavelet-Transformer for Single Image Denoising, [Paper]
  • (arXiv 2023.04) NoiseTrans: Point Cloud Denoising with Transformers, [Paper]
  • (arXiv 2023.05) RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset, [Paper]
  • (arXiv 2023.05) Degradation-Noise-Aware Deep Unfolding Transformer for Hyperspectral Image Denoising, [Paper]
  • (arXiv 2023.10) Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising, [Paper]
  • (arXiv 2023.10) A cross Transformer for image denoising, [Paper], [Code]
  • (arXiv 2023.10) Complex Image Generation SwinTransformer Network for Audio Denoising, [Paper]
  • (arXiv 2024.01) Denoising Vision Transformers, [Paper], [Code]
  • (arXiv 2024.01) Hyperspectral Image Denoising via Spatial-Spectral Recurrent Transformer, [Paper], [Code]

Detection

  • (ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
  • (ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
  • (CVPR'21) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper], [Code]
  • (arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]
  • (arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
  • (arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
  • (arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
  • (arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
  • (arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
  • (arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]
  • (arXiv 2021.03) SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving, [Paper]
  • (arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]
  • (arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]
  • (arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]
  • (arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]
  • (arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]
  • (arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]
  • (arXiv 2021.05) Content-Augmented Feature Pyramid Network with Light Linear Transformers, [Paper]
  • (arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper]
  • (arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper],[Project]
  • (arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]
  • (arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]
  • (arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
  • (arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper]
  • (arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper],[Code]
  • (arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper],[Code]
  • (arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]
  • (arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper],[Code]
  • (arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper],[Code]
  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
  • (arXiv 2021.08) TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios, [Paper]
  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper],[Code]
  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
  • (arXiv 2021.10) ViDT: An Efficient and Effective Fully Transformer-based Object Detector, [Paper],[Code]
  • (arXiv 2021.10) DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries, [Paper],[Code]
  • (arXiv 2021.10) CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector, [Paper],[Code]
  • (arXiv 2021.11) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper],[Code]
  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
  • (arXiv 2021.11) Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity, [Paper], [Code]
  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.12) BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View, [Paper]
  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
  • (arXiv 2022.01) DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR, [Paper], [Code]
  • (arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]
  • (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]
  • (arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]
  • (arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]
  • (arXiv 2022.03) Progressive End-to-End Object Detection in Crowded Scenes, [Paper], [Code]
  • (arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]
  • (arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]
  • (arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper]
  • (arXiv 2022.03) TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers, [Paper], [Code]
  • (arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]
  • (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]
  • (arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]
  • (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]
  • (arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]
  • (arXiv 2022.04) CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection, [Paper]
  • (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]
  • (arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object Detector, [Paper], [Code]
  • (arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]
  • (arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object detection, [Paper]
  • (arXiv 2022.04) Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection, [Paper]
  • (arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning, [Paper]
  • (arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object Detection with Transformers, [Paper], [Code1], [Code2]
  • (arXiv 2022.05) Simple Open-Vocabulary Object Detection with Vision Transformers, [Paper], [Code]
  • (arXiv 2022.05) Vision Transformer Adapter for Dense Predictions, [Paper], [Code]
  • (arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection, [Paper]
  • (arXiv 2022.05) Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer, [Paper],[Code]
  • (arXiv 2022.05) Transformer-based out-of-distribution detection for clinically safe segmentation, [Paper]
  • (arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object Detection Transformer, [Paper],[Code]
  • (arXiv 2022.06) Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation, [Paper],[Code]
  • (arXiv 2022.06) DETR++: Taming Your Multi-Scale Detection Transformer, [Paper],[Code]
  • (arXiv 2022.06) Visual Transformer for Object Detection, [Paper]
  • (arXiv 2022.06) Efficient Decoder-free Object Detection with Transformers, [Paper]
  • (arXiv 2022.07) QKVA grid: Attention in Image Perspective and Stacked DETR, [Paper],[Code]
  • (arXiv 2022.07) Symmetry-Aware Transformer-based Mirror Detection, [Paper],[Code]
  • (arXiv 2022.07) Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection, [Paper]
  • (arXiv 2022.07) Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection, [Paper]
  • (arXiv 2022.07) Conditional DETR V2: Efficient Detection Transformer with Box Queries, [Paper]
  • (arXiv 2022.07) Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment, [Paper]
  • (arXiv 2022.07) DETRs with Hybrid Matching, [Paper], [Code]
  • (arXiv 2022.07) Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion, [Paper],[Code]
  • (arXiv 2022.08) An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection, [Paper]
  • (arXiv 2022.08) Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors, [Paper], [Project]
  • (arXiv 2022.08) Swin-transformer-yolov5 For Real-time Wine Grape Bunch Detection, [Paper]
  • (arXiv 2022.09) SEFormer: Structure Embedding Transformer for 3D Object Detection, [Paper]
  • (arXiv 2022.09) Vision Transformers and YoloV5 based Driver Drowsiness Detection Framework, [Paper]
  • (arXiv 2022.09) SAR Ship Detection based on Swin Transformer and Feature Enhancement Feature Pyramid Network, [Paper]
  • (arXiv 2022.09) CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection, [Paper],[Code]
  • (arXiv 2022.09) A lightweight Transformer-based model for fish landmark detection, [Paper]
  • (arXiv 2022.09) ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers, [Paper]
  • (arXiv 2022.09) CenterFormer: Center-based Transformer for 3D Object Detection, [Paper], [Project]
  • (arXiv 2022.09) CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer, [Paper]
  • (arXiv 2022.10) Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning, [Paper]
  • (arXiv 2022.10) MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving, [Paper]
  • (arXiv 2022.10) Li3DeTr: A LiDAR based 3D Detection Transformer, [Paper]
  • (arXiv 2022.10) Pair DETR: Contrastive Learning Speeds Up DETR Training, [Paper]
  • (arXiv 2022.11) SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [Paper]
  • (arXiv 2022.11) Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining, [Paper]
  • (arXiv 2022.11) Teach-DETR: Better Training DETR with Teachers, [Paper],[Code]
  • (arXiv 2022.11) DETRs with Collaborative Hybrid Assignments Training, [Paper],[Code]
  • (arXiv 2022.11) DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding, [Paper],[Code]
  • (arXiv 2022.11) How to Backpropagate through Hungarian in Your DETR, [Paper]
  • (arXiv 2022.11) Concealed Object Detection for Passive Millimeter-Wave Security Imaging Based on Task-Aligned Detection Transformer, [Paper]
  • (arXiv 2022.12) Recurrent Vision Transformers for Object Detection with Event Cameras, [Paper]
  • (arXiv 2022.12) CNN-transformer mixed model for object detection, [Paper]
  • (arXiv 2022.12) DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention, [Paper]
  • (arXiv 2022.12) GPTR: Gestalt-Perception Transformer for Diagram Object Detection, [Paper]
  • (arXiv 2023.01) Dynamic Background Reconstruction via Transformer for Infrared Small Target Detection, [Paper]
  • (arXiv 2023.01) Learning to View: Decision Transformers for Active Object Detection, [Paper]
  • (arXiv 2023.01) Aerial Image Object Detection With Vision Transformer Detector, [Paper]
  • (arXiv 2023.01) Priors are Powerful: Improving a Transformer for Multi-camera 3D Detection with 2D Priors, [Paper]
  • (arXiv 2023.01) IH-ViT: Vision Transformer-based Integrated Circuit Appearance Defect Detection, [Paper]
  • (arXiv 2023.02) Team-DETR: Guide Queries as a Professional Team in Detection Transformers, [Paper],[Code]
  • (arXiv 2023.02) Hyneter: Hybrid Network Transformer for Object Detection, [Paper]
  • (arXiv 2023.02) KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer, [Paper],[Code]
  • (arXiv 2023.03) D2Q-DETR: Decoupling and Dynamic Queries for Oriented Object Detection with Transformers, [Paper]
  • (arXiv 2023.03) FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation, [Paper],[Code]
  • (arXiv 2023.03) A Computer Vision Enabled damage detection model with improved YOLOv5 based on Transformer Prediction Head, [Paper]
  • (arXiv 2023.03) ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer, [Paper],[Code]
  • (arXiv 2023.03) Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR, [Paper],[Code]
  • (arXiv 2023.03) FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors, [Paper]
  • (arXiv 2023.03) Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch, [Paper]
  • (arXiv 2023.03) SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers, [Paper]
  • (arXiv 2023.03) MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer, [Paper]
  • (arXiv 2023.03) Transformer-based Multi-Instance Learning for Weakly Supervised Object Detection, [Paper]
  • (arXiv 2023.03) Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers, [Paper],[Code]
  • (arXiv 2023.03) T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals, [Paper]
  • (arXiv 2023.03) SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer, [Paper],[Code]
  • (arXiv 2023.04) Siamese DETR, [Paper],[Code]
  • (arXiv 2023.04) Training Strategies for Vision Transformers for Object Detection, [Paper]
  • (arXiv 2023.04) Language-aware Multiple Datasets Detection Pretraining for DETRs, [Paper]
  • (arXiv 2023.04) Detection Transformer with Stable Matching, [Paper],[Code]
  • (arXiv 2023.04) Use the Detection Transformer as a Data Augmenter, [Paper]
  • (arXiv 2023.04) DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection, [Paper]
  • (arXiv 2023.04) DETRs Beat YOLOs on Real-time Object Detection, [Paper]
  • (arXiv 2023.04) DETR-based Layered Clothing Segmentation and Fine-Grained Attribute Recognition, [Paper]
  • (arXiv 2023.04) Align-DETR: Improving DETR with Simple IoU-aware BCE loss, [Paper],[Code]
  • (arXiv 2023.04) Transformer-based stereo-aware 3D object detection from binocular images, [Paper]
  • (arXiv 2023.05) End to End Lane detection with One-to-Several Transformer, [Paper],[Code]
  • (arXiv 2023.05) TransCAR: Transformer-based Camera-And-Radar Fusion for 3D Object Detection, [Paper],[Code]
  • (arXiv 2023.05) SSD-MonoDTR: Supervised Scale-constrained Deformable Transformer for Monocular 3D Object Detection, [Paper],[Code]
  • (arXiv 2023.05) RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection, [Paper]
  • (arXiv 2023.06) detrex: Benchmarking Detection Transformers, [Paper],[Code]
  • (arXiv 2023.06) Revisiting Token Pruning for Object Detection and Instance Segmentation, [Paper]
  • (arXiv 2023.06) Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images, [Paper]
  • (arXiv 2023.06) C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection, [Paper],[Code]
  • (arXiv 2023.07) PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers, [Paper]
  • (arXiv 2023.07) Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection, [Paper],[Code]
  • (arXiv 2023.07) Box-DETR: Understanding and Boxing Conditional Spatial Queries, [Paper],[Code]
  • (arXiv 2023.07) Semi-DETR: Semi-Supervised Object Detection with Detection Transformers, [Paper],[Code]
  • (arXiv 2023.07) Cascade-DETR: Delving into High-Quality Universal Object Detection, [Paper],[Code]
  • (arXiv 2023.07) Less is More: Focus Attention for Efficient DETR, [Paper], [Code]
  • (arXiv 2023.07) DQ-Det: Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation, [Paper]
  • (arXiv 2023.07) Enhancing Your Trained DETRs with Box Refinement, [Paper], [Code]
  • (arXiv 2023.07) RecursiveDet: End-to-End Region-based Recursive Object Detection, [Paper], [Code]
  • (arXiv 2023.07) SimDETR: Simplifying self-supervised pretraining for DETR, [Paper]
  • (arXiv 2023.08) Revisiting DETR Pre-training for Object Detection, [Paper]
  • (arXiv 2023.08) DETR Doesn’t Need Multi-Scale or Locality Design, [Paper], [Code]
  • (arXiv 2023.08) FocalFormer3D: Focusing on Hard Instance for 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.08) V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.08) SODFormer: Streaming Object Detection with Transformer Using Events and Frames, [Paper], [Code]
  • (arXiv 2023.08) Spatial Transform Decoupling for Oriented Object Detection, [Paper], [Code]
  • (arXiv 2023.08) Towards a High-Performance Object Detector: Insights from Drone Detection Using ViT and CNN-based Deep Learning Models, [Paper]
  • (arXiv 2023.08) Enhancing Landmark Detection in Cluttered Real-World Scenarios with Vision Transformers, [Paper]
  • (arXiv 2023.09) Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.09) OccupancyDETR: Making Semantic Scene Completion as Straightforward as Object Detection, [Paper], [Code]
  • (arXiv 2023.10) Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.10) Uni3DETR: Unified 3D Detection Transformer, [Paper], [Code]
  • (arXiv 2023.10) SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation, [Paper]
  • (arXiv 2023.10) Rank-DETR for High Quality Object Detection, [Paper], [Code]
  • (arXiv 2023.10) Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images, [Paper]
  • (arXiv 2023.10) Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better Intelligent Transportation Detection, [Paper]
  • (arXiv 2023.10) Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection, [Paper]
  • (arXiv 2023.10) Towards Few-Annotation Learning for Object Detection: Are Transformer-based Models More Efficient?, [Paper]
  • (arXiv 2023.11) AiluRus: A Scalable ViT Framework for Dense Prediction, [Paper], [Code]
  • (arXiv 2023.11) TokenMotion: Motion-Guided Vision Transformer for Video Camouflaged Object Detection Via Learnable Token Selection, [Paper]
  • (arXiv 2023.11) Cal-DETR: Calibrated Detection Transformer, [Paper], [Code]
  • (arXiv 2023.11) FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion, [Paper]
  • (arXiv 2023.11) Algorithms for Object Detection in Substations, [Paper]
  • (arXiv 2023.11) Improved Dense Nested Attention Network Based on Transformer for Infrared Small Target Detection, [Paper]
  • (arXiv 2023.11) Decoupled DETR For Few-shot Object Detection, [Paper]
  • (arXiv 2023.12) RotaTR: Detection Transformer for Dense and Rotated Object, [Paper]
  • (arXiv 2023.12) Explainable Multi-Camera 3D Object Detection with Transformer-Based Saliency Maps, [Paper]
  • (arXiv 2023.12) Context Enhanced Transformer for Single Image Object Detection, [Paper], [Code]
  • (arXiv 2024.01) TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection, [Paper], [Code]
  • (arXiv 2024.01) MS-DETR: Efficient DETR Training with Mixed Supervision, [Paper]
  • (arXiv 2024.01) YOLO-Former: YOLO Shakes Hand With ViT, [Paper]
  • (arXiv 2024.01) Small Object Detection by DETR via Information Augmentation and Adaptive Feature Fusion, [Paper]
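Most of the DETR-family entries above, from the original DETR through "How to Backpropagate through Hungarian in Your DETR", share one core ingredient: a one-to-one bipartite matching between the model's object queries and the ground-truth boxes, computed by minimizing a pairwise cost. A minimal, illustrative sketch of that matching step is below. It is not any paper's implementation: the cost is L1 box distance only (real DETR losses add classification and GIoU terms), and the assignment is brute-force (real implementations use the Hungarian algorithm, e.g. SciPy's `linear_sum_assignment`).

```python
# Illustrative bipartite matching in the style of DETR's set-prediction loss.
# Simplified assumptions: L1 box cost only, brute-force search over
# assignments (fine for tiny examples), boxes as (cx, cy, w, h) tuples.
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def bipartite_match(pred_boxes, gt_boxes):
    """Return (pred_index, gt_index) pairs for the minimum-cost one-to-one
    assignment of ground-truth boxes to predicted queries."""
    best, best_total = None, float("inf")
    # Each permutation picks one distinct prediction per ground-truth box.
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if total < best_total:
            best_total = total
            best = [(p, g) for g, p in enumerate(perm)]
    return sorted(best)


preds = [(0.9, 0.9, 0.2, 0.2), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
gts = [(0.12, 0.1, 0.2, 0.2), (0.5, 0.52, 0.3, 0.3)]
print(bipartite_match(preds, gts))  # [(1, 0), (2, 1)]
```

Each ground truth is matched to its nearest query; unmatched queries are supervised as "no object" in DETR-style training, which is what removes the need for NMS.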

Edge

  • (arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]
  • (arXiv 2022.06) XBound-Former: Toward Cross-scale Boundary Modeling in Transformers, [Paper], [Code]
  • (arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [Paper]
  • (arXiv 2022.06) SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection, [Paper], [Project]
  • (arXiv 2023.07) CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer, [Paper]

Enhancement

  • (arXiv 2021.11) U-shape Transformer for Underwater Image Enhancement, [Paper]
  • (arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
  • (arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]
  • (arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [Paper], [Code]
  • (arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [Paper]
  • (arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement, [Paper]
  • (arXiv 2022.10) End-to-end Transformer for Compressed Video Quality Enhancement, [Paper]
  • (arXiv 2022.12) WavEnhancer: Unifying Wavelet and Transformer for Image Enhancement, [Paper]
  • (arXiv 2022.12) Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method, [Paper], [Code]
  • (arXiv 2023.03) Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement, [Paper]
  • (arXiv 2023.06) Unsupervised Low Light Image Enhancement Using SNR-Aware Swin Transformer, [Paper]
  • (arXiv 2023.06) Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network, [Paper]
  • (arXiv 2023.09) Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy, [Paper], [Code]
  • (arXiv 2023.09) DEFormer: DCT-driven Enhancement Transformer for Low-light Image and Dark Vision, [Paper]
  • (arXiv 2023.10) UWFormer: Underwater Image Enhancement via a Semi-Supervised Multi-Scale Transformer, [Paper]
  • (arXiv 2023.12) A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement, [Paper], [Code]
  • (arXiv 2023.12) Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning, [Paper], [Code]
  • (arXiv 2023.12) A Non-Uniform Low-Light Image Enhancement Method with Multi-Scale Attention Transformer and Luminance Consistency Loss, [Paper], [Code]

Face

  • (arXiv 2021.03) Face Transformer for Recognition, [Paper]
  • (arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]
  • (arXiv 2021.04) TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection, [Paper]
  • (arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]
  • (arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]
  • (arXiv 2021.06) VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots, [Paper]
  • (arXiv 2021.06) MViT: Mask Vision Transformer for Facial Expression Recognition in the wild, [Paper]
  • (arXiv 2021.06) Shuffle Transformer with Feature Alignment for Video Face Parsing, [Paper]
  • (arXiv 2021.06) A Latent Transformer for Disentangled and Identity-Preserving Face Editing, [Paper], [Code]
  • (arXiv 2021.08) FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting, [Paper]
  • (arXiv 2021.08) Learning Fair Face Representation With Progressive Cross Transformer, [Paper]
  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
  • (arXiv 2021.09) TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network, [Paper]
  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper],[Code]
  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper],[Code]
  • (arXiv 2021.09) MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition, [Paper]
  • (arXiv 2021.11) FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations, [Paper]
  • (arXiv 2021.12) SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal, [Paper],[Code]
  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
  • (arXiv 2022.01) RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs, [Paper]
  • (arXiv 2022.03) Protecting Celebrities from DeepFake with Identity Consistency Transformer, [Paper]
  • (arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper],[Code]
  • (arXiv 2022.03) HP-Capsule: Unsupervised Face Part Discovery by Hierarchical Parsing Capsule Network, [Paper]
  • (arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]
  • (arXiv 2022.03) Transformer-based Multimodal Information Fusion for Facial Expression Analysis, [Paper]
  • (arXiv 2022.03) Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing, [Paper]
  • (arXiv 2022.03) Facial Expression Recognition with Swin Transformer, [Paper]
  • (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]
  • (arXiv 2022.04) Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task, [Paper]
  • (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]
  • (arXiv 2022.05) Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild, [Paper]
  • (arXiv 2022.05) Towards Robust Blind Face Restoration with Codebook Lookup Transformer, [Paper], [Code]
  • (arXiv 2022.07) RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection, [Paper]
  • (arXiv 2022.07) TransFA: Transformer-based Representation for Face Attribute Evaluation, [Paper]
  • (arXiv 2022.07) FaceFormer: Scale-aware Blind Face Restoration with Transformers, [Paper]
  • (arXiv 2022.07) AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition, [Paper], [Code]
  • (arXiv 2022.07) Hybrid CNN-Transformer Model For Facial Affect Recognition In the ABAW4 Challenge, [Paper]
  • (arXiv 2022.07) Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining, [Paper]
  • (arXiv 2022.08) Towards Accurate Facial Landmark Detection via Cascaded Transformers, [Paper]
  • (arXiv 2022.10) Multi-Scale Wavelet Transformer for Face Forgery Detection, [Paper]
  • (arXiv 2022.10) Ensemble Learning using Transformers and Convolutional Networks for Masked Face Recognition, [Paper], [Code]
  • (arXiv 2022.10) GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection, [Paper]
  • (arXiv 2022.10) Prepended Domain Transformer: Heterogeneous Face Recognition without Bells and Whistles, [Paper]
  • (arXiv 2022.10) A Saccaded Visual Transformer for General Object Spotting, [Paper]
  • (arXiv 2022.10) Face Pyramid Vision Transformer, [Paper], [Project]
  • (arXiv 2022.10) UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection, [Paper]
  • (arXiv 2022.11) AU-Aware Vision Transformers for Biased Facial Expression Recognition, [Paper]
  • (arXiv 2022.11) Part-based Face Recognition with Vision Transformers, [Paper]
  • (arXiv 2022.12) Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition, [Paper]
  • (arXiv 2023.01) SFI-Swin: Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions, [Paper]
  • (arXiv 2023.02) PhysFormer++: Facial Video-based Physiological Measurement with SlowFast Temporal Difference Transformer, [Paper]
  • (arXiv 2023.02) MorphGANFormer: Transformer-based Face Morphing and De-Morphing, [Paper]
  • (arXiv 2023.03) Enhancing General Face Forgery Detection via Vision Transformer with Low-Rank Adaptation, [Paper]
  • (arXiv 2023.03) DAA: A Delta Age AdaIN operation for age estimation via binary code transformer, [Paper]
  • (arXiv 2023.03) Precise Facial Landmark Detection by Reference Heatmap Transformer, [Paper]
  • (arXiv 2023.03) Quaternion Orthogonal Transformer for Facial Expression Recognition in the Wild, [Paper], [Code]
  • (arXiv 2023.03) Multi-Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling, [Paper]
  • (arXiv 2023.03) Facial Affect Recognition based on Transformer Encoder and Audiovisual Fusion for the ABAW5 Challenge, [Paper]
  • (arXiv 2023.03) Spatial-temporal Transformer for Affective Behavior Analysis, [Paper]
  • (arXiv 2023.03) FER-former: Multi-modal Transformer for Facial Expression Recognition, [Paper]
  • (arXiv 2023.04) Face Transformer: Towards High Fidelity and Accurate Face Swapping, [Paper]
  • (arXiv 2023.04) Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition, [Paper]
  • (arXiv 2023.04) MC-ViViT: Multi-branch Classifier-ViViT to Detect Mild Cognitive Impairment in Older Adults using Facial Videos, [Paper]
  • (arXiv 2023.04) PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting, [Paper]
  • (arXiv 2023.04) MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing, [Paper]
  • (arXiv 2023.05) Noise-Resistant Multimodal Transformer for Emotion Recognition, [Paper]
  • (arXiv 2023.05) LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition, [Paper]
  • (arXiv 2023.05) FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing, [Paper]
  • (arXiv 2023.07) MiVOLO: Multi-input Transformer for Age and Gender Estimation, [Paper], [Code]
  • (arXiv 2023.07) Robust face anti-spoofing framework with Convolutional Vision Transformer, [Paper]
  • (arXiv 2023.08) RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs, [Paper]
  • (arXiv 2023.08) Dual-path TokenLearner for Remote Photoplethysmography-based Physiological Measurement with Facial Videos, [Paper], [Code]
  • (arXiv 2023.08) TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective, [Paper], [Code]
  • (arXiv 2023.08) Blind Face Restoration for Under-Display Camera via Dictionary Guided Transformer, [Paper]
  • (arXiv 2023.08) SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation, [Paper], [Code]
  • (arXiv 2023.08) A Unified Transformer-based Network for multimodal Emotion Recognition, [Paper]
  • (arXiv 2023.09) S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens, [Paper]
  • (arXiv 2023.09) Self-Supervised Transformer with Domain Adaptive Reconstruction for General Face Forgery Video Detection, [Paper]
  • (arXiv 2023.09) Forgery-aware Adaptive Vision Transformer for Face Forgery Detection, [Paper]
  • (arXiv 2023.10) 1DFormer: Learning 1D Landmark Representations via Transformer for Facial Landmark Tracking, [Paper]
  • (arXiv 2023.11) Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers, [Paper], [Code]
  • (arXiv 2023.12) Hypergraph-Guided Disentangled Spectrum Transformer Networks for Near-Infrared Facial Expression Recognition, [Paper]
  • (arXiv 2023.12) Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition, [Paper]
  • (arXiv 2024.01) CATFace: Cross-Attribute-Guided Transformer with Self-Attention Distillation for Low-Quality Face Recognition, [Paper]

Federated Learning

  • (arXiv 2022.11) FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers, [Paper]
  • (arXiv 2023.06) FeSViBS: Federated Split Learning of Vision Transformer with Block Sampling, [Paper], [Code]
  • (arXiv 2023.08) Pelta: Shielding Transformers to Mitigate Evasion Attacks in Federated Learning, [Paper]
  • (arXiv 2023.08) FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning, [Paper], [Code]

Few-shot Learning

  • (arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]
  • (arXiv 2021.04) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]
  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
  • (arXiv 2021.12) Cost Aggregation Is All You Need for Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2022.01) HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning, [Paper]
  • (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]
  • (arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]
  • (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]
  • (arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [Paper]
  • (arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [Paper]
  • (arXiv 2022.05) Few-Shot Diffusion Models, [Paper]
  • (arXiv 2022.06) Prompting Decision Transformer for Few-Shot Policy Generalization, [Paper], [Code]
  • (arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification, [Paper], [Code]
  • (arXiv 2022.07) Few-shot Object Counting and Detection, [Paper], [Code]
  • (arXiv 2022.07) Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification, [Paper]
  • (arXiv 2022.10) BaseTransformers: Attention over base data-points for One Shot Learning, [Paper], [Code]
  • (arXiv 2022.10) FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training, [Paper]
  • (arXiv 2022.10) Feature-Proxy Transformer for Few-Shot Segmentation, [Paper]
  • (arXiv 2022.11) tSF: Transformer-based Semantic Filter for Few-Shot Learning, [Paper]
  • (arXiv 2022.11) Enhancing Few-shot Image Classification with Cosine Transformer, [Paper], [Code]
  • (arXiv 2023.01) Mask Matching Transformer for Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2023.01) Exploring Efficient Few-shot Adaptation for Vision Transformers, [Paper], [Code]
  • (arXiv 2023.01) Continual Few-Shot Learning Using HyperTransformers, [Paper]
  • (arXiv 2023.02) SpatialFormer: Semantic and Target Aware Attentions for Few-Shot Learning, [Paper]
  • (arXiv 2023.04) From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection, [Paper]
  • (arXiv 2023.04) Analogy-Forming Transformers for Few-Shot 3D Parsing, [Paper], [Project]
  • (arXiv 2023.05) Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting, [Paper]
  • (arXiv 2023.07) Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation, [Paper], [Code]
  • (arXiv 2023.07) Target-aware Bi-Transformer for Few-shot Segmentation, [Paper]
  • (arXiv 2023.10) PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification, [Paper]
  • (arXiv 2023.11) Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation, [Paper], [Code]

Fusion

  • (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
  • (arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]
  • (arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [Paper], [Code]
  • (arXiv 2022.07) Array Camera Image Fusion using Physics-Aware Transformers, [Paper]
  • (arXiv 2023.09) Holistic Dynamic Frequency Transformer for Image Fusion and Exposure Correction, [Paper]

Gait

  • (arXiv 2022.04) Spatial Transformer Network on Skeleton-based Gait Recognition, [Paper]
  • (arXiv 2022.06) Exploring Transformers for Behavioural Biometrics: A Case Study in Gait Recognition, [Paper]
  • (arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation, [Paper], [Code]
  • (arXiv 2022.10) Multi-view Gait Recognition based on Siamese Vision Transformer, [Paper]
  • (arXiv 2023.07) GaitFormer: Revisiting Intrinsic Periodicity for Gait Recognition, [Paper]
  • (arXiv 2023.08) GaitPT: Skeletons Are All You Need For Gait Recognition, [Paper]
  • (arXiv 2023.10) HCT: Hybrid Convnet-Transformer for Parkinson’s disease detection and severity prediction from gait, [Paper], [Code]
  • (arXiv 2023.10) GaitFormer: Learning Gait Representations with Noisy Multi-Task Learning, [Paper]
  • (arXiv 2023.11) 1D-Convolutional transformer for Parkinson disease diagnosis from gait, [Paper], [Code]
  • (arXiv 2023.11) GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation, [Paper]
  • (arXiv 2023.12) Learning to Estimate Critical Gait Parameters from Single-View RGB Videos with Transformer-Based Attention Network, [Paper], [Code]

Gaze

  • (arXiv 2021.06) Gaze Estimation using Transformer, [Paper], [Code]
  • (arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]
  • (arXiv 2022.05) Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning, [Paper]
  • (arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation, [Paper], [Code]
  • (arXiv 2022.09) MGTR: End-to-End Mutual Gaze Detection with Transformer, [Paper], [Code]
  • (arXiv 2023.08) Interaction-aware Joint Attention Estimation Using People Attributes, [Paper], [Code]
  • (arXiv 2023.08) DVGaze: Dual-View Gaze Estimation, [Paper], [Code]
  • (arXiv 2023.10) Sharingan: A Transformer-based Architecture for Gaze Following, [Paper]
  • (arXiv 2023.11) Dual input stream transformer for eye-tracking line assignment, [Paper]
  • (arXiv 2024.01) GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance, [Paper]
  • (arXiv 2024.01) EmMixformer: Mix transformer for eye movement recognition, [Paper]

Generative Model

  • (arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
  • (arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]
  • (arXiv 2021.04) VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper], [Code]
  • (arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]
  • (arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]
  • (arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]
  • (arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]
  • (arXiv 2021.10) Generating Symbolic Reasoning Problems with Transformer GANs, [Paper]
  • (arXiv 2021.10) STransGAN: An Empirical Study on Transformer in GANs, [Paper], [Project]
  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
  • (arXiv 2022.01) RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark, [Paper]
  • (arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]
  • (arXiv 2022.06) Cycle text2face: cycle text-to-face gan via transformers, [Paper]
  • (arXiv 2022.06) Cross-Modal Transformer GAN: A Brain Structure-Function Deep Fusing Framework for Alzheimer's Disease, [Paper]
  • (arXiv 2022.08) Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model, [Paper], [Code]
  • (arXiv 2022.08) User-Controllable Latent Transformer for StyleGAN Image Layout Editing, [Paper], [Code]
  • (arXiv 2023.02) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based Image Translation, [Paper]
  • (arXiv 2023.02) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation, [Paper]
  • (arXiv 2023.03) StraIT: Non-autoregressive Generation with Stratified Image Transformer, [Paper]
  • (arXiv 2023.03) Graph Transformer GANs for Graph-Constrained House Generation, [Paper]
  • (arXiv 2023.03) Investigating GANsformer: A Replication Study of a State-of-the-Art Image Generation Model, [Paper]
  • (arXiv 2023.03) StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model, [Paper], [Code]
  • (arXiv 2023.03) Q-RBSA: High-Resolution 3D EBSD Map Generation Using An Efficient Quaternion Transformer Network, [Paper]
  • (arXiv 2023.05) Reinforcement Learning finetuned Vision-Code Transformer for UI-to-Code Generation, [Paper]
  • (arXiv 2023.06) A Conditional Generative Chatbot using Transformer Model, [Paper]
  • (arXiv 2023.07) StylePrompter: All Styles Need Is Attention, [Paper], [Code]
  • (arXiv 2023.07) Enhancing Object Detection in Ancient Documents with Synthetic Data Generation and Transformer-Based Models, [Paper]
  • (arXiv 2023.08) Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts, [Paper], [Code]
  • (arXiv 2023.10) Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers, [Paper]
  • (arXiv 2023.12) GIVT: Generative Infinite-Vocabulary Transformers, [Paper]

Graph

  • (arXiv 2022.09) Graph Reasoning Transformer for Image Parsing, [Paper]
  • (arXiv 2022.11) Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach, [Paper]
  • (arXiv 2022.12) A Generalization of ViT/MLP-Mixer to Graphs, [Paper], [Code]
  • (arXiv 2023.02) Energy Transformer, [Paper], [Code]
  • (arXiv 2023.02) MulGT: Multi-task Graph-Transformer with Task-aware Knowledge Injection and Domain Knowledge-driven Pooling for Whole Slide Image Analysis, [Paper]
  • (arXiv 2023.02) Contrastive Video Question Answering via Video Graph Transformer, [Paper], [Code]
  • (arXiv 2023.03) AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images, [Paper], [Code]
  • (arXiv 2023.03) An Adaptive GViT for Gas Mixture Identification and Concentration Estimation, [Paper]
  • (arXiv 2023.04) Transformer-based Graph Neural Networks for Outfit Generation, [Paper]
  • (arXiv 2023.05) GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation, [Paper]
  • (arXiv 2023.05) Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification, [Paper]
  • (arXiv 2023.06) NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning, [Paper]
  • (arXiv 2023.08) Geometric Learning-Based Transformer Network for Estimation of Segmentation Errors, [Paper]
  • (arXiv 2023.08) Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images, [Paper]
  • (arXiv 2023.08) Deep Prompt Tuning for Graph Transformers, [Paper]
  • (arXiv 2023.11) GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation, [Paper], [Code]
  • (arXiv 2023.11) GMTR: Graph Matching Transformers, [Paper]
  • (arXiv 2023.12) GSGFormer: Generative Social Graph Transformer for Multimodal Pedestrian Trajectory Prediction, [Paper]
  • (arXiv 2023.12) Large-scale Graph Representation Learning of Dynamic Brain Connectome with Transformers, [Paper]
  • (arXiv 2024.01) Graph Transformer GANs with Graph Masked Modeling for Architectural Layout Generation, [Paper]

Hand Gesture

  • (arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density EMG Signals, [Paper]
  • (arXiv 2023.07) Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting, [Paper], [Code]
  • (arXiv 2023.08) Nonrigid Object Contact Estimation With Regional Unwrapping Transformer, [Paper]
  • (arXiv 2023.10) BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer, [Paper]
  • (arXiv 2023.11) Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions, [Paper]
  • (arXiv 2023.12) Reconstructing Hands in 3D with Transformers, [Paper], [Code]

High Dynamic Range Imaging

  • (arXiv 2022.08) Ghost-free High Dynamic Range Imaging with Context-aware Transformer, [Paper], [Code]
  • (arXiv 2023.03) SpiderMesh: Spatial-aware Demand-guided Recursive Meshing for RGB-T Semantic Segmentation, [Paper]
  • (arXiv 2023.04) High Dynamic Range Imaging with Context-aware Transformer, [Paper]
  • (arXiv 2023.05) Alignment-free HDR Deghosting with Semantics Consistent Transformer, [Paper], [Code]
  • (arXiv 2023.09) IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging, [Paper]

HOI

  • (CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper], [Code]
  • (arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
  • (arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
  • (arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
  • (arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
  • (arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]
  • (arXiv 2021.12) Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer, [Paper], [Code]
  • (arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]
  • (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]
  • (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]
  • (arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]
  • (arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]
  • (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]
  • (arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]
  • (arXiv 2022.06) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [Paper], [Code]
  • (arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection, [Paper], [Code]
  • (arXiv 2022.07) IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition, [Paper]
  • (arXiv 2023.04) ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection, [Paper], [Code]
  • (arXiv 2023.08) Exploring Predicate Visual Context in Detecting of Human-Object Interactions, [Paper], [Code]
  • (arXiv 2023.08) Compositional Learning in Transformer-Based Human-Object Interaction Detection, [Paper]
  • (arXiv 2023.08) Agglomerative Transformer for Human-Object Interaction Detection, [Paper], [Code]
  • (arXiv 2024.01) A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition, [Paper], [Code]

Hyperspectral

  • (arXiv 2021.07) SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers, [Paper], [Code]
  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
  • (arXiv 2021.11) Learning A 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution, [Paper]
  • (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper]
  • (arXiv 2022.03) Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification, [Paper]
  • (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]
  • (arXiv 2022.03) Deep Hyperspectral Unmixing using Transformer Network, [Paper], [Code]
  • (arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction, [Paper], [Code]
  • (arXiv 2022.09) S^2-Transformer for Mask-Aware Hyperspectral Image Reconstruction, [Paper]
  • (arXiv 2023.03) MSFA-Frequency-Aware Transformer for Hyperspectral Images Demosaicing, [Paper]
  • (arXiv 2023.04) MethaneMapper: Spectral Absorption aware Hyperspectral Transformer for Methane Detection, [Paper]
  • (arXiv 2023.04) DCN-T: Dual Context Network with Transformer for Hyperspectral Image Classification, [Paper], [Code]
  • (arXiv 2023.05) SST-ReversibleNet: Reversible-prior-based Spectral-Spatial Transformer for Efficient Hyperspectral Image Reconstruction, [Paper], [Code]
  • (arXiv 2023.06) SaaFormer: Spectral-spatial Axial Aggregation Transformer for Hyperspectral Image Classification, [Paper]
  • (arXiv 2023.08) Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction, [Paper], [Code]
  • (arXiv 2023.09) FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pre-Training, [Paper], [Code]
  • (arXiv 2023.10) Multiview Transformer: Rethinking Spatial Information in Hyperspectral Image Classification, [Paper]
  • (arXiv 2023.10) MLP-AMDC: An MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging, [Paper], [Code]
  • (arXiv 2023.11) Learning transformer-based heterogeneously salient graph representation for multimodal fusion classification of hyperspectral image and LiDAR data, [Paper]
  • (arXiv 2023.12) Pixel-to-Abundance Translation: Conditional Generative Adversarial Networks Based on Patch Transformer for Hyperspectral Unmixing, [Paper]

Illumination

  • (arXiv 2022.05) Illumination Adaptive Transformer, [Paper], [Code]

Incremental Learning

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]
  • (arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]
  • (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]
  • (arXiv 2022.08) D3Former: Debiased Dual Distilled Transformer for Incremental Learning, [Paper], [Code]
  • (arXiv 2022.10) A Memory Transformer Network for Incremental Learning, [Paper]
  • (arXiv 2023.01) Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer, [Paper]
  • (arXiv 2023.03) Learning to Grow Artificial Hippocampi in Vision Transformers for Resilient Lifelong Learning, [Paper]
  • (arXiv 2023.03) Dense Network Expansion for Class Incremental Learning, [Paper]
  • (arXiv 2023.03) Semantic-visual Guided Transformer for Few-shot Class-incremental Learning, [Paper]
  • (arXiv 2023.04) Continual Detection Transformer for Incremental Object Detection, [Paper]
  • (arXiv 2023.04) Preserving Locality in Vision Transformers for Class Incremental Learning, [Paper]
  • (arXiv 2023.05) BiRT: Bio-inspired Replay in Vision Transformers for Continual Learning, [Paper], [Code]
  • (arXiv 2023.06) TADIL: Task-Agnostic Domain-Incremental Learning through Task-ID Inference using Transformer Nearest-Centroid Embeddings, [Paper]
  • (arXiv 2023.08) On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers, [Paper], [Code]
  • (arXiv 2023.08) Exemplar-Free Continual Transformer with Convolutions, [Paper], [Project]
  • (arXiv 2023.08) Introducing Language Guidance in Prompt-based Continual Learning, [Paper]
  • (arXiv 2023.11) CMFDFormer: Transformer-based Copy-Move Forgery Detection with Continual Learning, [Paper]
  • (arXiv 2023.12) Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class Incremental Learning, [Paper], [Code]

In-painting

  • (ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [Paper], [Code]
  • (arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [Paper], [Code]
  • (arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]
  • (arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]
  • (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]
  • (arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [Paper]
  • (arXiv 2022.08) Flow-Guided Transformer for Video Inpainting, [Paper], [Code]
  • (arXiv 2022.09) DeViT: Deformed Vision Transformers in Video Inpainting, [Paper]
  • (arXiv 2022.10) TPFNet: A Novel Text In-painting Transformer for Text Removal, [Paper], [Code]
  • (arXiv 2023.01) Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting, [Paper]
  • (arXiv 2023.05) T-former: An Efficient Transformer for Image Inpainting, [Paper], [Code]
  • (arXiv 2023.06) TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting, [Paper], [Code]
  • (arXiv 2023.07) Deficiency-Aware Masked Transformer for Video Inpainting, [Paper], [Code]
  • (arXiv 2023.09) ProPainter: Improving Propagation and Transformer for Video Inpainting, [Paper], [Code]
  • (arXiv 2024.01) Federated Class-Incremental Learning with Prototype Guided Transformer, [Paper], [Code]

Instance Segmentation

  • (CVPR'21) End-to-End Video Instance Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.04) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]
  • (arXiv 2021.12) SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2021.12) A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation, [Paper]
  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]
  • (arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]
  • (arXiv 2022.04) Temporally Efficient Vision Transformer for Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2022.04) Less than Few: Self-Shot Video Instance Segmentation, [Paper]
  • (arXiv 2022.06) Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention, [Paper]
  • (arXiv 2022.06) Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation, [Paper], [Code]
  • (arXiv 2022.07) OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers, [Paper], [Code]
  • (arXiv 2022.07) In Defense of Online Models for Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2022.07) Video Mask Transfiner for High-Quality Video Instance Segmentation, [Paper], [Project]
  • (arXiv 2022.08) InstanceFormer: An Online Video Instance Segmentation Framework, [Paper], [Code]
  • (arXiv 2022.09) RNGDet++: Road Network Graph Detection by Transformer with Instance Segmentation and Multi-scale Features Enhancement, [Paper], [Code]
  • (arXiv 2022.10) AISFormer: Amodal Instance Segmentation with Transformer, [Paper], [Code]
  • (arXiv 2022.10) TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation, [Paper], [Code]
  • (arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance Segmentation, [Paper], [Code]
  • (arXiv 2022.11) Transformer for 3D Scene Instance Segmentation, [Paper], [Code]
  • (arXiv 2023.01) Vision Transformers Are Good Mask Auto-Labelers, [Paper], [Code]
  • (arXiv 2023.01) Towards Robust Video Instance Segmentation with Temporal-Aware Transformer, [Paper]
  • (arXiv 2023.03) MobileInst: Video Instance Segmentation on the Mobile, [Paper]
  • (arXiv 2023.04) DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer, [Paper]
  • (arXiv 2023.06) CalibNet: Dual-branch Cross-modal Calibration for RGB-D Salient Instance Segmentation, [Paper], [Code]
  • (arXiv 2023.08) Partitioned Saliency Ranking with Dense Pyramid Transformers, [Paper], [Code]
  • (arXiv 2023.08) Exploring Transformers for Open-world Instance Segmentation, [Paper]
  • (arXiv 2023.08) Mask Frozen-DETR: High Quality Instance Segmentation with One GPU, [Paper]
  • (arXiv 2023.08) A Unified Query-based Paradigm for Camouflaged Instance Segmentation, [Paper], [Code]
  • (arXiv 2023.08) NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation, [Paper]
  • (arXiv 2023.09) Mask-Attention-Free Transformer for 3D Instance Segmentation, [Paper], [Code]
  • (arXiv 2023.09) TCOVIS: Temporally Consistent Online Video Instance Segmentation, [Paper], [Code]
  • (arXiv 2023.09) 3D Indoor Instance Segmentation in an Open-World, [Paper], [Code]
  • (arXiv 2023.10) MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance Segmentation, [Paper]
  • (arXiv 2023.12) PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation, [Paper],[Code]
  • (arXiv 2023.12) EipFormer: Emphasizing Instance Positions in 3D Instance Segmentation, [Paper]

Knowledge Distillation

  • (arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]
  • (arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]
  • (arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]
  • (arXiv 2022.09) ViTKD: Practical Guidelines for ViT feature knowledge distillation, [Paper], [Code]
  • (arXiv 2022.11) Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling, [Paper]
  • (arXiv 2022.11) D3ETR: Decoder Distillation for Detection Transformer, [Paper]
  • (arXiv 2022.11) DETRDistill: A Universal Knowledge Distillation Framework for DETR-families, [Paper]
  • (arXiv 2022.12) Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning, [Paper], [Code]
  • (arXiv 2022.12) OVO: One-shot Vision Transformer Search with Online distillation, [Paper]
  • (arXiv 2023.02) Knowledge Distillation in Vision Transformers: A Critical Review, [Paper]
  • (arXiv 2023.02) MaskedKD: Efficient Distillation of Vision Transformers with Masked Images, [Paper]
  • (arXiv 2023.03) Multi-view knowledge distillation transformer for human action recognition, [Paper]
  • (arXiv 2023.03) Supervised Masked Knowledge Distillation for Few-Shot Transformers, [Paper], [Code]
  • (arXiv 2023.05) Vision Transformers for Small Histological Datasets Learned through Knowledge Distillation, [Paper]
  • (arXiv 2023.05) Are Large Kernels Better Teachers than Transformers for ConvNets?, [Paper], [Code]
  • (arXiv 2023.07) Cumulative Spatial Knowledge Distillation for Vision Transformers, [Paper]
  • (arXiv 2023.10) CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction, [Paper], [Code]
  • (arXiv 2023.10) Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation, [Paper], [Code]
  • (arXiv 2023.10) One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation, [Paper], [Code]
  • (arXiv 2023.11) Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples, [Paper]
  • (arXiv 2023.12) GIST: Improving Parameter Efficient Fine Tuning via Knowledge Interaction, [Paper]

Lane

  • (arXiv 2022.03) Laneformer: Object-aware Row-Column Transformers for Lane Detection, [Paper]
  • (arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Project]
  • (arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer, [Paper], [Code]
  • (arXiv 2022.09) CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention, [Paper]
  • (arXiv 2023.08) LATR: 3D Lane Detection from Monocular Images with Transformer, [Paper], [Code]

Layout

  • (CVPR'21) Variational Transformer Networks for Layout Generation, [Paper]
  • (arXiv 2021.10) The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE, [Paper]
  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
  • (arXiv 2022.02) ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis, [Paper]
  • (arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]
  • (arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation, [Paper]
  • (arXiv 2022.09) Geometry Aligned Variational Transformer for Image-conditioned Layout Generation, [Paper]
  • (arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer, [Paper], [Code]
  • (arXiv 2022.12) PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image, [Paper]
  • (arXiv 2023.03) DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer, [Paper]
  • (arXiv 2023.04) GUILGET: GUI Layout GEneration with Transformer, [Paper]
  • (arXiv 2023.05) LayoutDM: Transformer-based Diffusion Model for Layout Generation, [Paper]
  • (arXiv 2023.08) MapPrior: Bird’s-Eye View Map Layout Estimation with Generative Models, [Paper], [Code]
  • (arXiv 2023.08) Vision Grid Transformer for Document Layout Analysis, [Paper], [Code]
  • (arXiv 2023.08) Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis, [Paper]
  • (arXiv 2023.10) Dolfin: Diffusion Layout Transformers without Autoencoder, [Paper]
  • (arXiv 2023.11) LayoutPrompter: Awaken the Design Ability of Large Language Models, [Paper], [Code]
  • (arXiv 2023.11) Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation, [Paper], [Project]

Lighting

  • (arXiv 2022.02) Spatio-Temporal Outdoor Lighting Aggregation on Image Sequences using Transformer Networks, [Paper]
  • (arXiv 2023.05) Ray-Patch: An Efficient Decoder for Light Field Transformers, [Paper]

LLM/LVM

  • (arXiv 2023.11) NExT-Chat: An LMM for Chat, Detection and Segmentation, [Paper], [Code]
  • (arXiv 2023.11) u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model, [Paper]
  • (arXiv 2023.11) Towards Open-Ended Visual Recognition with Large Language Model, [Paper], [Code]
  • (arXiv 2023.11) Stable Segment Anything Model, [Paper], [Code]
  • (arXiv 2023.11) Adapter is All You Need for Tuning Visual Tasks, [Paper], [Code]
  • (arXiv 2023.11) LLaFS: When Large-Language Models Meet Few-Shot Segmentation, [Paper], [Code]
  • (arXiv 2023.11) Efficient In-Context Learning in Vision-Language Models for Egocentric Videos, [Paper], [Code]
  • (arXiv 2023.11) Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model, [Paper]
  • (arXiv 2023.11) PoseGPT: Chatting about 3D Human Pose, [Paper], [Code]
  • (arXiv 2023.11) InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation, [Paper], [Code]
  • (arXiv 2023.11) Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models, [Paper], [Code]
  • (arXiv 2023.11) Contrastive Vision-Language Alignment Makes Efficient Instruction Learner, [Paper], [Code]
  • (arXiv 2023.12) Bootstrapping SparseFormers from Vision Foundation Models, [Paper], [Code]
  • (arXiv 2023.12) IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks, [Paper], [Code]
  • (arXiv 2023.12) Segment and Caption Anything, [Paper], [Code]
  • (arXiv 2023.12) EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything, [Paper]
  • (arXiv 2023.12) Segment Any 3D Gaussians, [Paper], [Code]
  • (arXiv 2023.12) Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts, [Paper]
  • (arXiv 2023.12) PixelLM: Pixel Reasoning with Large Multimodal Model, [Paper], [Code]
  • (arXiv 2023.12) Foundation Model Assisted Weakly Supervised Semantic Segmentation, [Paper]
  • (arXiv 2023.12) AI-SAM: Automatic and Interactive Segment Anything Model, [Paper], [Code]
  • (arXiv 2023.12) MobileSAMv2: Faster Segment Anything to Everything, [Paper],[Code]
  • (arXiv 2023.12) MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices, [Paper],[Code]
  • (arXiv 2024.01) One for All: Toward Unified Foundation Models for Earth Vision, [Paper]

Matching

  • (CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]
  • (arXiv 2022.02) Local Feature Matching with Transformers for low-end devices, [Paper], [Code]
  • (arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper], [Code]
  • (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]
  • (arXiv 2022.05) TransforMatcher: Match-to-Match Attention for Semantic Correspondence, [Paper], [Code]
  • (arXiv 2022.07) Deep Laparoscopic Stereo Matching with Transformers, [Paper]
  • (arXiv 2022.08) ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer, [Paper], [Project]
  • (arXiv 2023.01) DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching, [Paper], [Code]
  • (arXiv 2023.03) ParaFormer: Parallel Attention Transformer for Efficient Feature Matching, [Paper]
  • (arXiv 2023.03) Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints, [Paper]
  • (arXiv 2023.03) Adaptive Spot-Guided Transformer for Consistent Local Feature Matching, [Paper], [Code]
  • (arXiv 2023.05) AMatFormer: Efficient Feature Matching via Anchor Matching Transformer, [Paper]
  • (arXiv 2023.08) Multi-scale Alternated Attention Transformer for Generalized Stereo Matching, [Paper]
  • (arXiv 2023.10) FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer, [Paper]
  • (arXiv 2023.11) LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching, [Paper], [Code]
  • (arXiv 2023.12) Latent Space Editing in Transformer-Based Flow Matching, [Paper], [Code]

Matting

  • (arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]
  • (arXiv 2022.08) TransMatting: Enhancing Transparent Objects Matting with Transformers, [Paper], [Code]
  • (arXiv 2022.08) VMFormer: End-to-End Video Matting with Transformer, [Paper], [Project]
  • (arXiv 2023.03) TransMatting: Tri-token Equipped Transformer Model for Image Matting, [Paper], [Project]
  • (arXiv 2023.05) ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers, [Paper]
  • (arXiv 2023.08) EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting, [Paper]

Medical

  • (arXiv 2021.02) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.02) Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.03) SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation, [Paper], [Code]
  • (arXiv 2021.03) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, [Paper], [Code]
  • (arXiv 2021.03) TransMed: Transformers Advance Multi-modal Medical Image Classification, [Paper]
  • (arXiv 2021.03) U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, [Paper]
  • (arXiv 2021.03) UNETR: Transformers for 3D Medical Image Segmentation, [Paper]
  • (arXiv 2021.04) DeepProg: A Multi-modal Transformer-based End-to-end Framework for Predicting Disease Prognosis, [Paper]
  • (arXiv 2021.04) Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification, [Paper]
  • (arXiv 2021.04) Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer, [Paper]
  • (arXiv 2021.04) Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, [Paper]
  • (arXiv 2021.04) Crossmodal Matching Transformer for Interventional in TEVAR, [Paper]
  • (arXiv 2021.04) GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification, [Paper]
  • (arXiv 2021.04) Pyramid Medical Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2021.05) Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy, [Paper]
  • (arXiv 2021.05) Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.05) Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers, [Paper]
  • (arXiv 2021.05) Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers, [Paper]
  • (arXiv 2021.05) Medical Image Segmentation using Squeeze-and-Expansion Transformers, [Paper], [Code]
  • (arXiv 2021.05) POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound, [Paper]
  • (arXiv 2021.05) COTR: Convolution in Transformer Network for End to End Polyp Detection, [Paper]
  • (arXiv 2021.05) PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer, [Paper]
  • (arXiv 2021.06) TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising, [Paper]
  • (arXiv 2021.06) A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation, [Paper]
  • (arXiv 2021.06) Task Transformer Network for Joint MRI Reconstruction and Super-Resolution, [Paper], [Code]
  • (arXiv 2021.06) DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation, [Paper]
  • (arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
  • (arXiv 2021.06) Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image, [Paper]
  • (arXiv 2021.06) MTrans: Multi-Modal Transformer for Accelerated MR Imaging, [Paper], [Code]
  • (arXiv 2021.06) Multi-Compound Transformer for Accurate Biomedical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.07) ResViT: Residual vision transformers for multi-modal medical image synthesis, [Paper]
  • (arXiv 2021.07) E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception, [Paper]
  • (arXiv 2021.07) UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models, [Paper]
  • (arXiv 2021.07) RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting, [Paper], [Code]
  • (arXiv 2021.07) Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation, [Paper]
  • (arXiv 2021.07) Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries, [Paper]
  • (arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification, [Paper]
  • (arXiv 2021.07) Visual Transformer with Statistical Test for COVID-19 Classification, [Paper]
  • (arXiv 2021.07) TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) Few-Shot Domain Adaptation with Polymorphic Transformers, [Paper], [Code]
  • (arXiv 2021.07) TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation, [Paper]
  • (arXiv 2021.07) Surgical Instruction Generation with Transformers, [Paper]
  • (arXiv 2021.07) LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.07) TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations, [Paper], [Code]
  • (arXiv 2021.08) Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, [Paper], [Code]
  • (arXiv 2021.08) Is it Time to Replace CNNs with Transformers for Medical Images, [Paper], [Code]
  • (arXiv 2021.09) nnFormer: Interleaved Transformer for Volumetric Segmentation, [Paper], [Code]
  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
  • (arXiv 2021.09) MISSFormer: An Effective Medical Image Segmentation Transformer, [Paper]
  • (arXiv 2021.09) Eformer: Edge Enhancement based Transformer for Medical Image Denoising, [Paper]
  • (arXiv 2021.09) Transformer-Unet: Raw Image Processing with Unet, [Paper]
  • (arXiv 2021.09) BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation, [Paper]
  • (arXiv 2021.09) GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, [Paper]
  • (arXiv 2021.10) Transformer Assisted Convolutional Network for Cell Instance Segmentation, [Paper]
  • (arXiv 2021.10) A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI images, [Paper]
  • (arXiv 2021.10) Boundary-aware Transformers for Skin Lesion Segmentation, [Paper], [Code]
  • (arXiv 2021.10) Vision Transformer based COVID-19 Detection using Chest X-rays, [Paper]
  • (arXiv 2021.10) Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining, [Paper], [Code]
  • (arXiv 2021.10) CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans, [Paper], [Code]
  • (arXiv 2021.10) COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer, [Paper], [Code]
  • (arXiv 2021.10) Bilateral-ViT for Robust Fovea Localization, [Paper]
  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
  • (arXiv 2021.10) Vision Transformer for Classification of Breast Ultrasound Images, [Paper]
  • (arXiv 2021.11) Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training, [Paper]
  • (arXiv 2021.11) Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention, [Paper]
  • (arXiv 2021.11) Lymph Node Detection in T2 MRI with Transformers, [Paper]
  • (arXiv 2021.11) Mixed Transformer U-Net For Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2021.11) Transformer for Polyp Detection, [Paper]
  • (arXiv 2021.11) DuDoTrans: Dual-Domain Transformer Provides More Attention for Sinogram Restoration in Sparse-View CT Reconstruction, [Paper], [Code]
  • (arXiv 2021.11) A Volumetric Transformer for Accurate 3D Tumor Segmentation, [Paper], [Code]
  • (arXiv 2021.11) Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis, [Paper], [Code]
  • (arXiv 2021.11) MIST-net: Multi-domain Integrative Swin Transformer network for Sparse-View CT Reconstruction, [Paper]
  • (arXiv 2021.12) MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification, [Paper], [Code]
  • (arXiv 2021.12) 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis, [Paper], [Code]
  • (arXiv 2021.12) Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer, [Paper], [Code]
  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper], [Code]
  • (arXiv 2021.12) MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer, [Paper], [Code]
  • (arXiv 2022.01) D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation, [Paper]
  • (arXiv 2022.01) Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images, [Paper], [Code]
  • (arXiv 2022.01) Swin Transformer for Fast MRI, [Paper], [Code]
  • (arXiv 2022.01) ViTBIS: Vision Transformer for Biomedical Image Segmentation, [Paper]
  • (arXiv 2022.01) Improving Across-Dataset Brain Tissue Segmentation Using Transformer, [Paper], [Code]
  • (arXiv 2022.01) SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation, [Paper], [Code]
  • (arXiv 2022.01) ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer, [Paper], [Code]
  • (arXiv 2022.01) Fast MRI Reconstruction: How Powerful Transformers Are, [Paper]
  • (arXiv 2022.01) Class-Aware Generative Adversarial Transformers for Medical Image Segmentation, [Paper]
  • (arXiv 2022.01) RTNet: Relation Transformer Network for Diabetic Retinopathy Multi-lesion Segmentation, [Paper]
  • (arXiv 2022.01) Joint Liver and Hepatic Lesion Segmentation using a Hybrid CNN with Transformer Layers, [Paper]
  • (arXiv 2022.01) DSFormer: A Dual-domain Self-supervised Transformer for Accelerated Multi-contrast MRI Reconstruction, [Paper]
  • (arXiv 2022.01) TransPPG: Two-stream Transformer for Remote Heart Rate Estimate, [Paper]
  • (arXiv 2022.01) TransBTSV2: Wider Instead of Deeper Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2022.01) Brain Cancer Survival Prediction on Treatment-naïve MRI using Deep Anchor Attention Learning with Vision Transformer, [Paper]
  • (arXiv 2022.02) Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers, [Paper], [Code]
  • (arXiv 2022.02) AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation, [Paper]
  • (arXiv 2022.02) ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification, [Paper]
  • (arXiv 2022.02) A hybrid 2-stage vision transformer for AI-assisted 5 class pathologic diagnosis of gastric endoscopic biopsies, [Paper]
  • (arXiv 2022.02) TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery, [Paper]
  • (arXiv 2022.02) RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification, [Paper]
  • (arXiv 2022.03) Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge, [Paper]
  • (arXiv 2022.03) CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising, [Paper], [Code]
  • (arXiv 2022.03) Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology, [Paper], [Code]
  • (arXiv 2022.03) A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks, [Paper], [Code]
  • (arXiv 2022.03) Tempera: Spatial Transformer Feature Pyramid Network for Cardiac MRI Segmentation, [Paper]
  • (arXiv 2022.03) Contextual Attention Network: Transformer Meets U-Net, [Paper], [Code]
  • (arXiv 2022.03) Characterizing Renal Structures with 3D Block Aggregate Transformers, [Paper]
  • (arXiv 2022.03) Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification, [Paper]
  • (arXiv 2022.03) Active Phase-Encode Selection for Slice-Specific Fast MR Scanning Using a Transformer-Based Deep Reinforcement Learning Framework, [Paper]
  • (arXiv 2022.03) Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4, [Paper]
  • (arXiv 2022.03) SATr: Slice Attention with Transformer for Universal Lesion Detection, [Paper]
  • (arXiv 2022.03) Simulation-Driven Training of Vision Transformers Enabling Metal Segmentation in X-Ray Images, [Paper]
  • (arXiv 2022.03) TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers, [Paper]
  • (arXiv 2022.03) Adaptively Re-weighting Multi-Loss Untrained Transformer for Sparse-View Cone-Beam CT Reconstruction, [Paper]
  • (arXiv 2022.03) Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection, [Paper]
  • (arXiv 2022.03) Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution, [Paper], [Code]
  • (arXiv 2022.03) Cross-Modality High-Frequency Transformer for MR Image Super-Resolution, [Paper]
  • (arXiv 2022.03) CAT-Net: A Cross-Slice Attention Transformer Model for Prostate Zonal Segmentation in MRI, [Paper]
  • (arXiv 2022.04) UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2022.04) Data and Physics Driven Learning Models for Fast MRI -- Fundamentals and Methodologies from CNN, GAN to Attention and Transformers, [Paper]
  • (arXiv 2022.04) CCAT-NET: A Novel Transformer Based Semi-supervised Framework for Covid-19 Lung Lesion Segmentation, [Paper]
  • (arXiv 2022.04) Surface Vision Transformers: Flexible Attention-Based Modelling of Biomedical Surfaces, [Paper], [Code]
  • (arXiv 2022.04) Low-Dose CT Denoising via Sinogram Inner-Structure Transformer, [Paper]
  • (arXiv 2022.04) 3D Shuffle-Mixer: An Efficient Context-Aware Vision Learner of Transformer-MLP Paradigm for Dense Prediction in Medical Volume, [Paper]
  • (arXiv 2022.04) Continual Hippocampus Segmentation with Transformers, [Paper]
  • (arXiv 2022.04) TranSiam: Fusing Multimodal Visual Features Using Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2022.05) Noise-reducing attention cross fusion learning transformer for histological image classification of osteosarcoma, [Paper]
  • (arXiv 2022.05) One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation, [Paper]
  • (arXiv 2022.05) Unsupervised Contrastive Learning based Transformer for Lung Nodule Detection, [Paper]
  • (arXiv 2022.05) Understanding Transfer Learning for Chest Radiograph Clinical Report Generation with Modified Transformer Architectures, [Paper]
  • (arXiv 2022.05) Masked Co-attentional Transformer reconstructs 100x ultra-fast/low-dose whole-body PET from longitudinal images and anatomically guided MRI, [Paper]
  • (arXiv 2022.05) Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction, [Paper]
  • (arXiv 2022.05) A microstructure estimation Transformer inspired by sparse representation for diffusion MRI, [Paper]
  • (arXiv 2022.05) An Effective Transformer-based Solution for RSNA Intracranial Hemorrhage Detection Competition, [Paper],[Code]
  • (arXiv 2022.05) HoVer-Trans: Anatomy-aware HoVer-Transformer for ROI-free Breast Cancer Diagnosis in Ultrasound Images, [Paper]
  • (arXiv 2022.05) ColonFormer: An Efficient Transformer based Method for Colon Polyp Segmentation, [Paper]
  • (arXiv 2022.05) Transformer based multiple instance learning for weakly supervised histopathology image segmentation, [Paper]
  • (arXiv 2022.05) A graph-transformer for whole slide image classification, [Paper]
  • (arXiv 2022.05) BabyNet: Residual Transformer Module for Birth Weight Prediction on Fetal Ultrasound Video, [Paper],[Code]
  • (arXiv 2022.05) Transformer based Generative Adversarial Network for Liver Segmentation, [Paper]
  • (arXiv 2022.05) A Comparative Study of Gastric Histopathology Sub-size Image Classification: from Linear Regression to Visual Transformer, [Paper],[Code]
  • (arXiv 2022.05) Zero-Shot and Few-Shot Learning for Lung Cancer Multi-Label Classification using Vision Transformer, [Paper]
  • (arXiv 2022.06) The Fully Convolutional Transformer for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.06) CellCentroidFormer: Combining Self-attention and Convolution for Cell Detection, [Paper],[Code]
  • (arXiv 2022.06) Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives, [Paper]
  • (arXiv 2022.06) CVM-Cervix: A Hybrid Cervical Pap-Smear Image Classification Framework Using CNN, Visual Transformer and Multilayer Perceptron, [Paper]
  • (arXiv 2022.06) MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet, [Paper],[Code]
  • (arXiv 2022.06) mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation, [Paper],[Code]
  • (arXiv 2022.06) Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation, [Paper]
  • (arXiv 2022.06) Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation, [Paper]
  • (arXiv 2022.06) Siamese Encoder-based Spatial-Temporal Mixer for Growth Trend Prediction of Lung Nodules on CT Scans, [Paper],[Code]
  • (arXiv 2022.06) Transformer-based Personalized Attention Mechanism (PersAM) for Medical Images with Clinical Records, [Paper]
  • (arXiv 2022.06) SwinCheX: Multi-label classification on chest X-ray images with transformers, [Paper]
  • (arXiv 2022.06) RPLHR-CT Dataset and Transformer Baseline for Volumetric Super-Resolution from CT Scans, [Paper],[Code]
  • (arXiv 2022.06) Transformer Lesion Tracker, [Paper],[Code]
  • (arXiv 2022.06) SeATrans: Learning Segmentation-Assisted diagnosis model via Transformer, [Paper]
  • (arXiv 2022.06) K-Space Transformer for Fast MRI Reconstruction with Implicit Representation, [Paper],[Code]
  • (arXiv 2022.06) XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention, [Paper],[Code]
  • (arXiv 2022.06) A Projection-Based K-space Transformer Network for Undersampled Radial MRI Reconstruction with Limited Training Subjects, [Paper]
  • (arXiv 2022.06) Rectify ViT Shortcut Learning by Visual Saliency, [Paper]
  • (arXiv 2022.06) Neural Transformers for Intraductal Papillary Mucosal Neoplasms (IPMN) Classification in MRI images, [Paper]
  • (arXiv 2022.06) Toward Unpaired Multi-modal Medical Image Segmentation via Learning Structured Semantic Consistency, [Paper],[Code]
  • (arXiv 2022.06) TransResU-Net: Transformer based ResU-Net for Real-Time Colonoscopy Polyp Segmentation, [Paper],[Code]
  • (arXiv 2022.06) SVoRT: Iterative Transformer for Slice-to-Volume Registration in Fetal Brain MRI, [Paper],[Code]
  • (arXiv 2022.06) ICOS Protein Expression Segmentation: Can Transformer Networks Give Better Results, [Paper]
  • (arXiv 2022.06) Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification, [Paper],[Code]
  • (arXiv 2022.06) Context-Aware Transformers For Spinal Cancer Detection and Radiological Grading, [Paper]
  • (arXiv 2022.06) The Lighter The Better: Rethinking Transformers in Medical Image Segmentation Through Adaptive Pruning, [Paper],[Code]
  • (arXiv 2022.06) C2FTrans: Coarse-to-Fine Transformers for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.06) LViT: Language meets Vision Transformer in Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.06) PVT-COV19D: Pyramid Vision Transformer for COVID-19 Diagnosis, [Paper]
  • (arXiv 2022.07) Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches, [Paper],[Code]
  • (arXiv 2022.07) Efficient Lung Cancer Image Classification and Segmentation Algorithm Based on Improved Swin Transformer, [Paper]
  • (arXiv 2022.07) Spatiotemporal Feature Learning Based on Two-Step LSTM and Transformer for CT Scans, [Paper]
  • (arXiv 2022.07) Adaptive GLCM sampling for transformer-based COVID-19 detection on CT, [Paper]
  • (arXiv 2022.07) CNN-based Local Vision Transformer for COVID-19 Diagnosis, [Paper]
  • (arXiv 2022.07) Transformer based Models for Unsupervised Anomaly Segmentation in Brain MR Images, [Paper],[Code]
  • (arXiv 2022.07) CASHformer: Cognition Aware SHape Transformer for Longitudinal Analysis, [Paper]
  • (arXiv 2022.07) Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI, [Paper],[Code]
  • (arXiv 2022.07) Multi-Label Retinal Disease Classification using Transformers, [Paper],[Code],[Dataset]
  • (arXiv 2022.07) TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers, [Paper]
  • (arXiv 2022.07) Learning Apparent Diffusion Coefficient Maps from Undersampled Radial k-Space Diffusion-Weighted MRI in Mice using a Deep CNN-Transformer Model in Conjunction with a Monoexponential Model, [Paper]
  • (arXiv 2022.07) TFCNs: A CNN-Transformer Hybrid Network for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.07) Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays, [Paper]
  • (arXiv 2022.07) RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment, [Paper]
  • (arXiv 2022.07) CKD-TransBTS: Clinical Knowledge-Driven Hybrid Transformer with Modality-Correlated Cross-Attention for Brain Tumor Segmentation, [Paper]
  • (arXiv 2022.07) Mobile Keystroke Biometrics Using Transformers, [Paper]
  • (arXiv 2022.07) Multi-head Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction, [Paper]
  • (arXiv 2022.07) HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.07) Focused Decoding Enables 3D Anatomical Detection by Transformers, [Paper],[Code]
  • (arXiv 2022.07) High-Resolution Swin Transformer for Automatic Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.07) Improved Super Resolution of MR Images Using CNNs and Vision Transformers, [Paper],[Code]
  • (arXiv 2022.07) TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model, [Paper],[Code]
  • (arXiv 2022.07) ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.08) TransDeepLab: Convolution-Free Transformer-based DeepLab v3+ for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2022.08) Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis, [Paper],[Code]
  • (arXiv 2022.08) Self-Ensembling Vision Transformer (SEViT) for Robust Medical Image Classification, [Paper],[Code]
  • (arXiv 2022.08) BrainFormer: A Hybrid CNN-Transformer Model for Brain fMRI Data Classification, [Paper],[Code]
  • (arXiv 2022.08) U-Net vs Transformer: Is U-Net Outdated in Medical Image Registration? [Paper],[Code]
  • (arXiv 2022.08) Shifted Windows Transformers for Medical Image Quality Assessment, [Paper],[Code]
  • (arXiv 2022.08) Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification, [Paper],[Code]
  • (arXiv 2022.08) When CNN Meet with ViT: Towards Semi-Supervised Learning for Multi-Class Medical Image Semantic Segmentation, [Paper], [Code]
  • (arXiv 2022.08) Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation, [Paper], [Code]
  • (arXiv 2022.08) FCN-Transformer Feature Fusion for Polyp Segmentation, [Paper], [Code]
  • (arXiv 2022.08) A Medical Semantic-Assisted Transformer for Radiographic Report Generation, [Paper], [Code]
  • (arXiv 2022.08) Multiple Instance Neuroimage Transformer, [Paper], [Code]
  • (arXiv 2022.08) Cats: Complementary CNN and Transformer Encoders for Segmentation, [Paper]
  • (arXiv 2022.08) Accurate and Robust Lesion RECIST Diameter Prediction and Segmentation with Transformers, [Paper]
  • (arXiv 2022.08) SB-SSL: Slice-Based Self-Supervised Transformers for Knee Abnormality Classification from MRI, [Paper]
  • (arXiv 2022.08) NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation, [Paper], [Code]
  • (arXiv 2022.08) ARST: Auto-Regressive Surgical Transformer for Phase Recognition from Laparoscopic Videos, [Paper]
  • (arXiv 2022.09) Time-distance vision transformers in lung cancer diagnosis from longitudinal computed tomography, [Paper], [Code]
  • (arXiv 2022.09) Masked Sinogram Model with Transformer for ill-Posed Computed Tomography Reconstruction: a Preliminary Study, [Paper], [Code]
  • (arXiv 2022.09) Spach Transformer: Spatial and Channel-wise Transformer Based on Local and Global Self-attentions for PET Image Denoising, [Paper]
  • (arXiv 2022.09) View-Disentangled Transformer for Brain Lesion Detection, [Paper], [Code]
  • (arXiv 2022.09) CCTCOVID: COVID-19 Detection from Chest X-Ray Images Using Compact Convolutional Transformers, [Paper]
  • (arXiv 2022.09) Medical Image Captioning via Generative Pretrained Transformers, [Paper]
  • (arXiv 2022.09) UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation, [Paper], [Code]
  • (arXiv 2022.10) 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2022.10) Gastrointestinal Disorder Detection with a Transformer Based Approach, [Paper]
  • (arXiv 2022.10) LAPFormer: A Light and Accurate Polyp Segmentation Transformer, [Paper]
  • (arXiv 2022.10) Memory transformers for full context and high-resolution 3D Medical Segmentation, [Paper]
  • (arXiv 2022.10) ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation, [Paper]
  • (arXiv 2022.10) Brain Network Transformer, [Paper], [Code]
  • (arXiv 2022.10) Wide Range MRI Artifact Removal with Transformers, [Paper]
  • (arXiv 2022.10) Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation, [Paper]
  • (arXiv 2022.10) SimpleClick: Interactive Image Segmentation with Simple Vision Transformers, [Paper]
  • (arXiv 2022.10) Adversarial Transformer for Repairing Human Airway Segmentation, [Paper]
  • (arXiv 2022.10) Clinically-Inspired Multi-Agent Transformers for Disease Trajectory Forecasting from Multimodal Data, [Paper], [Code]
  • (arXiv 2022.10) Automatic Diagnosis of Myocarditis Disease in Cardiac MRI Modality using Deep Transformers and Explainable Artificial Intelligence, [Paper]
  • (arXiv 2022.10) Spatio-Temporal Hybrid Fusion of CAE and SWIn Transformers for Lung Cancer Malignancy Prediction, [Paper]
  • (arXiv 2022.10) Hyper-Connected Transformer Network for Co-Learning Multi-Modality PET-CT Features, [Paper]
  • (arXiv 2022.10) ImplantFormer: Vision Transformer based Implant Position Regression Using Dental CBCT Data, [Paper]
  • (arXiv 2022.10) Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation, [Paper], [Code]
  • (arXiv 2022.10) TFormer: 3D Tooth Segmentation in Mesh Scans with Geometry Guided Transformer, [Paper], [Code]
  • (arXiv 2022.10) ViTASD: Robust Vision Transformer Baselines for Autism Spectrum Disorder Facial Diagnosis, [Paper], [Code]
  • (arXiv 2022.11) ViT-DeiT: An Ensemble Model for Breast Cancer Histopathological Images Classification, [Paper]
  • (arXiv 2022.11) RadFormer: Transformers with Global-Local Attention for Interpretable and Accurate Gallbladder Cancer Detection, [Paper], [Code]
  • (arXiv 2022.11) MultiCrossViT: Multimodal Vision Transformer for Schizophrenia Prediction using Structural MRI and Functional Network Connectivity Data, [Paper]
  • (arXiv 2022.11) ConvFormer: Combining CNN and Transformer for Medical Image Segmentation, [Paper]
  • (arXiv 2022.11) SWIN-SFTNet: Spatial Feature Expansion and Aggregation using Swin Transformer For Whole Breast micro-mass segmentation, [Paper]
  • (arXiv 2022.11) Parameter-Efficient Transformer with Hybrid Axial-Attention for Medical Image Segmentation, [Paper]
  • (arXiv 2022.11) TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis, [Paper]
  • (arXiv 2022.11) Unsupervised Echocardiography Registration through Patch-based MLPs and Transformers, [Paper], [Code]
  • (arXiv 2022.11) Towards Automated Polyp Segmentation Using Weakly- and Semi-Supervised Learning and Deformable Transformers, [Paper]
  • (arXiv 2022.11) Cross-Field Transformer for Diabetic Retinopathy Grading on Two-field Fundus Images, [Paper], [Code]
  • (arXiv 2022.11) Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics, [Paper]
  • (arXiv 2022.12) SLMT-Net: A Self-supervised Learning based Multi-scale Transformer Network for Cross-Modality MR Image Synthesis, [Paper], [Code]
  • (arXiv 2022.12) CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction, [Paper], [Code]
  • (arXiv 2022.12) Two-stage Contextual Transformer-based Convolutional Neural Network for Airway Extraction from CT Images, [Paper], [Code]
  • (arXiv 2022.12) Visual Transformers for Primates Classification and Covid Detection, [Paper]
  • (arXiv 2022.12) Conditioned Generative Transformers for Histopathology Image Synthetic Augmentation, [Paper]
  • (arXiv 2022.12) DuAT: Dual-Aggregation Transformer Network for Medical Image Segmentation, [Paper]
  • (arXiv 2022.12) Transformer and GAN Based Super-Resolution Reconstruction Network for Medical Images, [Paper]
  • (arXiv 2022.12) DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2023.01) A New Perspective to Boost Vision Transformer for Medical Image Classification, [Paper]
  • (arXiv 2023.01) Detecting Severity of Diabetic Retinopathy from Fundus Images using Ensembled Transformers, [Paper]
  • (arXiv 2023.01) MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling, [Paper]
  • (arXiv 2023.01) Cooperation Learning Enhanced Colonic Polyp Segmentation Based on Transformer-CNN Fusion, [Paper]
  • (arXiv 2023.01) ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised Medical Image Representations, [Paper]
  • (arXiv 2023.01) Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study, [Paper]
  • (arXiv 2023.01) MultiNet with Transformers: A Model for Cancer Diagnosis Using Images, [Paper]
  • (arXiv 2023.01) TranSOP: Transformer-based Multimodal Classification for Stroke Treatment Outcome Prediction, [Paper]
  • (arXiv 2023.01) MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer, [Paper], [Code]
  • (arXiv 2023.01) Enhancing Medical Image Segmentation with TransCeption: A Multi-Scale Feature Fusion Approach, [Paper], [Code]
  • (arXiv 2023.02) Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection, [Paper]
  • (arXiv 2023.02) LesionAid: Vision Transformers-based Skin Lesion Generation and Classification, [Paper]
  • (arXiv 2023.02) FCB-SwinV2 Transformer for Polyp Segmentation, [Paper]
  • (arXiv 2023.02) Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs, [Paper], [Code]
  • (arXiv 2023.02) SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images, [Paper]
  • (arXiv 2023.02) Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.02) Bilateral-Fuser: A Novel Multi-cue Fusion Architecture with Anatomical-aware Tokens for Fovea Localization, [Paper]
  • (arXiv 2023.02) MedViT: A Robust Vision Transformer for Generalized Medical Image Classification, [Paper]
  • (arXiv 2023.02) SF2Former: Amyotrophic Lateral Sclerosis Identification From Multi-center MRI Data Using Spatial and Frequency Fusion Transformer,[Paper]
  • (arXiv 2023.02) Magnification Invariant Medical Image Analysis: A Comparison of Convolutional Networks, Vision Transformers, and Token Mixers, [Paper]
  • (arXiv 2023.02) A residual dense vision transformer for medical image super-resolution with segmentation-based perceptual loss fine-tuning, [Paper]
  • (arXiv 2023.02) StudyFormer: Attention-Based and Dynamic Multi View Classifier for X-ray images, [Paper]
  • (arXiv 2023.03) Meta-information-aware Dual-path Transformer for Differential Diagnosis of Multi-type Pancreatic Lesions in Multi-phase CT, [Paper]
  • (arXiv 2023.03) TRUSformer: Improving Prostate Cancer Detection from Micro-Ultrasound Using Attention and Self-Supervision, [Paper],[Code]
  • (arXiv 2023.03) UT-Net: Combining U-Net and Transformer for Joint Optic Disc and Cup Segmentation and Glaucoma Detection, [Paper]
  • (arXiv 2023.03) Generalized Diffusion MRI Denoising and Super-Resolution using Swin Transformers, [Paper],[Code]
  • (arXiv 2023.03) Pretrained ViTs Yield Versatile Representations For Medical Images, [Paper]
  • (arXiv 2023.03) Deformable Cross-Attention Transformer for Medical Image Registration, [Paper]
  • (arXiv 2023.03) Endoscopy Classification Model Using Swin Transformer and Saliency Map, [Paper]
  • (arXiv 2023.03) TransNetR: Transformer-based Residual Network for Polyp Segmentation with Multi-Center Out-of-Distribution Testing, [Paper],[Code]
  • (arXiv 2023.03) Efficiently Training Vision Transformers on Structural MRI Scans for Alzheimer's Disease Detection, [Paper]
  • (arXiv 2023.03) MATIS: Masked-Attention Transformers for Surgical Instrument Segmentation, [Paper]
  • (arXiv 2023.03) SwinVFTR: A Novel Volumetric Feature-learning Transformer for 3D OCT Fluid Segmentation, [Paper]
  • (arXiv 2023.03) MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation, [Paper]
  • (arXiv 2023.03) GNNFormer: A Graph-based Framework for Cytopathology Report Generation, [Paper]
  • (arXiv 2023.03) Shifted-Windows Transformers for the Detection of Cerebral Aneurysms in Microsurgery, [Paper]
  • (arXiv 2023.03) CerviFormer: A Pap-smear based cervical cancer classification method using cross attention and latent transformer, [Paper]
  • (arXiv 2023.03) Convolutions, Transformers, and their Ensembles for the Segmentation of Organs at Risk in Radiation Treatment of Cervical Cancer, [Paper]
  • (arXiv 2023.03) HDformer: A Higher Dimensional Transformer for Diabetes Detection Utilizing Long Range Vascular Signals, [Paper]
  • (arXiv 2023.03) 3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers, [Paper],[Code]
  • (arXiv 2023.03) Vision Transformer-based Model for Severity Quantification of Lung Pneumonia Using Chest X-ray Images, [Paper],[Code]
  • (arXiv 2023.03) Prior-RadGraphFormer: A Prior-Knowledge-Enhanced Transformer for Generating Radiology Graphs from X-Rays, [Paper]
  • (arXiv 2023.03) Few Shot Medical Image Segmentation with Cross Attention Transformer, [Paper]
  • (arXiv 2023.03) D-TrAttUnet: Dual-Decoder Transformer-Based Attention Unet Architecture for Binary and Multi-classes Covid-19 Infection Segmentation, [Paper]
  • (arXiv 2023.03) MoViT: Memorizing Vision Transformers for Medical Image Analysis, [Paper]
  • (arXiv 2023.03) Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation, [Paper]
  • (arXiv 2023.04) Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization, [Paper]
  • (arXiv 2023.04) EPVT: Environment-aware Prompt Vision Transformer for Domain Generalization in Skin Lesion Recognition, [Paper],[Code]
  • (arXiv 2023.04) U-Netmer: U-Net meets Transformer for medical image segmentation, [Paper]
  • (arXiv 2023.04) METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens, [Paper]
  • (arXiv 2023.04) HST-MRF: Heterogeneous Swin Transformer with Multi-Receptive Field for Medical Image Segmentation, [Paper]
  • (arXiv 2023.04) ForamViT-GAN: Exploring New Paradigms in Deep Learning for Micropaleontological Image Analysis, [Paper]
  • (arXiv 2023.04) Towards Evaluating Explanations of Vision Transformers for Medical Imaging, [Paper]
  • (arXiv 2023.04) Cross Attention Transformers for Multi-modal Unsupervised Whole-Body PET Anomaly Detection, [Paper]
  • (arXiv 2023.04) CAD-RADS scoring of coronary CT angiography with Multi-Axis Vision Transformer: a clinically-inspired deep learning pipeline, [Paper]
  • (arXiv 2023.04) Transformer with Selective Shuffled Position Embedding using ROI-Exchange Strategy for Early Detection of Knee Osteoarthritis, [Paper]
  • (arXiv 2023.04) Masked Pre-Training of Transformers for Histology Image Analysis, [Paper],[Code]
  • (arXiv 2023.04) Fibroglandular Tissue Segmentation in Breast MRI using Vision Transformers -- A multi-institutional evaluation, [Paper]
  • (arXiv 2023.04) Cross-Reference Transformer for Few-shot Medical Image Segmentation, [Paper]
  • (arXiv 2023.04) DeformableFormer: Classification of Endoscopic Ultrasound Guided Fine Needle Biopsy in Pancreatic Diseases, [Paper]
  • (arXiv 2023.04) Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification, [Paper]
  • (arXiv 2023.04) Dilated-UNet: A Fast and Accurate Medical Image Segmentation Approach using a Dilated Transformer and U-Net Architecture, [Paper],[Code]
  • (arXiv 2023.04) STM-UNet: An Efficient U-shaped Architecture Based on Swin Transformer and Multi-scale MLP for Medical Image Segmentation, [Paper]
  • (arXiv 2023.05) 3D Brainformer: 3D Fusion Transformer for Brain Tumor Segmentation, [Paper]
  • (arXiv 2023.05) Transformer-based interpretable multi-modal data fusion for skin lesion classification, [Paper]
  • (arXiv 2023.05) Cross-Shaped Windows Transformer with Self-supervised Pretraining for Clinically Significant Prostate Cancer Detection in Bi-parametric MRI, [Paper]
  • (arXiv 2023.05) Transformer-Based Hierarchical Clustering for Brain Network Analysis, [Paper],[Code]
  • (arXiv 2023.05) Brain Tumor Detection using Swin Transformers, [Paper]
  • (arXiv 2023.05) Transformers for CT Reconstruction From Monoplanar and Biplanar Radiographs, [Paper]
  • (arXiv 2023.05) Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers, [Paper]
  • (arXiv 2023.05) MaxViT-UNet: Multi-Axis Attention for Medical Image Segmentation, [Paper]
  • (arXiv 2023.05) LoViT: Long Video Transformer for Surgical Phase Recognition, [Paper]
  • (arXiv 2023.05) CB-HVTNet: A channel-boosted hybrid vision transformer network for lymphocyte assessment in histopathological images, [Paper]
  • (arXiv 2023.05) Multi-resolution Spatiotemporal Enhanced Transformer Denoising with Functional Diffusive GANs for Constructing Brain Effective Connectivity in MCI analysis, [Paper]
  • (arXiv 2023.05) Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery, [Paper],[Code]
  • (arXiv 2023.05) Coordinated Transformer with Position & Sample-aware Central Loss for Anatomical Landmark Detection, [Paper]
  • (arXiv 2023.05) HGT: A Hierarchical GCN-Based Transformer for Multimodal Periprosthetic Joint Infection Diagnosis Using CT Images and Text, [Paper]
  • (arXiv 2023.05) Prompt-based Tuning of Transformer Models for Multi-Center Medical Image Segmentation, [Paper]
  • (arXiv 2023.05) XTransCT: Ultra-Fast Volumetric CT Reconstruction using Two Orthogonal X-Ray Projections via a Transformer Network, [Paper]
  • (arXiv 2023.06) Prediction of Post-Operative Renal and Pulmonary Complication Using Transformers, [Paper]
  • (arXiv 2023.06) A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics, [Paper],[Code]
  • (arXiv 2023.06) A Novel Vision Transformer with Residual in Self-attention for Biomedical Image Classification, [Paper]
  • (arXiv 2023.06) Transformer-based Annotation Bias-aware Medical Image Segmentation, [Paper]
  • (arXiv 2023.06) Inflated 3D Convolution-Transformer for Weakly-supervised Carotid Stenosis Grading with Ultrasound Videos, [Paper]
  • (arXiv 2023.06) CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.06) TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.06) Enhancing COVID-19 Diagnosis through Vision Transformer-Based Analysis of Chest X-ray Images, [Paper]
  • (arXiv 2023.06) TransMRSR: Transformer-based Self-Distilled Generative Prior for Brain MRI Super-Resolution, [Paper],[Code]
  • (arXiv 2023.06) Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction, [Paper],[Code]
  • (arXiv 2023.06) SegT: A Novel Separated Edge-guidance Transformer Network for Polyp Segmentation, [Paper]
  • (arXiv 2023.06) KiUT: Knowledge-injected U-Transformer for Radiology Report Generation, [Paper]
  • (arXiv 2023.06) Concurrent ischemic lesion age estimation and segmentation of CT brain using a Transformer-based network, [Paper]
  • (arXiv 2023.06) CST-YOLO: A Novel Method for Blood Cell Detection Based on Improved YOLOv7 and CNN-Swin Transformer, [Paper],[Code]
  • (arXiv 2023.06) Taming Detection Transformers for Medical Object Detection, [Paper]
  • (arXiv 2023.06) CellViT: Vision Transformers for Precise Cell Segmentation and Classification, [Paper],[Code]
  • (arXiv 2023.06) HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image, [Paper],[Code]
  • (arXiv 2023.07) MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets, [Paper],[Code]
  • (arXiv 2023.07) Multi-Scale Prototypical Transformer for Whole Slide Image Classification, [Paper]
  • (arXiv 2023.07) Pretraining is All You Need: A Multi-Atlas Enhanced Transformer Framework for Autism Spectrum Disorder Classification, [Paper],[Code]
  • (arXiv 2023.07) H-DenseFormer: An Efficient Hybrid Densely Connected Transformer for Multimodal Tumor Segmentation, [Paper],[Code]
  • (arXiv 2023.07) Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer, [Paper]
  • (arXiv 2023.07) Source-Free Open-Set Domain Adaptation for Histopathological Images via Distilling Self-Supervised Vision Transformer, [Paper],[Code]
  • (arXiv 2023.07) Automatic diagnosis of knee osteoarthritis severity using Swin transformer, [Paper]
  • (arXiv 2023.07) Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering, [Paper],[Code]
  • (arXiv 2023.07) SwiFT: Swin 4D fMRI Transformer, [Paper]
  • (arXiv 2023.07) A Hierarchical Transformer Encoder to Improve Entire Neoplasm Segmentation on Whole Slide Image of Hepatocellular Carcinoma, [Paper]
  • (arXiv 2023.07) UGCANet: A Unified Global Context-Aware Transformer-based Network with Feature Alignment for Endoscopic Image Analysis, [Paper]
  • (arXiv 2023.07) RaBiT: An Efficient Transformer using Bidirectional Feature Pyramid Network with Reverse Attention for Colon Polyp Segmentation, [Paper]
  • (arXiv 2023.07) Transformer-based end-to-end classification of variable-length volumetric data, [Paper],[Code]
  • (arXiv 2023.07) TriFormer: A Multi-modal Transformer Framework For Mild Cognitive Impairment Conversion Prediction, [Paper]
  • (arXiv 2023.07) MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis, [Paper],[Code]
  • (arXiv 2023.07) Study of Vision Transformers for Covid-19 Detection from Chest X-rays, [Paper]
  • (arXiv 2023.07) TUNeS: A Temporal U-Net with Self-Attention for Video-based Surgical Phase Recognition, [Paper]
  • (arXiv 2023.07) GLSFormer: Gated Long-Short Sequence Transformer for Step Recognition in Surgical Videos, [Paper]
  • (arXiv 2023.07) Dense Transformer based Enhanced Coding Network for Unsupervised Metal Artifact Reduction, [Paper]
  • (arXiv 2023.07) SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation, [Paper], [Project]
  • (arXiv 2023.07) Pathology-and-genomics Multimodal Transformer for Survival Outcome Prediction, [Paper]
  • (arXiv 2023.07) SCPAT-GAN: Structural Constrained and Pathology Aware Convolutional Transformer-GAN for Virtual Histology Staining of Human Coronary OCT images, [Paper]
  • (arXiv 2023.07) Simulation of Arbitrary Level Contrast Dose in MRI Using an Iterative Global Transformer Model, [Paper]
  • (arXiv 2023.07) AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets, [Paper]
  • (arXiv 2023.07) CoVid-19 Detection leveraging Vision Transformers and Explainable AI, [Paper]
  • (arXiv 2023.08) ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data, [Paper]
  • (arXiv 2023.08) Ensemble Learning with Residual Transformer for Brain Tumor Segmentation, [Paper]
  • (arXiv 2023.08) DINO-CXR: A self supervised method based on vision transformer for chest X-ray classification, [Paper]
  • (arXiv 2023.08) Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network, [Paper]
  • (arXiv 2023.08) IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer, [Paper]
  • (arXiv 2023.08) TriDo-Former: A Triple-Domain Transformer for Direct PET Reconstruction from Low-Dose Sinograms, [Paper]
  • (arXiv 2023.08) From CNN to Transformer: A Review of Medical Image Segmentation Models, [Paper]
  • (arXiv 2023.08) CheXFusion: Effective Fusion of Multi-View Features using Transformers for Long-Tailed Chest X-Ray Classification, [Paper],[Code]
  • (arXiv 2023.08) SDLFormer: A Sparse and Dense Locality-enhanced Transformer for Accelerated MR Image Reconstruction, [Paper],[Code]
  • (arXiv 2023.08) SEDA: Self-Ensembling ViT with Defensive Distillation and Adversarial Training for robust Chest X-rays Classification, [Paper]
  • (arXiv 2023.08) SkinDistilViT: Lightweight Vision Transformer for Skin Lesion Classification, [Paper],[Code]
  • (arXiv 2023.08) Dense Error Map Estimation for MRI-Ultrasound Registration in Brain Tumor Surgery Using Swin UNETR, [Paper]
  • (arXiv 2023.08) Towards Hierarchical Regional Transformer-based Multiple Instance Learning, [Paper]
  • (arXiv 2023.08) ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis, [Paper]
  • (arXiv 2023.08) GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis, [Paper]
  • (arXiv 2023.08) Unlocking Fine-Grained Details with Wavelet-based High-Frequency Enhancement in Transformers, [Paper],[Code]
  • (arXiv 2023.08) CircleFormer: Circular Nuclei Detection in Whole Slide Images with Circle Queries and Attention, [Paper],[Code]
  • (arXiv 2023.08) Towards Optimal Patch Size in Vision Transformers for Tumor Segmentation, [Paper],[Code]
  • (arXiv 2023.09) Interpretable Medical Imagery Diagnosis with Self-Attentive Transformers: A Review of Explainable AI for Health Care, [Paper]
  • (arXiv 2023.09) Beyond Self-Attention: Deformable Large Kernel Attention for Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.09) Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection, [Paper],[Code]
  • (arXiv 2023.09) Leveraging Self-Supervised Vision Transformers for Neural Transfer Function Design, [Paper]
  • (arXiv 2023.09) Multi-dimension unified Swin Transformer for 3D Lesion Segmentation in Multiple Anatomical Locations, [Paper]
  • (arXiv 2023.09) Improving diagnosis and prognosis of lung cancer using vision transformers: A scoping review, [Paper]
  • (arXiv 2023.09) Evaluation Kidney Layer Segmentation on Whole Slide Imaging using Convolutional Neural Networks and Transformers, [Paper]
  • (arXiv 2023.09) 3D Transformer based on deformable patch location for differential diagnosis between Alzheimer's disease and Frontotemporal dementia, [Paper]
  • (arXiv 2023.09) Enhancing Hierarchical Transformers for Whole Brain Segmentation with Intracranial Measurements Integration, [Paper],[Code]
  • (arXiv 2023.09) Phase-Specific Augmented Reality Guidance for Microscopic Cataract Surgery Using Long-Short Spatiotemporal Aggregation Transformer, [Paper]
  • (arXiv 2023.09) Few-Shot Medical Image Segmentation via a Region-enhanced Prototypical Transformer, [Paper],[Code]
  • (arXiv 2023.09) ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.09) UniBrain: Universal Brain MRI Diagnosis with Hierarchical Knowledge-enhanced Pre-training, [Paper]
  • (arXiv 2023.09) SAMUS: Adapting Segment Anything Model for Clinically-Friendly and Generalizable Ultrasound Image Segmentation, [Paper],[Code]
  • (arXiv 2023.09) HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis, [Paper],[Code]
  • (arXiv 2023.09) Cross-Modal Synthesis of Structural MRI and Functional Connectivity Networks via Conditional ViT-GANs, [Paper]
  • (arXiv 2023.09) Image-level supervision and self-training for transformer-based cross-modality tumor segmentation, [Paper]
  • (arXiv 2023.09) MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation, [Paper],[Code]
  • (arXiv 2023.09) Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer, [Paper]
  • (arXiv 2023.09) Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer, [Paper]
  • (arXiv 2023.09) AiAReSeg: Catheter Detection and Segmentation in Interventional Ultrasound using Transformers, [Paper]
  • (arXiv 2023.09) Cross-Modal Transformer GAN: Brain Structural-Functional Deep Fusing Network for Alzheimer's Disease Analysis, [Paper]
  • (arXiv 2023.10) MVC: A Multi-Task Vision Transformer Network for COVID-19 Diagnosis from Chest X-ray Images, [Paper]
  • (arXiv 2023.10) Pubic Symphysis-Fetal Head Segmentation Using Full Transformer with Bi-level Routing Attention, [Paper],[Code]
  • (arXiv 2023.10) RoFormer for Position Aware Multiple Instance Learning in Whole Slide Image Classification, [Paper],[Code]
  • (arXiv 2023.10) Multi-Dimension-Embedding-Aware Modality Fusion Transformer for Psychiatric Disorder Classification, [Paper]
  • (arXiv 2023.10) Swin-Tempo: Temporal-Aware Lung Nodule Detection in CT Scans as Video Sequences Using Swin Transformer-Enhanced UNet, [Paper]
  • (arXiv 2023.10) Blind CT Image Quality Assessment Using DDPM-derived Content and Transformer-based Evaluator, [Paper]
  • (arXiv 2023.10) A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers, [Paper],[Code]
  • (arXiv 2023.10) TransCC: Transformer Network for Coronary Artery CCTA Segmentation, [Paper]
  • (arXiv 2023.10) HydraViT: Adaptive Multi-Branch Transformer for Multi-Label Disease Classification from Chest X-ray Images, [Paper]
  • (arXiv 2023.10) COVID-19 Detection Using Swin Transformer Approach from Computed Tomography Images, [Paper]
  • (arXiv 2023.10) 3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers, [Paper],[Code]
  • (arXiv 2023.10) Faster 3D cardiac CT segmentation with Vision Transformers, [Paper],[Code]
  • (arXiv 2023.10) Tackling Heterogeneity in Medical Federated learning via Vision Transformers, [Paper]
  • (arXiv 2023.10) A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images, [Paper]
  • (arXiv 2023.10) SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation, [Paper]
  • (arXiv 2023.10) Heart Disease Detection using Vision-Based Transformer Models from ECG Images, [Paper]
  • (arXiv 2023.10) Predicting Ovarian Cancer Treatment Response in Histopathology using Hierarchical Vision Transformers and Multiple Instance Learning, [Paper]
  • (arXiv 2023.10) DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation, [Paper]
  • (arXiv 2023.10) Skin Lesion Segmentation Improved by Transformer-based Networks with Inter-scale Dependency Modeling, [Paper],[Code]
  • (arXiv 2023.10) Prompt-based Grouping Transformer for Nucleus Detection and Classification, [Paper]
  • (arXiv 2023.10) Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection, [Paper], [Code]
  • (arXiv 2023.10) Inter-Scale Dependency Modeling for Skin Lesion Segmentation with Transformer-based Networks, [Paper]
  • (arXiv 2023.10) Ophthalmic Biomarker Detection Using Ensembled Vision Transformers, [Paper]
  • (arXiv 2023.10) What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning, [Paper]
  • (arXiv 2023.10) MIST: Medical Image Segmentation Transformer with Convolutional Attention Mixing (CAM) Decoder, [Paper], [Code]
  • (arXiv 2023.10) Muscle volume quantification: guiding transformers with anatomical priors, [Paper]
  • (arXiv 2023.10) fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding, [Paper]
  • (arXiv 2023.11) Hybrid-Fusion Transformer for Multisequence MRI, [Paper]
  • (arXiv 2023.11) Capturing Local and Global Features in Medical Images by Using Ensemble CNN-Transformer, [Paper]
  • (arXiv 2023.11) Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data, [Paper]
  • (arXiv 2023.11) Transformer-based Model for Oral Epithelial Dysplasia Segmentation, [Paper]
  • (arXiv 2023.11) TransReg: Cross-transformer as auto-registration module for multi-view mammogram mass detection, [Paper]
  • (arXiv 2023.11) Automatic Report Generation for Histopathology images using pre-trained Vision Transformers, [Paper]
  • (arXiv 2023.11) SynthEnsemble: A Fusion of CNN, Vision Transformer, and Hybrid Models for Multi-Label Chest X-Ray Classification, [Paper]
  • (arXiv 2023.11) LT-ViT: A Vision Transformer for multi-label Chest X-ray classification, [Paper]
  • (arXiv 2023.11) Swin UNETR++: Advancing Transformer-Based Dense Dose Prediction Towards Fully Automated Radiation Oncology Treatments, [Paper]
  • (arXiv 2023.11) TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction, [Paper]
  • (arXiv 2023.11) MARformer: An Efficient Metal Artifact Reduction Transformer for Dental CBCT Images, [Paper]
  • (arXiv 2023.11) Harnessing Transformers: A Leap Forward in Lung Cancer Image Detection, [Paper]
  • (arXiv 2023.11) Semi-supervised ViT knowledge distillation network with style transfer normalization for colorectal liver metastases survival prediction, [Paper]
  • (arXiv 2023.11) PMP-Swin: Multi-Scale Patch Message Passing Swin Transformer for Retinal Disease Classification, [Paper]
  • (arXiv 2023.11) MGCT: Mutual-Guided Cross-Modality Transformer for Survival Outcome Prediction using Integrative Histopathology-Genomic Features, [Paper]
  • (arXiv 2023.11) Radiology Report Generation Using Transformers Conditioned with Non-imaging Data, [Paper]
  • (arXiv 2023.11) Enhancing Transformer-Based Segmentation for Breast Cancer Diagnosis using Auto-Augmentation and Search Optimisation Techniques, [Paper]
  • (arXiv 2023.11) TSegFormer: 3D Tooth Segmentation in Intraoral Scans with Geometry Guided Transformer, [Paper], [Code]
  • (arXiv 2023.11) Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs, [Paper]
  • (arXiv 2023.12) Brainformer: Modeling MRI Brain Functions to Machine Vision, [Paper]
  • (arXiv 2023.12) Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers, [Paper]
  • (arXiv 2023.12) MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation, [Paper], [Code]
  • (arXiv 2023.12) Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT, [Paper], [Code]
  • (arXiv 2023.12) Predicting Bone Degradation Using Vision Transformer and Synthetic Cellular Microstructures Dataset, [Paper]
  • (arXiv 2023.12) Adjustable Robust Transformer for High Myopia Screening in Optical Coherence Tomography, [Paper],[Code]
  • (arXiv 2023.12) Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images, [Paper],[Code]
  • (arXiv 2023.12) SP-DiffDose: A Conditional Diffusion Model for Radiation Dose Prediction Based on Multi-Scale Fusion of Anatomical Structures, Guided by SwinTransformer and Projector, [Paper]
  • (arXiv 2023.12) Pre-trained Universal Medical Image Transformer, [Paper],[Code]
  • (arXiv 2023.12) Vision Transformer-Based Deep Learning for Histologic Classification of Endometrial Cancer, [Paper]
  • (arXiv 2023.12) Brain Diffuser with Hierarchical Transformer for MCI Causality Analysis, [Paper]
  • (arXiv 2023.12) Glioblastoma Tumor Segmentation using an Ensemble of Vision Transformers, [Paper]
  • (arXiv 2023.12) Hierarchical Vision Transformers for Context-Aware Prostate Cancer Grading in Whole Slide Images, [Paper]
  • (arXiv 2024.01) BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation, [Paper], [Code]
  • (arXiv 2024.01) Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases, [Paper], [Code]
  • (arXiv 2024.01) A novel method to enhance pneumonia detection via a model-level ensembling of CNN and vision transformer, [Paper]
  • (arXiv 2024.01) Vision Transformers and Bi-LSTM for Alzheimer's Disease Diagnosis from 3D MRI, [Paper]
  • (arXiv 2024.01) Derm-T2IM: Harnessing Synthetic Skin Lesion Data via Stable Diffusion Models for Enhanced Skin Disease Classification using ViT and CNN, [Paper]
  • (arXiv 2024.01) Skin Cancer Segmentation and Classification Using Vision Transformer for Automatic Analysis in Dermatoscopy-based Non-invasive Digital System, [Paper]
  • (arXiv 2024.01) Transformer-CNN Fused Architecture for Enhanced Skin Lesion Segmentation, [Paper]
  • (arXiv 2024.01) MedTransformer: Accurate AD Diagnosis for 3D MRI Images through 2D Vision Transformers, [Paper]
  • (arXiv 2024.01) D-STGCNT: A Dense Spatio-Temporal Graph Conv-GRU Network based on transformer for assessment of patient physical rehabilitation, [Paper]

Mesh

  • (arXiv 2022.07) Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers, [Paper], [Code]
  • (arXiv 2022.11) TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer, [Paper]
  • (arXiv 2023.03) GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose, [Paper]
  • (arXiv 2023.03) DDT: A Diffusion-Driven Transformer-based Framework for Human Mesh Recovery from a Video, [Paper]
  • (arXiv 2023.03) POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery, [Paper], [Project]
  • (arXiv 2023.03) One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer, [Paper], [Project]
  • (arXiv 2023.07) MeT: A Graph Transformer for Semantic Segmentation of 3D Meshes, [Paper], [Project]
  • (arXiv 2023.07) 3Deformer: A Common Framework for Image-Guided Mesh Deformation, [Paper], [Project]
  • (arXiv 2023.07) JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery, [Paper], [Code]
  • (arXiv 2023.08) Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos, [Paper], [Code]
  • (arXiv 2023.11) MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers, [Paper], [Project]

Metric learning

  • (arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper],[Code]

Motion

  • (arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]
  • (arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
  • (arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
  • (arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper]
  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]
  • (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]
  • (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]
  • (arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]
  • (arXiv 2022.04) HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE, [Paper]
  • (arXiv 2022.07) TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction, [Paper]
  • (arXiv 2022.08) SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction, [Paper]
  • (arXiv 2022.09) Motion Transformer with Global Intention Localization and Local Movement Refinement, [Paper], [Code]
  • (arXiv 2022.09) NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System, [Paper], [Project]
  • (arXiv 2022.09) Motion Transformer for Unsupervised Image Animation, [Paper], [Code]
  • (arXiv 2022.11) Blur Interpolation Transformer for Real-World Motion from Blur, [Paper]
  • (arXiv 2022.12) Transformer-Based Learned Optimization, [Paper]
  • (arXiv 2023.01) Diagnose Like a Pathologist: Transformer-Enabled Hierarchical Attention-Guided Multiple Instance Learning for Whole Slide Image Classification, [Paper]
  • (arXiv 2023.02) Robust Human Motion Forecasting using Transformer-based Model, [Paper]
  • (arXiv 2023.02) STB-VMM: Swin Transformer Based Video Motion Magnification, [Paper]
  • (arXiv 2023.02) Human MotionFormer: Transferring Human Motions with Vision Transformers, [Paper], [Project]
  • (arXiv 2023.02) Multi-Scale Control Signal-Aware Transformer for Motion Synthesis without Phase, [Paper]
  • (arXiv 2023.03) SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction, [Paper]
  • (arXiv 2023.04) BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation, [Paper], [Code]
  • (arXiv 2023.05) XFormer: Fast and Accurate Monocular 3D Body Capture, [Paper]
  • (arXiv 2023.05) Imitating Task and Motion Planning with Visuomotor Transformers, [Paper], [Code]
  • (arXiv 2023.06) PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction, [Paper]
  • (arXiv 2023.06) ModeT: Learning Deformable Image Registration via Motion Decomposition Transformer, [Paper], [Code]
  • (arXiv 2023.07) TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction, [Paper]
  • (arXiv 2023.08) Joint-Relation Transformer for Multi-Person Motion Prediction, [Paper], [Code]
  • (arXiv 2023.08) A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis, [Paper], [Code]
  • (arXiv 2023.10) Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding, [Paper], [Code]
  • (arXiv 2023.11) Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement, [Paper]
  • (arXiv 2023.12) MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR, [Paper], [Code]
  • (arXiv 2023.12) EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer, [Paper], [Code]
  • (arXiv 2023.12) Sign Language Production with Latent Motion Transformer, [Paper]
  • (arXiv 2024.01) AdvMT: Adversarial Motion Transformer for Long-term Human Motion Prediction, [Paper]

Multi-label

  • (arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper]
  • (arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]
  • (arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]
  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper], [Code]
  • (arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]
  • (arXiv 2022.03) Incomplete Multi-View Multi-Label Learning via Label-Guided Masked View- and Category-Aware Transformers, [Paper]
  • (arXiv 2023.09) Multi-Label Feature Selection Using Adaptive and Transformed Relevance, [Paper]

Multi-task/modal

  • (arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Code]
  • (arXiv 2021.04) MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
  • (arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper]
  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
  • (arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
  • (arXiv 2021.06) Scene Transformer: A Unified Multi-task Model for Behavior Prediction and Planning, [Paper]
  • (arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
  • (arXiv 2021.06) A Transformer-based Cross-modal Fusion Model with Adversarial Training, [Paper]
  • (arXiv 2021.07) Attention Bottlenecks for Multimodal Fusion, [Paper]
  • (arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
  • (arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
  • (arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
  • (arXiv 2021.08) StrucTexT: Structured Text Understanding with Multi-Modal Transformers, [Paper]
  • (arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper], [Code]
  • (arXiv 2021.10) VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing, [Paper]
  • (arXiv 2021.10) Detecting Dementia from Speech and Transcripts using Transformers, [Paper]
  • (arXiv 2021.11) MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition, [Paper]
  • (arXiv 2021.11) VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code1], [Code2]
  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper], [Code]
  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
  • (arXiv 2021.12) VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, [Paper]
  • (arXiv 2021.12) VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper],[Code]
  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper],[Code]
  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper],[Code]
  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper],[Code]
  • (arXiv 2022.01) Robust Self-Supervised Audio-Visual Speech Recognition, [Paper],[Code]
  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
  • (arXiv 2022.01) Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning, [Paper],[Code]
  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper],[Code]
  • (arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper],[Code]
  • (arXiv 2022.01) A Pre-trained Audio-Visual Transformer for Emotion Recognition, [Paper]
  • (arXiv 2022.01) Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition, [Paper]
  • (arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
  • (arXiv 2022.03) DXM-TransFuse U-net: Dual Cross-Modal Transformer Fusion U-net for Automated Nerve Identification, [Paper]
  • (arXiv 2022.03) LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives, [Paper]
  • (arXiv 2022.03) VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer, [Paper],[Project]
  • (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper],[Project]
  • (arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]
  • (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]
  • (arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper],[Project]
  • (arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]
  • (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper],[Project]
  • (arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper],[Project]
  • (arXiv 2022.03) Multimodal Fusion Transformer for Remote Sensing Image Classification, [Paper]
  • (arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper]
  • (arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper],[Project]
  • (arXiv 2022.04) Multi-Task Distributed Learning using Vision Transformer with Random Patch Permutation, [Paper]
  • (arXiv 2022.04) MHMS: Multimodal Hierarchical Multimedia Summarization, [Paper]
  • (arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper]
  • (arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality?, [Paper]
  • (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]
  • (arXiv 2022.04) Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks, [Paper]
  • (arXiv 2022.04) Multimodal Token Fusion for Vision Transformers, [Paper]
  • (arXiv 2022.04) Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval, [Paper], [Code]
  • (arXiv 2022.04) ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer, [Paper]
  • (arXiv 2022.05) Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training, [Paper]
  • (arXiv 2022.05) MulT: An End-to-End Multitask Learning Transformer, [Paper], [Project]
  • (arXiv 2022.05) Training Vision-Language Transformers from Captions Alone, [Paper], [Code]
  • (arXiv 2022.05) GIT: A Generative Image-to-text Transformer for Vision and Language, [Paper]
  • (arXiv 2022.05) Multi-Task Learning with Multi-query Transformer for Dense Prediction, [Paper]
  • (arXiv 2022.06) VL-BEIT: Generative Vision-Language Pretraining, [Paper], [Code]
  • (arXiv 2022.06) AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism, [Paper]
  • (arXiv 2022.06) A Unified Sequence Interface for Vision Tasks, [Paper]
  • (arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos, [Paper]
  • (arXiv 2022.06) Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, [Paper]
  • (arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer Ensemble, [Paper]
  • (arXiv 2022.06) RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval, [Paper]
  • (arXiv 2022.07) You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers, [Paper], [Code]
  • (arXiv 2022.07) Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer, [Paper], [Code]
  • (arXiv 2022.07) Audio-Visual Segmentation, [Paper], [Code]
  • (arXiv 2022.07) FashionViL: Fashion-Focused Vision-and-Language Representation Learning, [Paper], [Code]
  • (arXiv 2022.07) Multimodal Transformer for Automatic 3D Annotation and Object Detection, [Paper], [Code]
  • (arXiv 2022.07) UFO: Unified Feature Optimization, [Paper], [Code]
  • (arXiv 2022.07) An Ensemble Approach for Multiple Emotion Descriptors Estimation Using Multi-task Learning, [Paper]
  • (arXiv 2022.07) STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer, [Paper]
  • (arXiv 2022.08) Multi-Task Transformer with uncertainty modelling for Face Based Affective Computing, [Paper]
  • (arXiv 2022.08) Multi-modal Transformer Path Prediction for Autonomous Vehicle, [Paper]
  • (arXiv 2022.08) Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis, [Paper]
  • (arXiv 2022.08) VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations, [Paper], [Code]
  • (arXiv 2022.08) Flat Multi-modal Interaction Transformer for Named Entity Recognition, [Paper], [Code]
  • (arXiv 2022.08) TFusion: Transformer based N-to-One Multimodal Fusion Block, [Paper]
  • (arXiv 2022.09) Multi-task Swin Transformer for Motion Artifacts Classification and Cardiac Magnetic Resonance Image Segmentation, [Paper]
  • (arXiv 2022.09) TMSS: An End-to-End Transformer-based Multimodal Network for Segmentation and Survival Prediction, [Paper], [Code]
  • (arXiv 2022.09) Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?, [Paper], [Code]
  • (arXiv 2022.09) UniColor: A Unified Framework for Multi-Modal Colorization with Transformer, [Paper], [Code]
  • (arXiv 2022.09) TVLT: Textless Vision-Language Transformer, [Paper], [Code]
  • (arXiv 2022.10) A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers, [Paper]
  • (arXiv 2022.10) Cascaded Multi-Modal Mixing Transformers for Alzheimer's Disease Classification with Incomplete Data, [Paper]
  • (arXiv 2022.10) VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment, [Paper]
  • (arXiv 2022.10) Transformer-based Localization from Embodied Dialog with Large-scale Pre-training, [Paper]
  • (arXiv 2022.10) AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization, [Paper]
  • (arXiv 2022.10) Understanding Embodied Reference with Touch-Line Transformer, [Paper]
  • (arXiv 2022.10) Foundation Transformers, [Paper]
  • (arXiv 2022.10) PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning, [Paper]
  • (arXiv 2022.10) Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention, [Paper], [Code]
  • (arXiv 2022.10) Multi-Source Transformer Architectures for Audiovisual Scene Classification, [Paper]
  • (arXiv 2022.10) Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?, [Paper]
  • (arXiv 2022.10) M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design, [Paper], [Code]
  • (arXiv 2022.10) TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction, [Paper], [Code]
  • (arXiv 2022.10) Multimodal Transformer Distillation for Audio-Visual Synchronization, [Paper]
  • (arXiv 2022.10) Masked Vision-Language Transformer in Fashion, [Paper], [Code]
  • (arXiv 2022.10) Multimodal Transformer for Parallel Concatenated Variational Autoencoders, [Paper]
  • (arXiv 2022.10) RCDPT: Radar-Camera fusion Dense Prediction Transformer, [Paper]
  • (arXiv 2022.11) Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer, [Paper]
  • (arXiv 2022.11) OneFormer: One Transformer to Rule Universal Image Segmentation, [Paper], [Code]
  • (arXiv 2022.11) TransCC: Transformer-based Multiple Illuminant Color Constancy Using Multitask Learning, [Paper]
  • (arXiv 2022.11) Unifying Vision-Language Representation Space with Single-tower Transformer, [Paper]
  • (arXiv 2022.11) Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion, [Paper], [Code]
  • (arXiv 2022.12) Multimodal Vision Transformers with Forced Attention for Behavior Analysis, [Paper]
  • (arXiv 2022.12) Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers, [Paper], [Code]
  • (arXiv 2022.12) Hierarchical multimodal transformers for Multi-Page DocVQA, [Paper]
  • (arXiv 2022.12) Vision Transformers are Parameter-Efficient Audio-Visual Learners, [Paper], [Code]
  • (arXiv 2022.12) Neural Shape Compiler: A Unified Framework for Transforming between Text, Point Cloud, and Program, [Paper]
  • (arXiv 2023.01) Cross Modal Transformer via Coordinates Encoding for 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.01) DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction, [Paper], [Code]
  • (arXiv 2023.01) Multi-scale multi-modal micro-expression recognition algorithm based on transformer, [Paper]
  • (arXiv 2023.01) Logically at Factify 2023: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture, [Paper]
  • (arXiv 2023.01) ViTs for SITS: Vision Transformers for Satellite Image Time Series, [Paper]
  • (arXiv 2023.01) Zorro: the masked multimodal transformer, [Paper]
  • (arXiv 2023.01) Multimodal Event Transformer for Image-guided Story Ending Generation, [Paper]
  • (arXiv 2023.01) UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers, [Paper]
  • (arXiv 2023.02) Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing, [Paper]
  • (arXiv 2023.02) ViM: Vision Middleware for Unified Downstream Transferring, [Paper]
  • (arXiv 2023.03) One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale, [Paper], [Code]
  • (arXiv 2023.03) MAGVLT: Masked Generative Vision-and-Language Transformer, [Paper]
  • (arXiv 2023.03) LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception, [Paper]
  • (arXiv 2023.03) MMFormer: Multimodal Transformer Using Multiscale Self-Attention for Remote Sensing Image Classification, [Paper]
  • (arXiv 2023.04) Longitudinal Multimodal Transformer Integrating Imaging and Latent Clinical Signatures From Routine EHRs for Pulmonary Nodule Classification, [Paper]
  • (arXiv 2023.04) PARFormer: Transformer-based Multi-Task Network for Pedestrian Attribute Recognition, [Paper], [Code]
  • (arXiv 2023.04) AutoTaskFormer: Searching Vision Transformers for Multi-task Learning, [Paper]
  • (arXiv 2023.05) MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer, [Paper], [Code]
  • (arXiv 2023.05) MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis, [Paper], [Project]
  • (arXiv 2023.05) JOINEDTrans: Prior Guided Multi-task Transformer for Joint Optic Disc/Cup Segmentation and Fovea Detection, [Paper], [Project]
  • (arXiv 2023.05) Brain encoding models based on multimodal transformers can transfer across language and vision, [Paper]
  • (arXiv 2023.05) CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers, [Paper], [Code]
  • (arXiv 2023.05) Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts, [Paper], [Code]
  • (arXiv 2023.06) Transformer-based Multi-Modal Learning for Multi Label Remote Sensing Image Classification, [Paper], [Code]
  • (arXiv 2023.06) Energy-Based Models for Cross-Modal Localization using Convolutional Transformers, [Paper]
  • (arXiv 2023.06) Efficient Multi-Task Scene Analysis with RGB-D Transformers, [Paper], [Code]
  • (arXiv 2023.06) ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer, [Paper]
  • (arXiv 2023.07) End-To-End Prediction of Knee Osteoarthritis Progression With Multi-Modal Transformers, [Paper]
  • (arXiv 2023.07) Interactive Image Segmentation with Cross-Modality Vision Transformers, [Paper], [Code]
  • (arXiv 2023.07) TransNuSeg: A Lightweight Multi-Task Transformer for Nuclei Segmentation, [Paper], [Code]
  • (arXiv 2023.07) Meta-Transformer: A Unified Framework for Multimodal Learning, [Paper], [Project]
  • (arXiv 2023.07) ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer, [Paper], [Code]
  • (arXiv 2023.07) Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation, [Paper]
  • (arXiv 2023.07) Prompt Guided Transformer for Multi-Task Dense Prediction, [Paper]
  • (arXiv 2023.07) Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics, [Paper]
  • (arXiv 2023.08) FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving, [Paper]
  • (arXiv 2023.08) Multimodal Neurons in Pretrained Text-Only Transformers, [Paper], [Project]
  • (arXiv 2023.08) A vision transformer-based framework for knowledge transfer from multi-modal to mono-modal lymphoma subtyping models, [Paper]
  • (arXiv 2023.08) 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment, [Paper], [Project]
  • (arXiv 2023.08) Vision Transformer Adapters for Generalizable Multitask Learning, [Paper], [Code]
  • (arXiv 2023.08) UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization, [Paper], [Code]
  • (arXiv 2023.09) Exchanging-based Multimodal Fusion with Transformer, [Paper], [Code]
  • (arXiv 2023.09) Multimodal Transformer for Material Segmentation, [Paper], [Code]
  • (arXiv 2023.09) Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens, [Paper]
  • (arXiv 2023.09) MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer, [Paper], [Code]
  • (arXiv 2023.09) Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation, [Paper]
  • (arXiv 2023.09) Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer, [Paper]
  • (arXiv 2023.10) LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning, [Paper], [Code]
  • (arXiv 2023.10) 3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for Embodied Turn-Taking Prediction, [Paper]
  • (arXiv 2023.10) MMTF-DES: A Fusion of Multimodal Transformer Models for Desire, Emotion, and Sentiment Analysis of Social Media Data, [Paper]
  • (arXiv 2023.11) Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation, [Paper]
  • (arXiv 2023.11) Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization, [Paper]
  • (arXiv 2023.11) PolyMaX: General Dense Prediction with Mask Transformer, [Paper]
  • (arXiv 2023.11) Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain, [Paper]
  • (arXiv 2023.11) Language Grounded QFormer for Efficient Vision Language Understanding, [Paper]
  • (arXiv 2023.11) DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models, [Paper]
  • (arXiv 2023.11) VIT-LENS-2: Gateway to Omni-modal Intelligence, [Paper], [Code](https://github.com/TencentARC/ViT-Lens)
  • (arXiv 2023.11) You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception, [Paper]
  • (arXiv 2023.12) VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation, [Paper]
  • (arXiv 2024.01) Multimodal Informative ViT: Information Aggregation and Distribution for Hyperspectral and LiDAR Classification, [Paper], [Code]
  • (arXiv 2024.01) SeTformer is What You Need for Vision and Language, [Paper]

Multi-view Stereo

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]
  • (arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [Paper], [Code]
  • (arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [Paper], [Code]
  • (arXiv 2022.08) MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo, [Paper]
  • (arXiv 2022.08) A Light Touch Approach to Teaching Transformers Multi-view Geometry, [Paper]
  • (arXiv 2023.03) Implicit Ray-Transformers for Multi-view Remote Sensing Image Segmentation, [Paper]
  • (arXiv 2023.05) CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo, [Paper]
  • (arXiv 2023.10) GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers, [Paper]
  • (arXiv 2023.12) CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer, [Paper], [Code]
  • (arXiv 2023.12) Global Occlusion-Aware Transformer for Robust Stereo Matching, [Paper], [Code]

NAS

  • (CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [Paper], [Code]
  • (arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
  • (arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [Paper], [Code]
  • (arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
  • (arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper]
  • (arXiv 2021.10) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper]
  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]
  • (arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]
  • (arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]
  • (arXiv 2022.06) Neural Prompt Search, [Paper]
  • (arXiv 2022.07) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper], [Code]
  • (arXiv 2022.09) NasHD: Efficient ViT Architecture Performance Ranking using Hyperdimensional Computing, [Paper]
  • (arXiv 2022.11) NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction, [Paper]
  • (arXiv 2023.03) HyT-NAS: Hybrid Transformers Neural Architecture Search for Edge Devices, [Paper], [Code]
  • (arXiv 2023.07) AutoST: Training-free Neural Architecture Search for Spiking Transformers, [Paper]
  • (arXiv 2023.08) TurboViT: Generating Fast Vision Transformers via Generative Architecture Search, [Paper]
  • (arXiv 2023.11) FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer, [Paper], [Code]
  • (arXiv 2023.11) TVT: Training-Free Vision Transformer Search on Tiny Datasets, [Paper]
  • (arXiv 2023.12) Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery, [Paper]

Navigation

  • (ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
  • (arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
  • (arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
  • (arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]
  • (arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Code]
  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
  • (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]
  • (arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper], [Project]
  • (arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]
  • (arXiv 2022.07) Target-Driven Structured Transformer Planner for Vision-Language Navigation, [Paper], [Code]
  • (arXiv 2023.05) ASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation, [Paper]
  • (arXiv 2023.06) ViNT: A Foundation Model for Visual Navigation, [Paper], [Code]
  • (arXiv 2023.07) GridMM: Grid Memory Map for Vision-and-Language Navigation, [Paper], [Code]
  • (arXiv 2023.08) Bird’s-Eye-View Scene Graph for Vision-Language Navigation, [Paper], [Code]
  • (arXiv 2023.08) Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation, [Paper], [Code]
  • (arXiv 2023.11) Navigating Scaling Laws: Accelerating Vision Transformer's Training via Adaptive Strategies, [Paper]

Neural Rendering

  • (arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper], [Code]
  • (arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [Paper]
  • (arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, [Paper], [Code]
  • (arXiv 2022.07) Vision Transformer for NeRF-Based View Synthesis from a Single Input Image, [Paper], [Project]
  • (arXiv 2022.09) NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields, [Paper]
  • (arXiv 2023.03) Single-view Neural Radiance Fields with Depth Teacher, [Paper]
  • (arXiv 2024.01) CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video, [Paper]

OCR

  • (arXiv 2021.04) Handwriting Transformers, [Paper]
  • (arXiv 2021.05) I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition, [Paper]
  • (arXiv 2021.05) Vision Transformer for Fast and Efficient Scene Text Recognition, [Paper]
  • (arXiv 2021.06) DocFormer: End-to-End Transformer for Document Understanding, [Paper]
  • (arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
  • (arXiv 2021.09) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [Paper], [Code]
  • (arXiv 2021.10) Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks, [Paper], [Code]
  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper]
  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
  • (arXiv 2021.12) SPTS: Single-Point Text Spotting, [Paper]
  • (arXiv 2022.02) Arbitrary Shape Text Detection using Transformers, [Paper]
  • (arXiv 2022.03) DiT: Self-supervised Pre-training for Document Image Transformer, [Paper], [Code]
  • (arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]
  • (arXiv 2022.03) SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition, [Paper], [Code]
  • (arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]
  • (arXiv 2022.04) Text Spotting Transformers, [Paper], [Code]
  • (arXiv 2022.05) Arbitrary Shape Text Detection via Boundary Transformer, [Paper], [Code]
  • (arXiv 2022.05) MATrIX - Modality-Aware Transformer for Information eXtraction, [Paper]
  • (arXiv 2022.06) Transformer based Urdu Handwritten Text Optical Character Reader, [Paper]
  • (arXiv 2022.06) SVG Vector Font Generation for Chinese Characters with Transformer, [Paper]
  • (arXiv 2022.07) DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer, [Paper], [Code]
  • (arXiv 2022.07) CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition, [Paper], [Code]
  • (arXiv 2022.08) Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition, [Paper], [Code]
  • (arXiv 2022.08) DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection, [Paper]
  • (arXiv 2022.08) Offline Handwritten Mathematical Recognition using Adversarial Learning and Transformers, [Paper]
  • (arXiv 2022.08) An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics, [Paper]
  • (arXiv 2022.08) TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers, [Paper]
  • (arXiv 2022.09) ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding, [Paper]
  • (arXiv 2022.11) A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions, [Paper]
  • (arXiv 2022.11) Masked Vision-Language Transformers for Scene Text Recognition, [Paper], [Code]
  • (arXiv 2022.11) Pure Transformer with Integrated Experts for Scene Text Recognition, [Paper]
  • (arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting, [Paper]
  • (arXiv 2022.11) Aggregated Text Transformer for Scene Text Detection, [Paper]
  • (arXiv 2023.03) Robust Table Structure Recognition with Dynamic Queries Enhanced Detection Transformer, [Paper]
  • (arXiv 2023.03) MSdocTr-Lite: A Lite Transformer for Full Page Multi-script Handwriting Recognition, [Paper]
  • (arXiv 2023.03) DeepVecFont-v2: Exploiting Transformers to Synthesize Vector Fonts with Higher Quality, [Paper]
  • (arXiv 2023.05) Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer, [Paper]
  • (arXiv 2023.05) Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding, [Paper]
  • (arXiv 2023.05) Quantifying Character Similarity with Vision Transformers, [Paper]
  • (arXiv 2023.05) DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting, [Paper], [Code]
  • (arXiv 2023.06) DocFormerv2: Local Features for Document Understanding, [Paper]
  • (arXiv 2023.06) Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents, [Paper]
  • (arXiv 2023.06) TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision, [Paper]
  • (arXiv 2023.06) Exploring Transformers for On-Line Handwritten Signature Verification, [Paper]
  • (arXiv 2023.07) DocTr: Document Transformer for Structured Information Extraction in Documents, [Paper]
  • (arXiv 2023.07) A Transformer-based Approach for Arabic Offline Handwritten Text Recognition, [Paper]
  • (arXiv 2023.08) ChartDETR: A Multi-shape Detection Network for Visual Chart Recognition, [Paper]
  • (arXiv 2023.08) SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation, [Paper], [Code]
  • (arXiv 2023.08) ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer, [Paper], [Code]
  • (arXiv 2023.08) PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer, [Paper]
  • (arXiv 2023.08) DTrOCR: Decoder-only Transformer for Optical Character Recognition, [Paper]
  • (arXiv 2023.09) Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation, [Paper], [Code]
  • (arXiv 2023.09) ShaDocFormer: A Shadow-attentive Threshold Detector with Cascaded Fusion Refiner for Document Shadow Removal, [Paper]
  • (arXiv 2023.10) DocStormer: Revitalizing Multi-Degraded Colored Document Images to Pristine PDF, [Paper]
  • (arXiv 2023.11) High-Performance Transformers for Table Structure Recognition Need Early Convolutions, [Paper], [Code]
  • (arXiv 2023.11) Vulnerability Analysis of Transformer-based Optical Character Recognition to Adversarial Attacks, [Paper]
  • (arXiv 2023.12) DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization, [Paper], [Code]
  • (arXiv 2024.01) STR-Cert: Robustness Certification for Deep Text Recognition on Deep Learning Pipelines and Vision Transformers, [Paper]
  • (arXiv 2024.01) SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting, [Paper], [Code]

Octree

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper], [Code]
  • (arXiv 2023.03) OcTr: Octree-based Transformer for 3D Object Detection, [Paper]
  • (arXiv 2023.05) OctFormer: Octree-based Transformers for 3D Point Clouds, [Paper], [Code]

Open World

  • (arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper]
  • (arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [Paper]
  • (arXiv 2022.07) Scaling Novel Object Detection with Weakly Supervised Detection Transformers, [Paper]
  • (arXiv 2022.09) Pre-training image-language transformers for open-vocabulary tasks, [Paper]
  • (arXiv 2022.10) Transformer-Based Speech Synthesizer Attribution in an Open Set Scenario, [Paper]
  • (arXiv 2022.12) PROB: Probabilistic Objectness for Open World Object Detection, [Paper], [Code]
  • (arXiv 2022.12) Open World DETR: Transformer based Open World Object Detection, [Paper]
  • (arXiv 2023.01) CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection, [Paper], [Code]
  • (arXiv 2023.03) Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection, [Paper]
  • (arXiv 2023.05) Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers, [Paper]
  • (arXiv 2023.08) SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning, [Paper], [Code]
  • (arXiv 2023.09) Contrastive Feature Masking Open-Vocabulary Vision Transformer, [Paper]
  • (arXiv 2023.09) Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter, [Paper]
  • (arXiv 2023.09) Unsupervised Open-Vocabulary Object Localization in Videos, [Paper], [Code]
  • (arXiv 2023.10) CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.11) Enhancing Novel Object Detection via Cooperative Foundational Models, [Paper], [Code]
  • (arXiv 2023.11) Language-conditioned Detection Transformer, [Paper], [Code]
  • (arXiv 2023.12) Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection, [Paper]
  • (arXiv 2023.12) Boosting Segment Anything Model Towards Open-Vocabulary Learning, [Paper], [Code]
  • (arXiv 2023.12) Open World Object Detection in the Era of Foundation Models, [Paper], [Code]

Optical Flow

  • (arXiv 2022.03) FlowFormer: A Transformer Architecture for Optical Flow, [Paper], [Project]
  • (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]
  • (arXiv 2023.03) FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation, [Paper]
  • (arXiv 2023.04) TransFlow: Transformer as Flow Learner, [Paper]
  • (arXiv 2023.05) SSTM: Spatiotemporal Recurrent Transformers for Multi-frame Optical Flow Estimation, [Paper]
  • (arXiv 2023.06) FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow, [Paper]

Panoptic Segmentation

  • (arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [Paper]
  • (arXiv 2021.09) Panoptic SegFormer, [Paper]
  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
  • (arXiv 2021.10) An End-to-End Trainable Video Panoptic Segmentation Method using Transformers, [Paper]
  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
  • (arXiv 2021.12) PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]
  • (arXiv 2022.05) CONSENT: Context Sensitive Transformer for Bold Words Classification, [Paper]
  • (arXiv 2022.05) CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation, [Paper]
  • (arXiv 2022.07) k-means Mask Transformer, [Paper], [Code]
  • (arXiv 2022.07) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
  • (arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation, [Paper], [Code]
  • (arXiv 2022.10) Time-Space Transformers for Video Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2022.10) Uncertainty-aware LiDAR Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2022.10) A Generalist Framework for Panoptic Segmentation of Images and Videos, [Paper]
  • (arXiv 2022.10) Pointly-Supervised Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2023.03) Position-Guided Point Cloud Panoptic Segmentation Transformer, [Paper], [Code]
  • (arXiv 2023.07) Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning, [Paper]
  • (arXiv 2023.08) LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment, [Paper], [Code]
  • (arXiv 2023.08) PanoSwin: a Pano-style Swin Transformer for Panorama Understanding, [Paper]
  • (arXiv 2023.09) MASK4D: Mask Transformer for 4D Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2023.10) Hierarchical Mask2Former: Panoptic Segmentation of Crops, Weeds and Leaves, [Paper], [Code]
  • (arXiv 2023.11) 4D-Former: Multimodal 4D Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2023.11) MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation, [Paper], [Code]
  • (arXiv 2024.01) 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation, [Paper]
  • (arXiv 2024.01) Scalable 3D Panoptic Segmentation With Superpoint Graph Clustering, [Paper], [Code]

Point Cloud

  • (ICRA'21) NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation, [Paper]
  • (arXiv 2020.12) Point Transformer, [Paper]
  • (arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
  • (arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
  • (arXiv 2021.03) You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module, [Paper], [Code]
  • (arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
  • (arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
  • (arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
  • (arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
  • (arXiv 2021.08) SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer, [Paper], [Code]
  • (arXiv 2021.08) PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
  • (arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
  • (arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
  • (arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper], [Code]
  • (arXiv 2021.09) PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds, [Paper], [Code]
  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper]
  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
  • (arXiv 2021.10) PatchFormer: A Versatile 3D Transformer Based on Patch Attention, [Paper]
  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
  • (arXiv 2021.11) Adaptive Channel Encoding Transformer for Point Cloud Analysis, [Paper], [Code]
  • (arXiv 2021.11) Fast Point Transformer, [Paper]
  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
  • (arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
  • (arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
  • (arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper]
  • (arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]
  • (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]
  • (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]
  • (arXiv 2022.03) Masked Autoencoders for Point Cloud Self-supervised Learning, [Paper]
  • (arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]
  • (arXiv 2022.03) Masked Discrimination for Self-Supervised Learning on Point Clouds, [Paper], [Code]
  • (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]
  • (arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]
  • (arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]
  • (arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]
  • (arXiv 2022.04) HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud, [Paper]
  • (arXiv 2022.04) Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, [Paper], [Code]
  • (arXiv 2022.04) Panoptic-PHNet: Towards Real-Time and High-Precision LiDAR Panoptic Segmentation via Clustering Pseudo Heatmap, [Paper]
  • (arXiv 2022.04) VNT-Net: Rotational Invariant Vector Neuron Transformers, [Paper]
  • (arXiv 2022.05) Towards Model Generalization for Monocular 3D Object Detection, [Paper]
  • (arXiv 2022.05) CompleteDT: Point Cloud Completion with Dense Augment Inference Transformers, [Paper]
  • (arXiv 2022.05) TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving, [Paper], [Code]
  • (arXiv 2022.06) SpikiLi: A Spiking Simulation of LiDAR based Real-time Object Detection for Autonomous Driving, [Paper]
  • (arXiv 2022.06) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons, [Paper]
  • (arXiv 2022.06) PST: Plant Segmentation Transformer Enhanced Phenotyping of MLS Oilseed Rape Point Cloud, [Paper]
  • (arXiv 2022.07) SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer, [Paper], [Code]
  • (arXiv 2022.07) Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter, [Paper], [Code]
  • (arXiv 2022.07) Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds, [Paper]
  • (arXiv 2022.08) Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding, [Paper]
  • (arXiv 2022.08) Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
  • (arXiv 2022.08) PointTree: Transformation-Robust Point Cloud Encoder with Relaxed K-D Trees, [Paper], [Code]
  • (arXiv 2022.08) Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding, [Paper], [Code]
  • (arXiv 2022.09) 3DPCT: 3D Point Cloud Transformer with Dual Self-attention, [Paper]
  • (arXiv 2022.10) Transformers for Object Detection in Large Point Clouds, [Paper]
  • (arXiv 2022.10) Bridged Transformer for Vision and Point Cloud 3D Object Detection, [Paper]
  • (arXiv 2022.10) Introducing Vision Transformer for Alzheimer's Disease classification task with 3D input, [Paper]
  • (arXiv 2022.10) Point Cloud Recognition with Position-to-Structure Attention Transformers, [Paper]
  • (arXiv 2022.10) Point Transformer V2: Grouped Vector Attention and Partition-based Pooling, [Paper], [Code]
  • (arXiv 2022.10) SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds, [Paper]
  • (arXiv 2022.10) LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers, [Paper]
  • (arXiv 2022.10) PSFormer: Point Transformer for 3D Salient Object Detection, [Paper]
  • (arXiv 2022.11) Hyperbolic Cosine Transformer for LiDAR 3D Object Detection, [Paper]
  • (arXiv 2022.11) Completing point cloud from few points by Wasserstein GAN and Transformers, [Paper], [Code]
  • (arXiv 2022.11) PVT3D: Point Voxel Transformers for Place Recognition from Sparse Lidar Scans, [Paper]
  • (arXiv 2022.11) 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers, [Paper]
  • (arXiv 2023.01) AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers, [Paper], [Code]
  • (arXiv 2023.01) Text to Point Cloud Localization with Relation-Enhanced Transformer, [Paper]
  • (arXiv 2023.01) SAT: Size-Aware Transformer for 3D Point Cloud Semantic Segmentation, [Paper]
  • (arXiv 2023.01) DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, [Paper], [Code]
  • (arXiv 2023.01) PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection, [Paper]
  • (arXiv 2023.01) FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer, [Paper]
  • (arXiv 2023.01) Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps, [Paper]
  • (arXiv 2023.02) TR3D: Towards Real-Time Indoor 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.02) ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer, [Paper], [Code]
  • (arXiv 2023.03) Applying Plain Transformers to Real-World Point Clouds, [Paper]
  • (arXiv 2023.03) BPT: Binary Point Cloud Transformer for Place Recognition, [Paper]
  • (arXiv 2023.03) Improving the quality of dental crown using a Transformer-based method, [Paper], [Code]
  • (arXiv 2023.03) Point Cloud Classification Using Content-based Transformer via Clustering in Feature Space, [Paper], [Code]
  • (arXiv 2023.03) Efficient Transformer-based 3D Object Detection with Dynamic Token Halting, [Paper]
  • (arXiv 2023.03) Rotation-Invariant Transformer for Point Cloud Matching, [Paper]
  • (arXiv 2023.03) Quality evaluation of point clouds: a novel no-reference approach using transformer-based architecture, [Paper]
  • (arXiv 2023.03) Spherical Transformer for LiDAR-based 3D Recognition, [Paper], [Code]
  • (arXiv 2023.03) Context-Aware Transformer for 3D Point Cloud Automatic Annotation, [Paper]
  • (arXiv 2023.03) ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised Pointcloud Understanding, [Paper], [Code]
  • (arXiv 2023.03) StarNet: Style-Aware 3D Point Cloud Generation, [Paper]
  • (arXiv 2023.03) Self-positioning Point-based Transformer for Point Cloud Understanding, [Paper], [Code]
  • (arXiv 2023.04) APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding, [Paper]
  • (arXiv 2023.04) PointCAT: Cross-Attention Transformer for point cloud, [Paper]
  • (arXiv 2023.04) Multi-scale Geometry-aware Transformer for 3D Point Cloud Classification, [Paper]
  • (arXiv 2023.04) Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding, [Paper], [Code]
  • (arXiv 2023.04) PCPNet: An Efficient and Semantic-Enhanced Transformer Network for Point Cloud Prediction, [Paper], [Code]
  • (arXiv 2023.04) Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation, [Paper], [Code]
  • (arXiv 2023.05) PU-EdgeFormer: Edge Transformer for Dense Prediction in Point Cloud Upsampling, [Paper], [Code]
  • (arXiv 2023.05) Point Transformer For Coronary Artery Labeling, [Paper]
  • (arXiv 2023.06) Collect-and-Distribute Transformer for 3D Point Cloud Analysis, [Paper], [Code]
  • (arXiv 2023.06) Efficient 3D Semantic Segmentation with Superpoint Transformer, [Paper], [Code]
  • (arXiv 2023.06) A deep dive into explainable self-supervised transformers for point clouds, [Paper], [Code]
  • (arXiv 2023.07) SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator, [Paper], [Code]
  • (arXiv 2023.07) PSGformer: Enhancing 3D Point Cloud Instance Segmentation via Precise Semantic Guidance, [Paper]
  • (arXiv 2023.07) PG-RCNN: Semantic Surface Point Generation for 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.07) Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition, [Paper]
  • (arXiv 2023.07) pCTFusion: Point Convolution-Transformer Fusion with Semantic Aware Loss for Outdoor LiDAR Point Cloud Segmentation, [Paper]
  • (arXiv 2023.08) Self-supervised Learning of Rotation-invariant 3D Point Set Features using Transformer and its Self-distillation, [Paper]
  • (arXiv 2023.09) Weakly Supervised Point Clouds Transformer for 3D Object Detection, [Paper]
  • (arXiv 2023.09) Research on Self-Cross Transformer Model of Point Cloud Change Detection, [Paper]
  • (arXiv 2023.09) Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds, [Paper]
  • (arXiv 2023.10) Uni3D: Exploring Unified 3D Representation at Scale, [Paper], [Code]
  • (arXiv 2023.10) 2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision, [Paper], [Code]
  • (arXiv 2023.11) DeepEMD: A Transformer-based Fast Estimation of the Earth Mover’s Distance, [Paper], [Code]
  • (arXiv 2023.11) OneFormer3D: One Transformer for Unified Point Cloud Segmentation, [Paper]
  • (arXiv 2023.11) CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network, [Paper]
  • (arXiv 2023.12) Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation, [Paper], [Code]
  • (arXiv 2023.12) TULIP: Transformer for Upsampling of LiDAR Point Cloud, [Paper]
  • (arXiv 2023.12) PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection, [Paper], [Code]
  • (arXiv 2023.12) Point Transformer V3: Simpler, Faster, Stronger, [Paper], [Code]
  • (arXiv 2023.12) ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding, [Paper], [Code]
  • (arXiv 2023.12) Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding, [Paper]
  • (arXiv 2024.01) 3D Landmark Detection on Human Point Clouds: A Benchmark and A Dual Cascade Point Transformer Framework, [Paper]
  • (arXiv 2024.01) CascadeV-Det: Cascade Point Voting for 3D Object Detection, [Paper]
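
The point cloud papers above differ widely, but nearly all of them build on self-attention over point tokens. A minimal toy sketch of that core operation (all names, shapes, and random weights here are illustrative, not taken from any listed paper):

```python
import numpy as np

def point_self_attention(points, d_model=8, seed=0):
    """Toy single-head self-attention over a point cloud.

    points: (N, 3) array of xyz coordinates. Each point is linearly
    embedded, then attends to every other point, so a point's output
    feature mixes information from the whole cloud -- the basic idea
    behind point transformers, without any of the papers' refinements.
    """
    rng = np.random.default_rng(seed)
    w_embed = rng.normal(size=(3, d_model))
    w_q = rng.normal(size=(d_model, d_model))
    w_k = rng.normal(size=(d_model, d_model))
    w_v = rng.normal(size=(d_model, d_model))

    x = points @ w_embed                         # (N, d_model) embeddings
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d_model)          # (N, N) attention logits
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ v                              # (N, d_model) mixed features

cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
feats = point_self_attention(cloud)
print(feats.shape)  # (3, 8)
```

Real methods add neighborhood grouping, positional encodings, and hierarchy on top of this; the sketch only shows the attention core they share.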

Pose

  • (arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
  • (arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
  • (arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
  • (arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
  • (arXiv 2021.03) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
  • (arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper]
  • (arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
  • (arXiv 2021.04) TokenPose: Learning Keypoint Tokens for Human Pose Estimation, [Paper]
  • (arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
  • (arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
  • (arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper]
  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Code]
  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper], [Code]
  • (arXiv 2021.12) Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans, [Paper]
  • (arXiv 2021.12) End-to-End Learning of Multi-category 3D Pose and Shape Estimation, [Paper]
  • (arXiv 2022.01) Swin-Pose: Swin Transformer Based Human Pose Estimation, [Paper]
  • (arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
  • (arXiv 2022.02) HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders, [Paper]
  • (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]
  • (arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]
  • (arXiv 2022.04) ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation, [Paper], [Code]
  • (arXiv 2022.05) YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression, [Paper]
  • (arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation, [Paper], [Code]
  • (arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation, [Paper]
  • (arXiv 2022.05) Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation, [Paper]
  • (arXiv 2022.07) OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos, [Paper]
  • (arXiv 2022.08) Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer, [Paper]
  • (arXiv 2022.08) IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation, [Paper]
  • (arXiv 2022.08) Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation, [Paper]
  • (arXiv 2022.08) The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs, [Paper], [Code]
  • (arXiv 2022.08) PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling, [Paper], [Code]
  • (arXiv 2022.08) K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation, [Paper]
  • (arXiv 2022.08) SoMoFormer: Multi-Person Pose Forecasting with Transformers, [Paper], [Code]
  • (arXiv 2022.09) DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation, [Paper]
  • (arXiv 2022.09) PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation, [Paper], [Code]
  • (arXiv 2022.10) Exploiting the Joint Motion Synergy with Fusion Network Based On Transformer for 3D Human Pose Estimation, [Paper]
  • (arXiv 2022.10) Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers, [Paper], [Code]
  • (arXiv 2022.10) Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios, [Paper]
  • (arXiv 2022.10) CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers, [Paper], [Code]
  • (arXiv 2022.10) Video based Object 6D Pose Estimation using Transformers, [Paper], [Code]
  • (arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation, [Paper], [Code]
  • (arXiv 2022.11) MPT: Mesh Pre-Training with Transformers for Human Pose and Mesh Reconstruction, [Paper]
  • (arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation, [Paper], [Code]
  • (arXiv 2023.01) HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.01) A Modular Multi-stage Lightweight Graph Transformer Network for Human Pose and Shape Estimation from 2D Human Pose, [Paper], [Code]
  • (arXiv 2023.02) HDFormer: High-order Directed Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.02) HybrIK-Transformer, [Paper], [Code]
  • (arXiv 2023.02) Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation, [Paper]
  • (arXiv 2023.03) Depth-based 6DoF Object Pose Estimation using Swin Transformer, [Paper], [Code]
  • (arXiv 2023.03) Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting, [Paper]
  • (arXiv 2023.03) Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation, [Paper]
  • (arXiv 2023.03) Human Pose Estimation from Ambiguous Pressure Recordings with Spatio-temporal Masked Transformers, [Paper]
  • (arXiv 2023.03) PoseRAC: Pose Saliency Transformer for Repetitive Action Counting, [Paper], [Code]
  • (arXiv 2023.03) TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation, [Paper]
  • (arXiv 2023.03) PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.04) ConvFormer: Parameter Reduction in Transformer Models for 3D Human Pose Estimation by Leveraging Dynamic Multi-Headed Convolutional Attention, [Paper]
  • (arXiv 2023.04) A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image, [Paper], [Code]
  • (arXiv 2023.05) Poses as Queries: Image-to-LiDAR Map Localization with Transformers, [Paper], [Code]
  • (arXiv 2023.06) Self-supervised Vision Transformers for 3D Pose Estimation of Novel Objects, [Paper], [Code]
  • (arXiv 2023.06) Efficient Vision Transformer for Human Pose Estimation via Patch Selection, [Paper]
  • (arXiv 2023.06) A Dual-Source Attention Transformer for Multi-Person Pose Tracking, [Paper], [Code]
  • (arXiv 2023.06) Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers, [Paper], [Code]
  • (arXiv 2023.06) LPFormer: LiDAR Pose Estimation Transformer with Multi-Task Network, [Paper]
  • (arXiv 2023.07) TransPose: A Transformer-based 6D Object Pose Estimation Network with Depth Refinement, [Paper]
  • (arXiv 2023.07) YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation, [Paper]
  • (arXiv 2023.07) TransNet: Transparent Object Manipulation Through Category-Level Pose Estimation, [Paper], [Code]
  • (arXiv 2023.08) Scene-aware Human Pose Generation using Transformer, [Paper]
  • (arXiv 2023.08) Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation, [Paper], [Code]
  • (arXiv 2023.08) Double-chain Constraints for 3D Human Pose Estimation in Images and Videos, [Paper], [Code]
  • (arXiv 2023.08) Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation, [Paper], [Code1], [Code2]
  • (arXiv 2023.08) EgoPoser: Robust Real-Time Ego-Body Pose Estimation in Large Scenes, [Paper]
  • (arXiv 2023.08) Coarse-to-Fine Multi-Scene Pose Regression with Transformers, [Paper], [Code]
  • (arXiv 2023.08) Two-Stage Violence Detection Using ViTPose and Classification Models at Smart Airports, [Paper], [Code]
  • (arXiv 2023.09) Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.09) ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers, [Paper]
  • (arXiv 2023.10) LEAP: Liberate Sparse-view 3D Modeling from Camera Poses, [Paper], [Code]
  • (arXiv 2023.10) MFOS: Model-Free & One-Shot Object Pose Estimation, [Paper]
  • (arXiv 2023.10) UniPose: Detecting Any Keypoints, [Paper], [Code]
  • (arXiv 2023.10) MoEmo Vision Transformer: Integrating Cross-Attention and Movement Vectors in 3D Pose Estimation for HRI Emotion Detection, [Paper], [Code]
  • (arXiv 2023.10) MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network, [Paper], [Code]
  • (arXiv 2023.10) TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer, [Paper]
  • (arXiv 2023.10) A Spatial-Temporal Transformer based Framework For Human Pose Assessment And Correction in Education Scenarios, [Paper]
  • (arXiv 2023.11) Multiple View Geometry Transformers for 3D Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.11) Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation, [Paper]
  • (arXiv 2023.11) Fingerspelling PoseNet: Enhancing Fingerspelling Translation with Pose-Based Transformer Models, [Paper], [Code]
  • (arXiv 2023.11) HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation, [Paper], [Code]
  • (arXiv 2023.11) SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation, [Paper], [Code]
  • (arXiv 2023.11) Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation, [Paper], [Code]
  • (arXiv 2023.11) PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens, [Paper]
  • (arXiv 2023.12) PoseViNet: Distracted Driver Action Recognition Framework Using Multi-View Pose Estimation and Vision Transformer, [Paper]
  • (arXiv 2023.12) Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction, [Paper]
  • (arXiv 2024.01) 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation, [Paper]
  • (arXiv 2024.01) Towards Real-World Aerial Vision Guidance with Categorical 6D Pose Tracker, [Paper], [Code]
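
Many of the query-based pose estimators above share one mechanism: learnable keypoint tokens cross-attend to image patch features, and a head regresses coordinates from each token. A bare-bones numpy sketch (names, shapes, and random weights are illustrative, not from any one paper):

```python
import numpy as np

def keypoint_queries(patch_feats, num_keypoints=4, seed=0):
    """Toy cross-attention from keypoint query tokens to patch features.

    patch_feats: (P, D) features of image patches. Each of the
    num_keypoints learnable query tokens attends over all patches,
    then a linear head regresses an (x, y) coordinate per token.
    """
    rng = np.random.default_rng(seed)
    p, d = patch_feats.shape
    queries = rng.normal(size=(num_keypoints, d))  # learnable in practice
    w_head = rng.normal(size=(d, 2))               # regress (x, y) per token

    scores = queries @ patch_feats.T / np.sqrt(d)  # (K, P) logits
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    pooled = attn @ patch_feats                    # (K, D) per-keypoint context
    return pooled @ w_head                         # (K, 2) coordinates

patches = np.random.default_rng(1).normal(size=(16, 8))
coords = keypoint_queries(patches)
print(coords.shape)  # (4, 2)
```

The listed papers differ mainly in how the backbone produces `patch_feats` and how the decoder refines these queries across layers.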

Planning

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

Pruning & Quantization

  • (arXiv 2021.04) Visual Transformer Pruning, [Paper]
  • (arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper], [Code]
  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper]
  • (arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
  • (arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]
  • (arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]
  • (arXiv 2022.07) I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference, [Paper]
  • (arXiv 2022.08) Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization, [Paper]
  • (arXiv 2022.09) PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers, [Paper], [Code]
  • (arXiv 2022.10) EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, [Paper]
  • (arXiv 2022.10) SaiT: Sparse Vision Transformers through Adaptive Token Pruning, [Paper]
  • (arXiv 2022.10) Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer, [Paper], [Code]
  • (arXiv 2022.10) oViT: An Accurate Second-Order Pruning Framework for Vision Transformers, [Paper]
  • (arXiv 2022.11) CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, [Paper]
  • (arXiv 2022.11) NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers, [Paper]
  • (arXiv 2022.12) Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis, [Paper], [Code]
  • (arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers, [Paper]
  • (arXiv 2023.02) Oscillation-free Quantization for Low-bit Vision Transformers, [Paper]
  • (arXiv 2023.03) Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction, [Paper], [Code]
  • (arXiv 2023.03) Scaled Quantization for the Vision Transformer, [Paper]
  • (arXiv 2023.03) Towards Accurate Post-Training Quantization for Vision Transformer, [Paper]
  • (arXiv 2023.04) Q-DETR: An Efficient Low-Bit Quantized Detection Transformer, [Paper]
  • (arXiv 2023.04) Attention Map Guided Transformer Pruning for Edge Device, [Paper]
  • (arXiv 2023.05) Patch-wise Mixed-Precision Quantization of Vision Transformer, [Paper]
  • (arXiv 2023.05) Boost Vision Transformer with GPU-Friendly Sparsity and Quantization, [Paper]
  • (arXiv 2023.05) Bi-ViT: Pushing the Limit of Vision Transformer Quantization, [Paper]
  • (arXiv 2023.07) Variation-aware Vision Transformer Quantization, [Paper], [Code]
  • (arXiv 2023.08) Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers, [Paper], [Code]
  • (arXiv 2023.08) Vision Transformer Pruning Via Matrix Decomposition, [Paper]
  • (arXiv 2023.09) Transformer-VQ: Linear-Time Transformers via Vector Quantization, [Paper], [Code]
  • (arXiv 2023.10) LLM-FP4: 4-Bit Floating-Point Quantized Transformers, [Paper], [Code]
  • (arXiv 2023.12) QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers, [Paper]
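
As a baseline for the post-training quantization work listed above, the simplest scheme is symmetric per-tensor int8 min-max quantization; the real methods add calibration data, per-channel scales, reparameterization, and so on. A minimal sketch (not any specific paper's method):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 post-training quantization.

    Maps float weights onto the int8 grid with a single scale chosen
    from the tensor's max absolute value.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # rounding error is at most scale / 2
print(q.dtype, float(err))
```

ViT-specific PTQ papers largely exist because this naive scheme handles the long-tailed activation distributions after softmax and LayerNorm poorly; their contributions are better scale-fitting strategies.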

Recognition

  • (arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [Paper]
  • (arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
  • (arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision?, [Paper]
  • (arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
  • (arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
  • (arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
  • (arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [Paper]
  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
  • (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
  • (arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]
  • (arXiv 2022.04) Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition, [Paper], [Code]
  • (arXiv 2022.07) Forensic License Plate Recognition with Compression-Informed Transformers, [Paper], [Code]
  • (arXiv 2022.08) TSRFormer: Table Structure Recognition with Transformers, [Paper]
  • (arXiv 2022.08) GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, [Paper], [Code]
  • (arXiv 2022.09) SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data, [Paper], [Code]
  • (arXiv 2022.12) Part-guided Relational Transformers for Fine-grained Visual Recognition, [Paper], [Code]
  • (arXiv 2023.02) CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data, [Paper], [Code]
  • (arXiv 2023.02) Rethink Long-tailed Recognition with Vision Transformers, [Paper]
  • (arXiv 2023.04) R2Former: Unified Retrieval and Reranking Transformer for Place Recognition, [Paper], [Code]
  • (arXiv 2023.05) MASK-CNN-Transformer For Real-Time Multi-Label Weather Recognition, [Paper]
  • (arXiv 2023.05) TReR: A Lightweight Transformer Re-Ranking Approach for 3D LiDAR Place Recognition, [Paper]
  • (arXiv 2023.07) Convolutional Transformer for Autonomous Recognition and Grading of Tomatoes Under Various Lighting, Occlusion, and Ripeness Conditions, [Paper]
  • (arXiv 2023.08) M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition, [Paper]
  • (arXiv 2023.09) Parameter-Efficient Long-Tailed Recognition, [Paper], [Code]
  • (arXiv 2023.09) MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily Behavior Recognition in Group Settings, [Paper], [Code]
  • (arXiv 2023.10) ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer, [Paper], [Code]
  • (arXiv 2023.10) FaultSeg Swin-UNETR: Transformer-Based Self-Supervised Pretraining Model for Fault Recognition, [Paper]
  • (arXiv 2023.12) Are Vision Transformers More Data Hungry Than Newborn Visual Systems?, [Paper]

Reconstruction

  • (arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
  • (arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
  • (arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
  • (arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]
  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
  • (arXiv 2021.11) Reference-based Magnetic Resonance Image Reconstruction Using Texture Transformer, [Paper]
  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]
  • (arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]
  • (arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [Paper]
  • (arXiv 2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [Paper]
  • (arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [Paper]
  • (arXiv 2022.08) PlaneFormers: From Sparse View Planes to 3D Reconstruction, [Paper]
  • (arXiv 2023.01) Monocular Scene Reconstruction with 3D SDF Transformers, [Paper], [Project]
  • (arXiv 2023.02) Efficient 3D Object Reconstruction using Visual Transformers, [Paper], [Project]
  • (arXiv 2023.02) UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction, [Paper]
  • (arXiv 2023.03) CryoFormer: Continuous Reconstruction of 3D Structures from Cryo-EM Data using Transformer-based Neural Representations, [Paper], [Project]
  • (arXiv 2023.04) CornerFormer: Boosting Corner Representation for Fine-Grained Structured Reconstruction, [Paper]
  • (arXiv 2023.07) Image Reconstruction using Enhanced Vision Transformer, [Paper]
  • (arXiv 2023.08) Long-Range Grouping Transformer for Multi-View 3D Reconstruction, [Paper], [Code]
  • (arXiv 2023.08) A Transformer-Conditioned Neural Fields Pipeline with Polar Coordinate Representation for Astronomical Radio Interferometric Data Reconstruction, [Paper]
  • (arXiv 2023.09) Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction, [Paper], [Code]
  • (arXiv 2023.10) Sketch2CADScript: 3D Scene Reconstruction from 2D Sketch using Visual Transformer and Rhino Grasshopper, [Paper]
  • (arXiv 2023.10) ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map, [Paper]
  • (arXiv 2023.10) DIAR: Deep Image Alignment and Reconstruction using Swin Transformers, [Paper]
  • (arXiv 2023.12) Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers, [Paper], [Project]
  • (arXiv 2024.01) GridFormer: Point-Grid Transformer for Surface Reconstruction, [Paper], [Code]

Referring

  • (arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper], [Code]
  • (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Code]
  • (arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
  • (arXiv 2023.09) Contrastive Grouping with Transformer for Referring Image Segmentation, [Paper], [Code]

Registration

  • (arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [Paper], [Code]
  • (arXiv 2022.02) A Transformer-based Network for Deformable Medical Image Registration, [Paper]
  • (arXiv 2022.03) Affine Medical Image Registration with Coarse-to-Fine Vision Transformer, [Paper], [Code]
  • (arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [Paper], [Code]
  • (arXiv 2023.03) Spatially-varying Regularization with Conditional Transformer for Unsupervised Image Registration, [Paper]
  • (arXiv 2023.03) RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration, [Paper]
  • (arXiv 2023.07) Non-iterative Coarse-to-fine Transformer Networks for Joint Affine and Deformable Image Registration, [Paper]
  • (arXiv 2023.08) 2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds, [Paper], [Code]
  • (arXiv 2023.08) GeoTransformer: Fast and Robust Point Cloud Registration with Geometric Transformer, [Paper], [Code]
  • (arXiv 2023.10) OAAFormer: Robust and Efficient Point Cloud Registration Through Overlapping-Aware Attention in Transformer, [Paper]
  • (arXiv 2023.12) VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning, [Paper], [Code]
  • (arXiv 2023.12) D3Former: Jointly Learning Repeatable Dense Detectors and Feature-enhanced Descriptors via Saliency-guided Transformer, [Paper]
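
Transformer-based point cloud registration networks above mostly predict correspondences; given matched point pairs, the final rigid alignment is typically the closed-form Kabsch solution. A self-contained sketch of that step (the example rotation and translation are made up for the demo):

```python
import numpy as np

def kabsch(src, dst):
    """Best-fit rotation and translation aligning src to dst (Kabsch).

    src, dst: (N, 3) corresponding points. Returns (R, t) such that
    dst ~= src @ R.T + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    h = (src - mu_s).T @ (dst - mu_d)       # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_d - r @ mu_s
    return r, t

src = np.random.default_rng(0).normal(size=(10, 3))
theta = 0.3
r_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
dst = src @ r_true.T + np.array([1.0, 2.0, 3.0])
r, t = kabsch(src, dst)
print(np.allclose(r, r_true))  # True
```

The learned part of the listed methods is producing reliable `src`/`dst` correspondences in the presence of noise, partial overlap, and outliers; this solver itself is standard.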

Re-identification

  • (arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
  • (arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
  • (arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
  • (arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
  • (arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper]
  • (arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
  • (arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
  • (arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [Paper], [Code]
  • (arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [Paper]
  • (arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper]
  • (arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [Paper]
  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
  • (arXiv 2022.01) Short Range Correlation Transformer for Occluded Person Re-Identification, [Paper]
  • (arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]
  • (arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]
  • (arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [Paper], [Code]
  • (arXiv 2022.09) Uncertainty Aware Multitask Pyramid Vision Transformer For UAV-Based Object Re-Identification, [Paper]
  • (arXiv 2022.11) Sequential Transformer for End-to-End Person Search, [Paper]
  • (arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification, [Paper], [Code]
  • (arXiv 2022.11) Learning Progressive Modality-shared Transformers for Effective Visible-Infrared Person Re-identification, [Paper], [Code]
  • (arXiv 2023.01) Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification, [Paper]
  • (arXiv 2023.02) X-ReID: Cross-Instance Transformer for Identity-Level Person Re-Identification, [Paper]
  • (arXiv 2023.02) DC-Former: Diverse and Compact Transformer for Person Re-Identification, [Paper]
  • (arXiv 2023.03) Feature Completion Transformer for Occluded Person Re-identification, [Paper]
  • (arXiv 2023.03) TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification, [Paper], [Code]
  • (arXiv 2023.04) Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification, [Paper], [Code]
  • (arXiv 2023.08) Part-Aware Transformer for Generalizable Person Re-identification, [Paper], [Code]
  • (arXiv 2023.10) GraFT: Gradual Fusion Transformer for Multimodal Re-Identification, [Paper]
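
Whatever transformer backbone the re-identification papers above use, inference ultimately reduces to embedding retrieval: L2-normalize the query and gallery features, score by cosine similarity, and rank. A minimal sketch with made-up 2-D embeddings:

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery embeddings by cosine similarity to a query.

    query: (D,) embedding; gallery: (N, D) embeddings. Returns the
    gallery indices sorted best-first and their similarities.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity per gallery item
    order = np.argsort(-sims)         # best match first
    return order, sims[order]

gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
order, sims = rank_gallery(np.array([1.0, 0.1]), gallery)
print(order.tolist())  # [0, 2, 1]
```

The papers' contributions lie in how the embeddings are learned (part alignment, occlusion handling, cross-modality sharing); the retrieval step stays the same.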

Remote Sensing

  • (arXiv 2021.07) Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images, [Paper]
  • (arXiv 2022.07) SiamixFormer: A Siamese Transformer Network For Building Detection And Change Detection From Bi-Temporal Remote Sensing Images, [Paper]
  • (arXiv 2022.08) Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model, [Paper], [Code]
  • (arXiv 2022.09) Transfer Learning with Pretrained Remote Sensing Transformers, [Paper], [Code]
  • (arXiv 2022.10) MCTNet: A Multi-Scale CNN-Transformer Network for Change Detection in Optical Remote Sensing Images, [Paper]
  • (arXiv 2022.10) Fully Transformer Network for Change Detection of Remote Sensing Images, [Paper], [Code]
  • (arXiv 2022.12) RCDT: Relational Remote Sensing Change Detection with Transformer, [Paper]
  • (arXiv 2023.04) Remote Sensing Change Detection With Transformers Trained from Scratch, [Paper]
  • (arXiv 2023.06) Lightweight Structure-aware Transformer Network for VHR Remote Sensing Image Change Detection, [Paper]
  • (arXiv 2023.06) CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features, [Paper]
  • (arXiv 2023.06) RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model, [Paper], [Code]
  • (arXiv 2023.07) General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation, [Paper], [Code]
  • (arXiv 2023.07) Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution, [Paper]
  • (arXiv 2023.08) LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake Extraction from Remote Sensing Imagery, [Paper]
  • (arXiv 2023.08) SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection, [Paper], [Code]
  • (arXiv 2023.08) RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework, [Paper], [Code]
  • (arXiv 2023.10) Efficient Remote Sensing Segmentation With Generative Adversarial Transformer, [Paper]
  • (arXiv 2023.10) HeightFormer: A Multilevel Interaction and Image-adaptive Classification-regression Network for Monocular Height Estimation with Aerial Images, [Paper], [Code]
  • (arXiv 2023.10) VcT: Visual change Transformer for Remote Sensing Image Change Detection, [Paper], [Code]
  • (arXiv 2023.10) Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images, [Paper]
  • (arXiv 2023.10) SolarFormer: Multi-scale Transformer for Solar PV Profiling, [Paper]
  • (arXiv 2023.11) CLiSA: A Hierarchical Hybrid Transformer Model using Orthogonal Cross Attention for Satellite Image Cloud Segmentation, [Paper]
  • (arXiv 2023.11) SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints, [Paper], [Code]

Restoration

  • (arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]
  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
  • (arXiv 2021.12) U2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]
  • (arXiv 2022.08) ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer, [Paper], [Code]
  • (arXiv 2022.09) LRT: An Efficient Low-Light Restoration Transformer for Dark Light Field Images, [Paper]
  • (arXiv 2022.09) Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration, [Paper]
  • (arXiv 2022.10) Accurate Image Restoration with Attention Retractable Transformer, [Paper], [Code]
  • (arXiv 2022.11) Cross Aggregation Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2023.01) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration, [Paper]
  • (arXiv 2023.03) Retinal Image Restoration using Transformer and Cycle-Consistent Generative Adversarial Network, [Paper], [Code]
  • (arXiv 2023.03) SANDFORMER: CNN and Transformer under Gated Fusion for Sand Dust Image Restoration, [Paper]
  • (arXiv 2023.04) Burstormer: Burst Image Restoration and Enhancement Transformer, [Paper], [Code]
  • (arXiv 2023.05) RAMiT: Reciprocal Attention Mixing Transformer for Lightweight Image Restoration, [Paper]
  • (arXiv 2023.05) GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions, [Paper]
  • (arXiv 2023.07) On the unreasonable vulnerability of transformers for image restoration – and an easy fix, [Paper]
  • (arXiv 2023.08) Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2023.09) Prompt-based All-in-One Image Restoration using CNNs and Transformer, [Paper], [Code]
  • (arXiv 2023.09) HAT: Hybrid Attention Transformer for Image Restoration, [Paper], [Code]
  • (arXiv 2023.12) ViStripformer: A Token-Efficient Transformer for Versatile Video Restoration, [Paper]

Retrieval

  • (CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
  • (arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
  • (arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
  • (arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [Paper]
  • (arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
  • (arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
  • (arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper], [Code]
  • (arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [Paper]
  • (arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [Paper], [Code]
  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
  • (arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]
  • (arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval, [Paper], [Code]
  • (arXiv 2022.08) EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing, [Paper], [Code]
  • (arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval, [Paper]
  • (arXiv 2022.10) General Image Descriptors for Open World Image Retrieval using ViT CLIP, [Paper]
  • (arXiv 2022.10) Boosting vision transformers for image retrieval, [Paper], [Code]
  • (arXiv 2023.04) STIR: Siamese Transformer for Image Retrieval Postprocessing, [Paper], [Code]
  • (arXiv 2023.08) Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval, [Paper], [Code]
  • (arXiv 2023.10) GMMFormer: Gaussian-Mixture-Model based Transformer for Efficient Partially Relevant Video Retrieval, [Paper]

Robotic

  • (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Code]
  • (arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]
  • (arXiv 2022.07) 3D Part Assembly Generation with Instance Encoded Transformer, [Paper]
  • (arXiv 2022.09) Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation, [Paper], [Project]
  • (arXiv 2022.09) PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training, [Paper]
  • (arXiv 2022.12) RT-1: Robotics Transformer for Real-World Control at Scale, [Paper], [Project]
  • (arXiv 2023.06) RVT: Robotic View Transformer for 3D Object Manipulation, [Paper], [Project]
  • (arXiv 2023.09) AnyOKP: One-Shot and Instance-Aware Object Keypoint Extraction with Pretrained ViT, [Paper]
  • (arXiv 2023.09) PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation, [Paper], [Project]
  • (arXiv 2023.10) Knolling bot: A Transformer-based Approach to Organizing a Messy Table, [Paper]
  • (arXiv 2023.11) M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place, [Paper], [Project]
  • (arXiv 2023.11) FViT-Grasp: Grasping Objects With Using Fast Vision Transformers, [Paper]

Salient Detection

  • (arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
  • (arXiv 2021.04) Visual Saliency Transformer, [Paper]
  • (arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
  • (arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
  • (arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper], [Code]
  • (arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [Paper]
  • (arXiv 2021.12) Transformer-based Network for RGB-D Saliency Detection, [Paper]
  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]
  • (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]
  • (arXiv 2022.03) GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection, [Paper]
  • (arXiv 2022.03) Unsupervised Salient Object Detection with Spectral Cluster Voting, [Paper], [Code]
  • (arXiv 2022.05) SelfReformer: Self-Refined Network with Transformer for Salient Object Detection, [Paper]
  • (arXiv 2022.06) Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection, [Paper]
  • (arXiv 2022.07) TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection, [Paper], [Code]
  • (arXiv 2022.07) Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection, [Paper], [Code]
  • (arXiv 2022.07) SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification, [Paper]
  • (arXiv 2022.07) Panoramic Vision Transformer for Saliency Detection in 360° Videos, [Paper]
  • (arXiv 2023.01) HRTransNet: HRFormer-Driven Two-Modality Salient Object Detection, [Paper], [Code]
  • (arXiv 2023.02) Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection, [Paper]
  • (arXiv 2023.05) Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection, [Paper], [Code]
  • (arXiv 2023.05) Salient Mask-Guided Vision Transformer for Fine-Grained Classification, [Paper]
  • (arXiv 2023.08) Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection, [Paper], [Code]
  • (arXiv 2023.08) Distortion-aware Transformer in 360° Salient Object Detection, [Paper], [Code]
  • (arXiv 2023.09) UniST: Towards Unifying Saliency Transformer for Video Saliency Prediction and Detection, [Paper]
  • (arXiv 2023.09) Salient Object Detection in Optical Remote Sensing Images Driven by Transformer, [Paper], [Code]
  • (arXiv 2023.10) VST++: Efficient and Stronger Visual Saliency Transformer, [Paper], [Code]

Scene

  • (arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
  • (arXiv 2021.05) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
  • (arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
  • (arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
  • (arXiv 2021.07) Spatial-Temporal Transformer for Dynamic Scene Graph Generation, [Paper]
  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper]
  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
  • (arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper], [Code]
  • (arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper]
  • (arXiv 2022.05) ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions, [Paper], [Code]
  • (arXiv 2022.06) Object Scene Representation Transformer, [Paper], [Project]
  • (arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation, [Paper]
  • (arXiv 2022.11) Iterative Scene Graph Generation with Generative Transformers, [Paper]
  • (arXiv 2022.12) SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation, [Paper]
  • (arXiv 2023.03) Transformer-based Image Generation from Scene Graphs, [Paper], [Code]
  • (arXiv 2023.03) Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation, [Paper]
  • (arXiv 2023.03) Learning Similarity between Scene Graphs and Images with Transformers, [Paper]
  • (arXiv 2023.04) RePAST: Relative Pose Attention Scene Representation Transformer, [Paper]
  • (arXiv 2023.05) HSCNet++: Hierarchical Scene Coordinate Classification and Regression for Visual Localization with Transformer, [Paper]
  • (arXiv 2023.05) PanoContext-Former: Panoramic Total Scene Understanding with a Transformer, [Paper]
  • (arXiv 2023.06) InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding, [Paper], [Code]
  • (arXiv 2023.06) ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining, [Paper], [Code]
  • (arXiv 2023.08) Generalized Unbiased Scene Graph Generation, [Paper]
  • (arXiv 2023.08) Vision Relation Transformer for Unbiased Scene Graph Generation, [Paper], [Code]
  • (arXiv 2023.09) RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing, [Paper], [Code]
  • (arXiv 2023.09) Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation, [Paper]
  • (arXiv 2023.10) Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers, [Paper], [Code]
  • (arXiv 2023.11) Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection, [Paper]
  • (arXiv 2023.11) TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding, [Paper], [Code]
  • (arXiv 2023.11) VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation, [Paper]
  • (arXiv 2023.12) Gaussian Grouping: Segment and Edit Anything in 3D Scenes, [Paper], [Code]

Self-supervised Learning

  • (arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper], [Code]
  • (arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
  • (arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper], [Code]
  • (arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper], [Code]
  • (arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
  • (arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
  • (arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
  • (arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper]
  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper], [Code]
  • (arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper], [Code]
  • (arXiv 2022.02) Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut, [Paper], [Project]
  • (arXiv 2022.03) Mugs: A Multi-Granular Self-Supervised Learning Framework, [Paper], [Code]
  • (arXiv 2022.04) A Transformer-Based Contrastive Learning Approach for Few-Shot Sign Language Recognition, [Paper]
  • (arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]
  • (arXiv 2022.04) Self-supervised Vision Transformers for Joint SAR-optical Representation Learning, [Paper]
  • (arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, [Paper]
  • (arXiv 2022.05) Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality, [Paper], [Code]
  • (arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks, [Paper], [Code]
  • (arXiv 2022.05) A Closer Look at Self-supervised Lightweight Vision Transformers, [Paper]
  • (arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer, [Paper], [Code]
  • (arXiv 2022.06) Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning, [Paper], [Code]
  • (arXiv 2022.06) Exploring Feature Self-relation for Self-supervised Transformer, [Paper]
  • (arXiv 2022.06) Position Labels for Self-Supervised Vision Transformer, [Paper]
  • (arXiv 2022.06) Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, [Paper], [Code]
  • (arXiv 2022.06) Patch-level Representation Learning for Self-supervised Vision Transformers, [Paper], [Code]
  • (arXiv 2022.07) Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning, [Paper], [Code]
  • (arXiv 2022.08) Self-Supervised Vision Transformers for Malware Detection, [Paper]
  • (arXiv 2022.09) Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers, [Paper]
  • (arXiv 2022.10) Attention Distillation: self-supervised vision transformer students need more guidance, [Paper], [Code]
  • (arXiv 2022.10) Histopathological Image Classification based on Self-Supervised Vision Transformer and Weak Labels, [Paper], [Code]
  • (arXiv 2022.10) Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers, [Paper], [Code]
  • (arXiv 2022.10) SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading, [Paper], [Code]
  • (arXiv 2022.10) PatchRot: A Self-Supervised Technique for Training Vision Transformers, [Paper], [Code]
  • (arXiv 2022.10) Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer, [Paper]
  • (arXiv 2022.12) Location-Aware Self-Supervised Transformers, [Paper], [Code]
  • (arXiv 2023.02) Real Estate Property Valuation using Self-Supervised Vision Transformers, [Paper]
  • (arXiv 2023.02) Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations, [Paper], [Code]
  • (arXiv 2023.03) ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical Handwritten Documents, [Paper]
  • (arXiv 2023.03) AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+, [Paper], [Code]
  • (arXiv 2023.03) Contrastive Transformer: Contrastive Learning Scheme with Transformer innate Patches, [Paper]
  • (arXiv 2023.04) Token Boosting for Robust Self-Supervised Visual Transformer Pre-training, [Paper]
  • (arXiv 2023.04) MOST: Multiple Object localization with Self-supervised Transformers for object discovery, [Paper]
  • (arXiv 2023.05) LostPaw: Finding Lost Pets using a Contrastive Learning-based Transformer with Visual Input, [Paper]
  • (arXiv 2023.05) What Do Self-Supervised Vision Transformers Learn? [Paper], [Code]
  • (arXiv 2023.06) Improving Visual Prompt Tuning for Self-supervised Vision Transformers, [Paper], [Code]
  • (arXiv 2023.06) DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency, [Paper]
  • (arXiv 2023.07) Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot

About

A paper list of some recent Transformer-based CV works.