dk-liang / Awesome-Visual-Transformer

A collection of papers on transformers for vision. Awesome Transformer with Computer Vision (CV)

Awesome Visual-Transformer

A collection of Transformer papers for computer vision (CV).

If you find an overlooked paper, please open an issue or, preferably, a pull request.

Papers

Transformer original paper

Technical blog

  • [English Blog] Transformers in Vision [Link]
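  • [Chinese Blog] A 30,000-Character Long Read for an Easy Introduction to Vision Transformers (3W字长文带你轻松入门视觉transformer) [Link]
  • [Chinese Blog] Vision Transformer: An Ultra-Detailed Explanation (Principle Analysis + Code Walkthrough) (Vision Transformer 超详细解读) [Link]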

Survey

  • Multimodal learning with transformers: A survey (IEEE TPAMI) [paper] - 2023.05.11
  • A Survey of Visual Transformers [paper] - 2021.11.30
  • Transformers in Vision: A Survey [paper] - 2021.02.22
  • A Survey on Visual Transformer [paper] - 2021.01.30
  • A Survey of Transformers [paper] - 2020.06.09

arXiv papers

  • Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [paper]
  • [FocusedDecoder] Focused Decoding Enables 3D Anatomical Detection by Transformers [paper] [code]
  • [TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [paper] [code]
  • [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
  • BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [paper] [code]
  • [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
  • [MViTv2] Improved Multiscale Vision Transformers for Classification and Detection [paper] [code]
  • DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection [paper] [code]
  • Three things everyone should know about Vision Transformers [paper]
  • [DeiT III] DeiT III: Revenge of the ViT [paper]
  • [DaViT] DaViT: Dual Attention Vision Transformers [paper] [code]
  • [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
  • [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
  • [MaxViT] MaxViT: Multi-Axis Vision Transformer [paper]
  • [V2X-ViT] V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [paper]
  • [MemMC-MAE] Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder [paper] [code]
  • Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection [paper] [code]
  • [VideoMAE] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [paper] [code]
  • PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [paper]
  • ResViT: Residual vision transformers for multi-modal medical image synthesis [paper]
  • [CrossEfficientViT] Combining EfficientNet and Vision Transformers for Video Deepfake Detection [paper] [code]
  • [Discrete ViT] Discrete Representations Strengthen Vision Transformer Robustness [paper]
  • [StyleSwin] StyleSwin: Transformer-based GAN for High-resolution Image Generation [paper] [code]
  • [SReT] Sliced Recursive Transformer [paper] [code]
  • Dynamic Token Normalization Improves Vision Transformer [paper]
  • TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [paper] [code]
  • Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding [paper]
  • [ORViT] Object-Region Video Transformers [paper] [code]
  • Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation [paper] [code]
  • [NViT] NViT: Vision Transformer Compression and Parameter Redistribution [paper]
  • 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning [paper]
  • Adversarial Token Attacks on Vision Transformers [paper]
  • Contextual Transformer Networks for Visual Recognition [paper] [code]
  • [TranSalNet] TranSalNet: Visual saliency prediction using transformers [paper]
  • [MobileViT] MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer [paper]
  • A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition [paper]
  • [3D-Transformer] 3D-Transformer: Molecular Representation with Transformer in 3D Space [paper]
  • [CCTrans] CCTrans: Simplifying and Improving Crowd Counting with Transformer [paper]
  • [UFO-ViT] UFO-ViT: High Performance Linear Vision Transformer without Softmax [paper]
  • Sparse Spatial Transformers for Few-Shot Learning [paper]
  • Vision Transformer Hashing for Image Retrieval [paper]
  • [OH-Former] OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification [paper]
  • [Pix2seq] Pix2seq: A Language Modeling Framework for Object Detection [paper]
  • [CoAtNet] CoAtNet: Marrying Convolution and Attention for All Data Sizes [paper]
  • [LOTR] LOTR: Face Landmark Localization Using Localization Transformer [paper]
  • Transformer-Unet: Raw Image Processing with Unet [paper]
  • [GraFormer] GraFormer: Graph Convolution Transformer for 3D Pose Estimation [paper]
  • [CDTrans] CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation [paper]
  • PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds [paper] [code]
  • Anchor DETR: Query Design for Transformer-Based Detector [paper] [code]
  • [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]
  • [ESRT] Efficient Transformer for Single Image Super-Resolution [paper]
  • [MaskFormer] MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation [paper] [code]
  • [SwinIR] SwinIR: Image Restoration Using Swin Transformer [paper] [code]
  • [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance [paper]
  • Do Vision Transformers See Like Convolutional Neural Networks? [paper]
  • Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net [paper]
  • Light Field Image Super-Resolution with Transformers [paper] [code]
  • Focal Self-attention for Local-Global Interactions in Vision Transformers [paper] [code]
  • Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers [paper] [code]
  • Mobile-Former: Bridging MobileNet and Transformer [paper]
  • [TriTransNet] TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network [paper]
  • [PSViT] PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [paper]
  • Boosting Few-shot Semantic Segmentation with Transformers [paper] [code]
  • Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [paper]
  • Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [paper]
  • [Styleformer] Styleformer: Transformer based Generative Adversarial Networks with Style Vector [paper] [code]
  • [CMT] CMT: Convolutional Neural Networks Meet Vision Transformers [paper]
  • [TransAttUnet] TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation [paper]
  • TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation [paper]
  • [ViTGAN] ViTGAN: Training GANs with Vision Transformers [paper]
  • What Makes for Hierarchical Vision Transformer? [paper]
  • [Trans4Trans] Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World [paper]
  • [FFVT] Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [paper]
  • [TransformerFusion] TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
  • Escaping the Big Data Paradigm with Compact Transformers [paper]
  • Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [paper]
  • [XCiT] XCiT: Cross-Covariance Image Transformers [paper] [code]
  • Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [paper] [code]
  • Video Swin Transformer [paper] [code]
  • [VOLO] VOLO: Vision Outlooker for Visual Recognition [paper] [code]
  • Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images [paper]
  • End-to-end Temporal Action Detection with Transformer [paper] [code]
  • How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [paper]
  • Efficient Self-supervised Vision Transformers for Representation Learning [paper]
  • Space-time Mixing Attention for Video Transformer [paper]
  • Transformed CNNs: recasting pre-trained convolutional layers with self-attention [paper]
  • [CAT] CAT: Cross Attention in Vision Transformer [paper]
  • Scaling Vision Transformers [paper]
  • [DETReg] DETReg: Unsupervised Pretraining with Region Priors for Object Detection [paper] [code]
  • Chasing Sparsity in Vision Transformers: An End-to-End Exploration [paper]
  • [MViT] MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [paper]
  • Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [paper]
  • On Improving Adversarial Transferability of Vision Transformers [paper]
  • Fully Transformer Networks for Semantic Image Segmentation [paper]
  • Visual Transformer for Task-aware Active Learning [paper] [code]
  • Efficient Training of Visual Transformers with Small-Size Datasets [paper]
  • Reveal of Vision Transformers Robustness against Adversarial Attacks [paper]
  • Person Re-Identification with a Locally Aware Transformer [paper]
  • [Refiner] Refiner: Refining Self-attention for Vision Transformers [paper]
  • [ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [paper]
  • Video Instance Segmentation using Inter-Frame Communication Transformers [paper]
  • Transformer in Convolutional Neural Networks [paper] [code]
  • [Uformer] Uformer: A General U-Shaped Transformer for Image Restoration [paper] [code]
  • Patch Slimming for Efficient Vision Transformers [paper]
  • [RegionViT] RegionViT: Regional-to-Local Attention for Vision Transformers [paper]
  • Associating Objects with Transformers for Video Object Segmentation [paper] [code]
  • Few-Shot Segmentation via Cycle-Consistent Transformer [paper]
  • Glance-and-Gaze Vision Transformer [paper] [code]
  • Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers [paper]
  • [DynamicViT] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper] [code]
  • When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [paper] [code]
  • Unsupervised Out-of-Domain Detection via Pre-trained Transformers [paper]
  • [TransMIL] TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
  • [TransVOS] TransVOS: Video Object Segmentation with Transformers [paper]
  • [KVT] KVT: k-NN Attention for Boosting Vision Transformers [paper]
  • [MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens [paper] [code]
  • [SegFormer] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper] [code]
  • [SDNet] SDNet: mutil-branch for single image deraining using swin [paper] [code]
  • [DVT] Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length [paper]
  • [GazeTR] Gaze Estimation using Transformer [paper] [code]
  • Transformer-Based Deep Image Matching for Generalizable Person Re-identification [paper]
  • Less is More: Pay Less Attention in Vision Transformers [paper]
  • [FoveaTer] FoveaTer: Foveated Transformer for Image Classification [paper]
  • [TransDA] Transformer-Based Source-Free Domain Adaptation [paper] [code]
  • An Attention Free Transformer [paper]
  • [PTNet] PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer [paper]
  • [ResT] ResT: An Efficient Transformer for Visual Recognition [paper] [code]
  • [CogView] CogView: Mastering Text-to-Image Generation via Transformers [paper]
  • [NesT] Aggregating Nested Transformers [paper]
  • [TAPG] Temporal Action Proposal Generation with Transformers [paper]
  • Boosting Crowd Counting with Transformers [paper]
  • [COTR] COTR: Convolution in Transformer Network for End to End Polyp Detection [paper]
  • [TransVOD] End-to-End Video Object Detection with Spatial-Temporal Transformers [paper] [code]
  • Intriguing Properties of Vision Transformers [paper] [code]
  • Combining Transformer Generators with Convolutional Discriminators [paper]
  • Rethinking the Design Principles of Robust Vision Transformer [paper]
  • Vision Transformers are Robust Learners [paper] [code]
  • Manipulation Detection in Satellite Images Using Vision Transformer [paper]
  • [Swin-Unet] Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [paper] [code]
  • Self-Supervised Learning with Swin Transformers [paper] [code]
  • [SCTN] SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [paper]
  • [RelationTrack] RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [paper]
  • [VGTR] Visual Grounding with Transformers [paper]
  • [PST] Visual Composite Set Detection Using Part-and-Sum Transformers [paper]
  • [TrTr] TrTr: Visual Tracking with Transformer [paper] [code]
  • [MOTR] MOTR: End-to-End Multiple-Object Tracking with TRansformer [paper] [code]
  • Attention for Image Registration (AiR): an unsupervised Transformer approach [paper]
  • [TransHash] TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [paper]
  • [ISTR] ISTR: End-to-End Instance Segmentation with Transformers [paper] [code]
  • [CAT] CAT: Cross-Attention Transformer for One-Shot Object Detection [paper]
  • [CoSformer] CoSformer: Detecting Co-Salient Object with Transformers [paper]
  • End-to-End Attention-based Image Captioning [paper]
  • [PMTrans] Pyramid Medical Transformer for Medical Image Segmentation [paper]
  • [HandsFormer] HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [paper]
  • [GasHis-Transformer] GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [paper]
  • Emerging Properties in Self-Supervised Vision Transformers [paper]
  • [InTra] Inpainting Transformer for Anomaly Detection [paper]
  • [Twins] Twins: Revisiting Spatial Attention Design in Vision Transformers [paper] [code]
  • [MLMSPT] Point Cloud Learning with Transformer [paper]
  • Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [paper]
  • [ConTNet] ConTNet: Why not use convolution and transformer at the same time? [paper] [code]
  • [DTNet] Dual Transformer for Point Cloud Analysis [paper]
  • Improve Vision Transformers Training by Suppressing Over-smoothing [paper] [code]
  • Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [paper]
  • [M3DeTR] M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [paper] [code]
  • [Skeletor] Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [paper]
  • [FaceT] Learning to Cluster Faces via Transformer [paper]
  • [MViT] Multiscale Vision Transformers [paper] [code]
  • [VATT] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [paper]
  • [So-ViT] So-ViT: Mind Visual Tokens for Vision Transformer [paper] [code]
  • Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [paper] [code]
  • [TransRPPG] TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [paper]
  • [VideoGPT] VideoGPT: Video Generation using VQ-VAE and Transformers [paper]
  • [M2TR] M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [paper]
  • Transformer Transforms Salient Object Detection and Camouflaged Object Detection [paper]
  • [TransCrowd] TransCrowd: Weakly-Supervised Crowd Counting with Transformer [paper] [code]
  • Visual Transformer Pruning [paper]
  • Self-supervised Video Retrieval Transformer Network [paper]
  • Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [paper]
  • [TransGAN] TransGAN: Two Transformers Can Make One Strong GAN [paper] [code]
  • Geometry-Free View Synthesis: Transformers and no 3D Priors [paper] [code]
  • [CoaT] Co-Scale Conv-Attentional Image Transformers [paper] [code]
  • [LocalViT] LocalViT: Bringing Locality to Vision Transformers [paper] [code]
  • [CIT] Cloth Interactive Transformer for Virtual Try-On [paper] [code]
  • Handwriting Transformers [paper]
  • [SiT] SiT: Self-supervised vIsion Transformer [paper] [code]
  • On the Robustness of Vision Transformers to Adversarial Examples [paper]
  • An Empirical Study of Training Self-Supervised Visual Transformers [paper]
  • A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [paper]
  • [AOT-GAN] Aggregated Contextual Transformations for High-Resolution Image Inpainting [paper] [code]
  • Deepfake Detection Scheme Based on Vision Transformer and Distillation [paper]
  • [ATAG] Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [paper]
  • [TubeR] TubeR: Tube-Transformer for Action Detection [paper]
  • [AAformer] AAformer: Auto-Aligned Transformer for Person Re-Identification [paper]
  • [TFill] TFill: Image Completion via a Transformer-Based Architecture [paper]
  • Group-Free 3D Object Detection via Transformers [paper] [code]
  • [STGT] Spatial-Temporal Graph Transformer for Multiple Object Tracking [paper]
  • Going deeper with Image Transformers [paper]
  • [Meta-DETR] Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [paper] [code]
  • [DA-DETR] DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [paper]
  • Robust Facial Expression Recognition with Convolutional Visual Transformers [paper]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
  • Spatiotemporal Transformer for Video-based Person Re-identification [paper]
  • [TransUNet] TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [paper] [code]
  • [CvT] CvT: Introducing Convolutions to Vision Transformers [paper] [code]
  • [TFPose] TFPose: Direct Human Pose Estimation with Transformers [paper]
  • [TransCenter] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [paper]
  • Face Transformer for Recognition [paper]
  • On the Adversarial Robustness of Visual Transformers [paper]
  • Understanding Robustness of Transformers for Image Classification [paper]
  • Lifting Transformer for 3D Human Pose Estimation in Video [paper]
  • [GSA-Net] Global Self-Attention Networks for Image Recognition [paper]
  • High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
  • [DPT] Vision Transformers for Dense Prediction [paper] [code]
  • [TransFG] TransFG: A Transformer Architecture for Fine-grained Recognition [paper]
  • [TimeSformer] Is Space-Time Attention All You Need for Video Understanding? [paper]
  • Multi-view 3D Reconstruction with Transformer [paper]
  • Can Vision Transformers Learn without Natural Images? [paper] [code]
  • End-to-End Trainable Multi-Instance Pose Estimation with Transformers [paper]
  • Instance-level Image Retrieval using Reranking Transformers [paper] [code]
  • [BossNAS] BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [paper] [code]
  • [CeiT] Incorporating Convolution Designs into Visual Transformers [paper]
  • [DeepViT] DeepViT: Towards Deeper Vision Transformer [paper]
  • Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [paper]
  • 3D Human Pose Estimation with Spatial and Temporal Transformers [paper] [code]
  • [UNETR] UNETR: Transformers for 3D Medical Image Segmentation [paper]
  • Scalable Visual Transformers with Hierarchical Pooling [paper]
  • [ConViT] ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [paper]
  • [TransMed] TransMed: Transformers Advance Multi-modal Medical Image Classification [paper]
  • [U-Transformer] U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [paper]
  • [SpecTr] SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [paper] [code]
  • [TransBTS] TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [paper] [code]
  • [SSTN] SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [paper]
  • Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [paper] [code]
  • [CPVT] Do We Really Need Explicit Position Encodings for Vision Transformers? [paper] [code]
  • Deepfake Video Detection Using Convolutional Vision Transformer [paper]
  • Training Vision Transformers for Image Retrieval [paper]
  • [VTN] Video Transformer Network [paper]
  • [BoTNet] Bottleneck Transformers for Visual Recognition [paper]
  • [CPTR] CPTR: Full Transformer Network for Image Captioning [paper]
  • Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [paper] [code]
  • [Trans2Seg] Segmenting Transparent Object in the Wild with Transformer [paper] [code]
  • Investigating the Vision Transformer Model for Image Retrieval Tasks [paper]
  • [Trear] Trear: Transformer-based RGB-D Egocentric Action Recognition [paper]
  • [VisualSparta] VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [paper]
  • [TrackFormer] TrackFormer: Multi-Object Tracking with Transformers [paper]
  • [TAPE] Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [paper]
  • [TRIQ] Transformer for Image Quality Assessment [paper] [code]
  • [TransTrack] TransTrack: Multiple-Object Tracking with Transformer [paper] [code]
  • [DeiT] Training data-efficient image transformers & distillation through attention [paper] [code]
  • [Pointformer] 3D Object Detection with Pointformer [paper]
  • [ViT-FRCNN] Toward Transformer-Based Object Detection [paper]
  • [Taming-transformers] Taming Transformers for High-Resolution Image Synthesis [paper] [code]
  • [SceneFormer] SceneFormer: Indoor Scene Generation with Transformers [paper]
  • [PCT] PCT: Point Cloud Transformer [paper]
  • [PED] DETR for Pedestrian Detection [paper]
  • [C-Tran] General Multi-label Image Classification with Transformers [paper]

2022

TPAMI

  • [P2T] P2T: Pyramid Pooling Transformer for Scene Understanding [paper]

ECCV

  • [X-CLIP] Expanding Language-Image Pretrained Models for General Video Recognition [paper] [code]
  • [TinyViT] TinyViT: Fast Pretraining Distillation for Small Vision Transformers [paper] [code]
  • [FastMETRO] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers [paper] [code]
  • [AiATrack] AiATrack: Attention in Attention for Transformer Visual Tracking [paper] [code]
  • [OSTrack] Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework [paper] [code]
  • [Unicorn] Towards Grand Unification of Object Tracking [paper] [code]
  • [P3AFormer] Tracking Objects as Pixel-wise Distributions [paper] [code]

CVPR

  • [MAE] Masked Autoencoders Are Scalable Vision Learners [paper] [code]
  • CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [paper] [code]
  • Fast Point Transformer [paper]
  • EDTER: Edge Detection With Transformer [paper] [code]
  • Bridged Transformer for Vision and Point Cloud 3D Object Detection [paper]
  • MNSRNet: Multimodal Transformer Network for 3D Surface Super-Resolution [paper]
  • HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening [paper] [code]
  • Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation [paper]
  • MPViT: Multi-Path Vision Transformer for Dense Prediction [paper] [code]
  • A-ViT: Adaptive Tokens for Efficient Vision Transformer [paper]
  • TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [paper] [code]
  • Continual Learning With Lifelong Vision Transformer [paper]
  • Swin Transformer V2: Scaling Up Capacity and Resolution [paper] [code]
  • Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection From Point Clouds [paper] [code]
  • Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation [paper]
  • Human-Object Interaction Detection via Disentangled Transformer [paper]
  • LGT-Net: Indoor Panoramic Room Layout Estimation With Geometry-Aware Transformer Network [paper]
  • Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning [paper]
  • Vision Transformer With Deformable Attention [paper]
  • DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers [paper]
  • [Restormer] Restormer: Efficient Transformer for High-Resolution Image Restoration [paper] [code]
  • [SAM-DETR] Accelerating DETR Convergence via Semantic-Aligned Matching [paper] [code]
  • [BEVT] BEVT: BERT Pretraining of Video Transformers [paper] [code]
  • [MobileFormer] Mobile-Former: Bridging MobileNet and Transformer [paper]
  • [STRM] Spatio-temporal Relation Modeling for Few-shot Action Recognition [paper] [code]
  • [MiniViT] MiniViT: Compressing Vision Transformers with Weight Multiplexing [paper] [code]
  • [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]
  • [DW-ViT] Beyond Fixation: Dynamic Window Visual Transformer [paper] [code]
  • [TokenFusion] Multimodal Token Fusion for Vision Transformers [paper]
  • [CMT] Convolutional Neural Networks Meet Vision Transformers [paper]
  • Fine-tuning Image Transformers using Learnable Memory [paper]
  • [TransMix] Attend to Mix for Vision Transformers [paper] [code]
  • [NomMer] Nominate Synergistic Context in Vision Transformer for Visual Recognition [paper] [code]
  • [SSA] Shunted Self-Attention via Multi-Scale Token Aggregation [paper] [code]
  • [RVT] Towards Robust Vision Transformer [paper] [code]
  • [LVT] Lite Vision Transformer with Enhanced Self-Attention [paper] [code]
  • [StyTr2] StyTr2: Image Style Transfer with Transformers [paper] [code]

WACV

  • Image-Adaptive Hint Generation via Vision Transformer for Outpainting [paper] [code]

ICLR

  • [RelViT] RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [paper] [code]
  • [CrossFormer] CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention [paper] [code]
  • Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning [paper] [code]
  • [DAB-DETR] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [paper] [code]

2021

NeurIPS

  • ProTo: Program-Guided Transformer for Program-Guided Tasks [paper] [code]
  • [Augvit] Augmented Shortcuts for Vision Transformers [paper] [code]
  • [YOLOS] You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection [paper] [code]
  • [CATs] Semantic Correspondence with Transformers [paper] [code]
  • [Moment-DETR] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [paper] [code]
  • Dual-stream Network for Visual Recognition [paper] [code]
  • [Container] Container: Context Aggregation Network [paper] [code]
  • [TNT] Transformer in Transformer [paper] [code]
  • T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression [paper]
  • Long Short-Term Transformer for Online Action Detection [paper]
  • TransformerFusion: Monocular RGB Scene Reconstruction using Transformers [paper]
  • TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification [paper]
  • TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification [paper]
  • Associating Objects with Transformers for Video Object Segmentation [paper]
  • Test-Time Personalization with a Transformer for Human Pose Estimation [paper]
  • Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning [paper]
  • Dynamic Grained Encoder for Vision Transformers [paper]
  • HRFormer: High-Resolution Vision Transformer for Dense Predict [paper]
  • Searching the Search Space of Vision Transformer [paper]
  • Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition [paper]
  • SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers [paper]
  • Do Vision Transformers See Like Convolutional Neural Networks? [paper]
  • Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [paper]
  • Glance-and-Gaze Vision Transformer [paper]
  • MST: Masked Self-Supervised Transformer for Visual Representation [paper]
  • DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper]
  • TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up [paper]
  • Improved Transformer for High-Resolution GANs [paper]
  • All Tokens Matter: Token Labeling for Training Better Vision Transformers [paper]
  • XCiT: Cross-Covariance Image Transformers [paper]
  • Efficient Training of Visual Transformers with Small Datasets [paper]

ICCV

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (Marr Prize) [paper] [code]
  • [ICT] High-Fidelity Pluralistic Image Completion with Transformers [paper] [code]
  • [PoinTr] PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers (oral) [paper] [code]
  • [STTR] Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [paper] [code]
  • [TSP-FCOS] Rethinking Transformer-based Set Prediction for Object Detection [paper]
  • Paint Transformer: Feed Forward Neural Painting with Stroke Prediction (oral) [paper] [code]
  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds [paper]
  • [T2T-ViT] Training Vision Transformers from Scratch on ImageNet [paper] [code]
  • [THUNDR] THUNDR: Transformer-Based 3D Human Reconstruction With Markers [paper]
  • Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [paper]
  • [PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [paper] [code]
  • Spatial-Temporal Transformer for Dynamic Scene Graph Generation [paper]
  • [GLiT] GLiT: Neural Architecture Search for Global and Local Image Transformer [paper]
  • [TRAR] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering [paper]
  • [UniT] UniT: Multimodal Multitask Learning With a Unified Transformer [paper] [code]
  • Stochastic Transformer Networks With Linear Competing Units: Application To End-to-End SL Translation [paper]
  • Transformer-Based Dual Relation Graph for Multi-Label Image Recognition [paper]
  • [LocalTrans] LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [paper]
  • Improving 3D Object Detection With Channel-Wise Transformer [paper]
  • A Latent Transformer for Disentangled Face Editing in Images and Videos [paper] [code]
  • [GroupFormer] GroupFormer: Group Activity Recognition With Clustered Spatial-Temporal Transformer [paper]
  • Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue [paper]
  • [WB-DETR] WB-DETR: Transformer-Based Detector Without Backbone [paper]
  • The Animation Transformer: Visual Correspondence via Segment Matching [paper]
  • Relaxed Transformer Decoders for Direct Action Proposal Generation [paper]
  • [PPT-Net] Pyramid Point Cloud Transformer for Large-Scale Place Recognition [paper] [code]
  • Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images [paper]
  • Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection [paper]
  • Image Harmonization With Transformer [paper] [code]
  • [COTR] COTR: Correspondence Transformer for Matching Across Images [paper]
  • [MUSIQ] MUSIQ: Multi-Scale Image Quality Transformer [paper]
  • Episodic Transformer for Vision-and-Language Navigation [paper]
  • [CrackFormer] CrackFormer: Transformer Network for Fine-Grained Crack Detection [paper]
  • [HiT] HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval [paper]
  • Event-Based Video Reconstruction Using Transformer [paper]
  • [STVGBert] STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding [paper]
  • [HiFT] HiFT: Hierarchical Feature Transformer for Aerial Tracking [paper] [code]
  • [DocFormer] DocFormer: End-to-End Transformer for Document Understanding [paper]
  • [LeViT] LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [paper] [code]
  • [SignBERT] SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition [paper]
  • [VidTr] VidTr: Video Transformer Without Convolutions [paper]
  • [ACTOR] Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [paper]
  • [Segmenter] Segmenter: Transformer for Semantic Segmentation [paper] [code]
  • [Visformer] Visformer: The Vision-friendly Transformer [paper] [code]
  • [PnP-DETR] PnP-DETR: Towards Efficient Visual Analysis with Transformers (ICCV) [paper] [code]
  • [VoTr] Voxel Transformer for 3D Object Detection [paper]
  • [TransVG] TransVG: End-to-End Visual Grounding with Transformers [paper]
  • [3DETR] An End-to-End Transformer Model for 3D Object Detection [paper] [code]
  • [Eformer] Eformer: Edge Enhancement based Transformer for Medical Image Denoising [paper]
  • [TransFER] TransFER: Learning Relation-aware Facial Expression Representations with Transformers [paper]
  • [Oriented RCNN] Oriented Object Detection with Transformer [paper]
  • [ViViT] ViViT: A Video Vision Transformer [paper]
  • [Stark] Learning Spatio-Temporal Transformer for Visual Tracking [paper] [code]
  • [CT3D] Improving 3D Object Detection with Channel-wise Transformer [paper]
  • [VST] Visual Saliency Transformer [paper]
  • [PiT] Rethinking Spatial Dimensions of Vision Transformers [paper] [code]
  • [CrossViT] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [paper] [code]
  • [PointTransformer] Point Transformer [paper]
  • [TS-CAM] TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [paper] [code]
  • [VTs] Visual Transformers: Token-based Image Representation and Processing for Computer Vision [paper]
  • [TransDepth] Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction [paper] [code]
  • [Conditional DETR] Conditional DETR for Fast Training Convergence [paper] [code]
  • [PIT] PIT: Position-Invariant Transform for Cross-FoV Domain Adaptation [paper] [code]
  • [SOTR] SOTR: Segmenting Objects with Transformers [paper] [code]
  • [SnowflakeNet] SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer [paper] [code]
  • [TransPose] TransPose: Keypoint Localization via Transformer [paper] [code]
  • [TransReID] TransReID: Transformer-based Object Re-Identification [paper] [code]
  • [CWT] Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer [paper] [code]
  • Anticipative Video Transformer [paper] [code]
  • Rethinking and Improving Relative Position Encoding for Vision Transformer [paper] [code]
  • Vision Transformer with Progressive Sampling [paper] [code]
  • [SMCA] Fast Convergence of DETR with Spatially Modulated Co-Attention [paper] [code]
  • [AutoFormer] AutoFormer: Searching Transformers for Visual Recognition [paper] [code]

CVPR

  • Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer [paper]
  • [HOTR] HOTR: End-to-End Human-Object Interaction Detection with Transformers (oral) [paper]
  • [METRO] End-to-End Human Pose and Mesh Reconstruction with Transformers [paper]
  • [LETR] Line Segment Detection Using Transformers without Edges [paper]
  • [TransFuser] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving [paper] [code]
  • Pose Recognition with Cascade Transformers [paper]
  • Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [paper]
  • [LoFTR] LoFTR: Detector-Free Local Feature Matching with Transformers [paper] [code]
  • Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [paper]
  • [SETR] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [paper] [code]
  • [TransT] Transformer Tracking [paper] [code]
  • Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (oral) [paper]
  • [VisTR] End-to-End Video Instance Segmentation with Transformers [paper]
  • Transformer Interpretability Beyond Attention Visualization [paper] [code]
  • [IPT] Pre-Trained Image Processing Transformer [paper]
  • [UP-DETR] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [paper]
  • [IQT] Perceptual Image Quality Assessment with Transformers (workshop) [paper]
  • High-Resolution Complex Scene Synthesis with Transformers (workshop) [paper]
  • [CoFormer] Collaborative Transformers for Grounded Situation Recognition [paper] [code]

ICML

  • Generative Video Transformer: Can Objects be the Words? [paper]
  • [GANsformer] Generative Adversarial Transformers [paper] [code]

ICRA

  • [NDT-Transformer] NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation [paper]

ICLR

  • [VTNet] VTNet: Visual Transformer Network for Object Goal Navigation [paper]
  • [Vision Transformer] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [paper] [code]
  • [Deformable DETR] Deformable DETR: Deformable Transformers for End-to-End Object Detection [paper] [code]
  • [LambdaNetworks] LambdaNetworks: Modeling Long-Range Interactions Without Attention [paper] [code]

ACM MM

  • Video Transformer for Deepfake Detection with Incremental Learning [paper]
  • [HAT] HAT: Hierarchical Aggregation Transformers for Person Re-identification [paper]
  • Token Shift Transformer for Video Classification [paper] [code]
  • [DPT] DPT: Deformable Patch-based Transformer for Visual Recognition [paper] [code]

MICCAI

  • [UTNet] UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [paper] [code]
  • [MedT] Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [paper] [code]
  • [MCTrans] Multi-Compound Transformer for Accurate Biomedical Image Segmentation [paper] [code]
  • [PNS-Net] Progressively Normalized Self-Attention Network for Video Polyp Segmentation [paper] [code]
  • [MBT-Net] A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation [paper]

BMVC

  • [ACT] End-to-End Object Detection with Adaptive Clustering Transformer [paper]
  • [GSRTR] Grounded Situation Recognition with Transformers [paper] [code]
  • [TransFusion] TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation [paper] [code]

ISIE

  • VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (ISIE) [paper]

CORL

  • [DETR3D] DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [paper]

IJCAI

  • Medical Image Segmentation using Squeeze-and-Expansion Transformers [paper]

IROS

  • [YOGO] You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module (IROS) [paper] [code]
  • [PTT] PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds [paper] [code]

WACV

  • [LSTR] End-to-end Lane Shape Prediction with Transformers [paper] [code]

ICDAR

  • Vision Transformer for Fast and Efficient Scene Text Recognition [paper]

2020

  • [DETR] End-to-End Object Detection with Transformers (ECCV) [paper] [code]
  • [FPT] Feature Pyramid Transformer (CVPR) [paper] [code]

Other resource

Acknowledgement

Thanks to Awesome-Crowd-Counting for the template.
