Transformer-in-Vision

A list of recent Transformer-based computer vision papers. If you notice a paper that is missing, please open an issue or a pull request.

Last updated: 2021/10/07

Update log

2021/April - updated all recent Transformer-in-Vision papers.
2021/May - updated all recent Transformer-in-Vision papers.
2021/June - updated all recent Transformer-in-Vision papers.
2021/July - updated all recent Transformer-in-Vision papers.
2021/August - updated all recent Transformer-in-Vision papers.
2021/September - updated all recent Transformer-in-Vision papers.

Survey

(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training. [Paper]
(arXiv 2021.03) Multi-modal Motion Prediction with Stacked Transformers. [Paper], [Code]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision. [Paper]
(arXiv 2020.09) Efficient Transformers: A Survey. [Paper]
(arXiv 2021.01) Transformers in Vision: A Survey. [Paper]

Recent Papers

Action

(CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
(arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
(arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
(arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]
(arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]
(arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]
(arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]
(arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]
(arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]
(arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]
(arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]
(arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper], [Code]
(arXiv 2021.10) Lightweight Transformer in Federated Setting for Human Activity Recognition, [Paper]

Active Learning

(arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

Anomaly Detection

(arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]
(arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

Assessment

(arXiv 2021.01) Transformer for Image Quality Assessment, [Paper], [Code]
(arXiv 2021.04) Perceptual Image Quality Assessment with Transformers, [Paper], [Code]
(arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]
(arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper], [Code]
(arXiv 2021.10) VTAMIQ: Transformers for Attention Modulated Image Quality Assessment, [Paper]

Captioning

(arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
(arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
(arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]
(arXiv 2021.06) Semi-Autoregressive Transformer for Image Captioning, [Paper], [Code]
(arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]
(arXiv 2021.08) Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning, [Paper], [Code]
(arXiv 2021.09) Bornon: Bengali Image Captioning with Transformer-based Deep learning approach, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.10) Geometry Attention Transformer with Position-aware LSTMs for Image Captioning, [Paper]

Classification (Backbone)

(ICLR'21) Modeling Long-Range Interactions Without Attention, [Paper], [Code]
(CVPR'20) Feature Pyramid Transformer, [Paper], [Code]
(ICLR'21) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
(arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
(arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper]
(arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper], [Code]
(arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
(arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.02) Conditional Positional Encodings for Vision Transformers, [Paper], [Code]
(arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]
(arXiv 2021.03) Transformer in Transformer, [Paper], [Code]
(arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]
(arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]
(arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]
(arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]
(arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]
(arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]
(arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]
(arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Going deeper with Image Transformers, [Paper]
(arXiv 2021.04) LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, [Paper]
(arXiv 2021.04) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]
(arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]
(arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]
(arXiv 2021.04) Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]
(arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]
(arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]
(arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]
(arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time, [Paper], [Code]
(arXiv 2021.05) Rethinking the Design Principles of Robust Vision Transformer, [Paper], [Code]
(arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]
(arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper], [Code]
(arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]
(arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Aggregating Nested Transformers, [Paper]
(arXiv 2021.05) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]
(arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]
(arXiv 2021.06) Container: Context Aggregation Network, [Paper]
(arXiv 2021.06) TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification, [Paper]
(arXiv 2021.06) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]
(arXiv 2021.06) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]
(arXiv 2021.06) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]
(arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper]
(arXiv 2021.06) FoveaTer: Foveated Transformer for Image Classification, [Paper]
(arXiv 2021.06) An Attention Free Transformer, [Paper]
(arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]
(arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]
(arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]
(arXiv 2021.06) Scaling Vision Transformers, [Paper]
(arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]
(arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]
(arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]
(arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]
(arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]
(arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]
(arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]
(arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper]
(arXiv 2021.06) Reveal of Vision Transformers Robustness against Adversarial Attacks, [Paper]
(arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]
(arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]
(arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper]
(arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]
(arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper], [Code]
(arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code1], [Code2]
(arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]
(arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos, [Paper]
(arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper], [Code]
(arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Project]
(arXiv 2021.06) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]
(arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]
(arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
(arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]
(arXiv 2021.07) Augmented Shortcuts for Vision Transformers, [Paper]
(arXiv 2021.07) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]
(arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]
(arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]
(arXiv 2021.07) Cross-view Geo-localization with Evolving Transformer, [Paper]
(arXiv 2021.07) What Makes for Hierarchical Vision Transformer, [Paper]
(arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]
(arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]
(arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]
(arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]
(arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]
(arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]
(arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]
(arXiv 2021.07) A Comparison of Deep Learning Classification Methods on Small-scale Image Data set: from Convolutional Neural Networks to Visual Transformers, [Paper]
(arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]
(arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
(arXiv 2021.08) CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention, [Paper], [Code]
(arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]
(arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]
(arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]
(arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable, [Paper]
(arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]
(arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers, [Paper], [Code]
(arXiv 2021.09) Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, [Paper]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.10) MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]

Completion

(arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]
(arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

Crowd Counting

(arXiv 2021.04) TransCrowd: Weakly-Supervised Crowd Counting with Transformer, [Paper], [Code]
(arXiv 2021.05) Boosting Crowd Counting with Transformers, [Paper], [Code]
(arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper], [Code]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]

Depth

(arXiv 2020.11) Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]
(arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]
(arXiv 2021.09) Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning, [Paper]

Deepfake Detection

(arXiv 2021.02) Deepfake Video Detection Using Convolutional Vision Transformer, [Paper]
(arXiv 2021.04) Deepfake Detection Scheme Based on Vision Transformer and Distillation, [Paper]
(arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]
(arXiv 2021.07) Combining EfficientNet and Vision Transformers for Video Deepfake Detection, [Paper]
(arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]

Dehazing

(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

Detection

(ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
(ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
(CVPR'21) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper], [Code]
(arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]
(arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
(arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
(arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
(arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
(arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
(arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]
(arXiv 2021.03) SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving, [Paper]
(arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]
(arXiv 2021.03) TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]
(arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]
(arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]
(arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]
(arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]
(arXiv 2021.05) Content-Augmented Feature Pyramid Network with Light Linear Transformers, [Paper]
(arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper]
(arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Project]
(arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]
(arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]
(arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
(arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper]
(arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]
(arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper], [Code]
(arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]
(arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper], [Code]
(arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper], [Code]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios, [Paper]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]

Face

(arXiv 2021.03) Face Transformer for Recognition, [Paper]
(arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]
(arXiv 2021.04) TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection, [Paper]
(arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]
(arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]
(arXiv 2021.06) VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots, [Paper]
(arXiv 2021.06) MViT: Mask Vision Transformer for Facial Expression Recognition in the wild, [Paper]
(arXiv 2021.06) Shuffle Transformer with Feature Alignment for Video Face Parsing, [Paper]
(arXiv 2021.06) A Latent Transformer for Disentangled and Identity-Preserving Face Editing, [Paper], [Code]
(arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
(arXiv 2021.08) FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting, [Paper]
(arXiv 2021.08) Learning Fair Face Representation With Progressive Cross Transformer, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
(arXiv 2021.09) TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition, [Paper]

Few-shot Learning

(arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]
(arXiv 2021.04) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]
(arXiv 2021.09) Rich Semantics Improve Few-shot Learning, [Paper], [Code]

GAN

(arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
(arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]
(arXiv 2021.04) VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper], [Code]
(arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]
(arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]
(arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]
(arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]

Gaze

(arXiv 2021.06) Gaze Estimation using Transformer, [Paper], [Code]

HOI

(CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper], [Code]
(arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
(arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
(arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]

Hyperspectral

(arXiv 2021.07) SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers, [Paper], [Code]

In-painting

(ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [Paper], [Code]
(arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [Paper], [Code]
(arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]

Instance Segmentation

(CVPR'21) End-to-End Video Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.04) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]

Layout

(CVPR'21) Variational Transformer Networks for Layout Generation, [Paper]

Matching

(CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]

Medical

(arXiv 2021.02) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.02) Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.03) SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation, [Paper], [Code]
(arXiv 2021.03) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, [Paper], [Code]
(arXiv 2021.03) TransMed: Transformers Advance Multi-modal Medical Image Classification, [Paper]
(arXiv 2021.03) U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, [Paper]
(arXiv 2021.03) UNETR: Transformers for 3D Medical Image Segmentation, [Paper]
(arXiv 2021.04) DeepProg: A Multi-modal Transformer-based End-to-end Framework for Predicting Disease Prognosis, [Paper]
(arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [Paper], [Code]
(arXiv 2021.04) Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification, [Paper]
(arXiv 2021.04) Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer, [Paper]
(arXiv 2021.04) Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, [Paper]
(arXiv 2021.04) Crossmodal Matching Transformer for Interventional in TEVAR, [Paper]
(arXiv 2021.04) GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification, [Paper]
(arXiv 2021.04) Pyramid Medical Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.05) Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy, [Paper]
(arXiv 2021.05) Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.05) Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers, [Paper]
(arXiv 2021.05) Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers, [Paper]
(arXiv 2021.05) Medical Image Segmentation using Squeeze-and-Expansion Transformers, [Paper], [Code]
(arXiv 2021.05) POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound, [Paper]
(arXiv 2021.05) COTR: Convolution in Transformer Network for End to End Polyp Detection, [Paper]
(arXiv 2021.05) PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer, [Paper]
(arXiv 2021.06) TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising, [Paper]
(arXiv 2021.06) A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation, [Paper]
(arXiv 2021.06) Task Transformer Network for Joint MRI Reconstruction and Super-Resolution, [Paper], [Code]
(arXiv 2021.06) DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation, [Paper]
(arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
(arXiv 2021.06) Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image, [Paper]
(arXiv 2021.06) MTrans: Multi-Modal Transformer for Accelerated MR Imaging, [Paper], [Code]
(arXiv 2021.06) Multi-Compound Transformer for Accurate Biomedical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) ResViT: Residual vision transformers for multi-modal medical image synthesis, [Paper]
(arXiv 2021.07) E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception, [Paper]
(arXiv 2021.07) UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation, [Paper]
(arXiv 2021.07) COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models, [Paper]
(arXiv 2021.07) RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting, [Paper], [Code]
(arXiv 2021.07) Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation, [Paper]
(arXiv 2021.07) Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries, [Paper]
(arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification, [Paper]
(arXiv 2021.07) Visual Transformer with Statistical Test for COVID-19 Classification, [Paper]
(arXiv 2021.07) TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Few-Shot Domain Adaptation with Polymorphic Transformers, [Paper], [Code]
(arXiv 2021.07) TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Surgical Instruction Generation with Transformers, [Paper]
(arXiv 2021.07) LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations, [Paper], [Code]
(arXiv 2021.08) Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, [Paper], [Code]
(arXiv 2021.08) Is it Time to Replace CNNs with Transformers for Medical Images, [Paper], [Code]
(arXiv 2021.09) nnFormer: Interleaved Transformer for Volumetric Segmentation, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) MISSFormer: An Effective Medical Image Segmentation Transformer, [Paper]
(arXiv 2021.09) Eformer: Edge Enhancement based Transformer for Medical Image Denoising, [Paper]
(arXiv 2021.09) Transformer-Unet: Raw Image Processing with Unet, [Paper]
(arXiv 2021.09) BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation, [Paper]
(arXiv 2021.09) GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, [Paper]
(arXiv 2021.10) Transformer Assisted Convolutional Network for Cell Instance Segmentation, [Paper]

Motion

(arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]
(arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
(arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
(arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper]

Multi-task/modal

(arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Code]
(arXiv 2021.04) MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
(arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
(arXiv 2021.06) Scene Transformer: A Unified Multi-task Model for Behavior Prediction and Planning, [Paper]
(arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
(arXiv 2021.06) A Transformer-based Cross-modal Fusion Model with Adversarial Training, [Paper]
(arXiv 2021.07) Attention Bottlenecks for Multimodal Fusion, [Paper]
(arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
(arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
(arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
(arXiv 2021.08) StrucTexT: Structured Text Understanding with Multi-Modal Transformers, [Paper]
(arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]

NAS

(CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [Paper], [Code]
(arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
(arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [Paper], [Code]
(arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
(arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper]

Navigation

(ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
(arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
(arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
(arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]

OCR

(arXiv 2021.04) Handwriting Transformers, [Paper]
(arXiv 2021.05) I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition, [Paper]
(arXiv 2021.05) Vision Transformer for Fast and Efficient Scene Text Recognition, [Paper]
(arXiv 2021.06) DocFormer: End-to-End Transformer for Document Understanding, [Paper]
(arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
(arXiv 2021.09) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [Paper], [Code]

Panoptic Segmentation

(arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [Paper]
(arXiv 2021.09) Panoptic SegFormer, [Paper]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]

Point Cloud

(ICRA'21) NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation, [Paper]
(arXiv 2020.12) Point Transformer, [Paper]
(arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
(arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
(arXiv 2021.03) You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module, [Paper], [Code]
(arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
(arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
(arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
(arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
(arXiv 2021.08) SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer, [Paper], [Code]
(arXiv 2021.08) PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
(arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
(arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper], [Code]
(arXiv 2021.09) PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper]

Pose

(arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
(arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
(arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
(arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
(arXiv 2021.03) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper]
(arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
(arXiv 2021.04) TokenPose: Learning Keypoint Tokens for Human Pose Estimation, [Paper]
(arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
(arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
(arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]

Pruning

(arXiv 2021.04) Visual Transformer Pruning, [Paper]

Recognition

(arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [Paper]
(arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
(arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision? [Paper]
(arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
(arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
(arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [Paper]

Reconstruction

(arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
(arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
(arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
(arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]

Re-identification

(arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
(arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
(arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
(arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
(arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper]
(arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
(arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
(arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [Paper], [Code]
(arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [Paper]
(arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper]
(arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]

Restoration

(arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]

Retrieval

(CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
(arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
(arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
(arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [Paper]
(arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
(arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
(arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper], [Code]
(arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [Paper]
(arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

Salient Object Detection

(arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
(arXiv 2021.04) Visual Saliency Transformer, [Paper]
(arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
(arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
(arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper], [Code]
(arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [Paper]

Scene

(arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
(arXiv 2021.05) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
(arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
(arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
(arXiv 2021.07) Spatial-Temporal Transformer for Dynamic Scene Graph Generation, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]

Self-supervised Learning

(arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper], [Code]
(arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
(arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper], [Code]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper], [Code]
(arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
(arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
(arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper]

Semantic Segmentation

(arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
(arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]
(arXiv 2021.06) Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images, [Paper]
(arXiv 2021.06) OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments, [Paper]
(arXiv 2021.07) Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images, [Paper]
(arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
(arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]
(arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]
(arXiv 2021.08) Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation, [Paper], [Code]
(arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper], [Code]
(arXiv 2021.08) Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation, [Paper]
(arXiv 2021.08) Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models, [Paper]
(arXiv 2021.09) Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation, [Paper]

Shape

(WACV'21) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]

Super-Resolution

(CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
(arXiv 2021.06) LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation, [Paper]
(arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]
(arXiv 2021.08) Light Field Image Super-Resolution with Transformers, [Paper], [Code]
(arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]
(arXiv 2021.09) Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution, [Paper]

Synthesis

(arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]
(arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper]
(arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]
(arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]

Tracking

(EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
(CVPR'21) Transformer Tracking, [Paper], [Code]
(CVPR'21) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]
(arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
(arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]
(arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]
(arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]
(arXiv 2021.04) Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]
(arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]
(arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]
(arXiv 2021.08) HiFT: Hierarchical Feature Transformer for Aerial Tracking, [Paper], [Code]

Texture

(arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

Transfer learning

(arXiv 2021.06) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]

Video

(ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
(ICLR'21) Support-set bottlenecks for video-text representation learning, [Paper]
(arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.02) Video Transformer Network, [Paper]
(arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]
(arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]
(arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]
(arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]
(arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]
(arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]
(arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]
(arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]
(arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Project]
(arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]
(arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]
(arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]
(arXiv 2021.05) Local Frequency Domain Transformer Networks for Video Prediction, [Paper]
(arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]
(arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]
(arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]
(arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper]
(arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]
(arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]
(arXiv 2021.06) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]
(arXiv 2021.06) Video Swin Transformer, [Paper], [Code]
(arXiv 2021.06) Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection, [Paper]
(arXiv 2021.07) Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation, [Paper], [Code]
(arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]
(arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]
(arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]
(arXiv 2021.08) Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering, [Paper]
(arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]
(arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]
(arXiv 2021.08) ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) Hierarchical Multimodal Transformer to Summarize Videos, [Paper]

Visual Grounding

(arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]
(arXiv 2021.05) Visual Grounding with Transformers, [Paper]
(arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]
(arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]
(arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

Visual Relationship Detection

(arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]

Voxel

(arXiv 2021.05) SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

Zero-Shot Learning

(arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]

Others

(CVPR'21) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
(CVPR'21) Pre-Trained Image Processing Transformer, [Paper]
(ICCV'21) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]
(arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Code]
(arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
(arXiv 2021.01) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
(arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
(arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]
(arXiv 2021.04) Fourier Image Transformer, [Paper], [Code]
(arXiv 2021.05) Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder, [Paper]
(arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper]
(arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]
(arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper]
(arXiv 2021.06) A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer, [Paper]
(arXiv 2021.06) Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information, [Paper]
(arXiv 2021.06) StyTr2: Unbiased Image Style Transfer with Transformers, [Paper]
(arXiv 2021.06) Semantic Correspondence with Transformers, [Paper]
(arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]
(arXiv 2021.07) Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation, [Paper], [Code]
(arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]
(arXiv 2021.07) PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution, [Paper]
(arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]
(arXiv 2021.08) Applications of Artificial Neural Networks in Microorganism Image Analysis: A Comprehensive Review from Conventional Multilayer Perceptron to Popular Convolutional Neural Network and Potential Visual Transformer, [Paper]
(arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code]
(arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper], [Code]
(arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]
(arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
(arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.08) Investigating transformers in the decomposition of polygonal shapes as point collections, [Paper]
(arXiv 2021.08) Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography, [Paper]
(arXiv 2021.08) Construction material classification on imbalanced datasets for construction monitoring automation using Vision Transformer (ViT) architecture, [Paper]
(arXiv 2021.08) Spatial Transformer Networks for Curriculum Learning, [Paper]
(arXiv 2021.09) TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper], [Code]
(arXiv 2021.10) ProTo: Program-Guided Transformer for Program-Guided Tasks, [Paper]

Contact & Feedback

If you have any suggestions about this project, feel free to contact me.

[e-mail: yzhangcst[at]gmail.com]
