love1005lin / Transformer-in-Vision

Transformer-in-Vision

Recent Transformer-based CV and related works. Welcome to comment/contribute!

Keep updated.

Resource

SCENIC: A JAX Library for Computer Vision Research and Beyond, [Code]
V-L joint learning study (with good tables): [METER], [Kaleido-BERT]
Attention is all you need, [Paper]
OpenAI CLIP [Page], [Paper], [Code], [arXiv]
OpenAI DALL·E [Page], [Code], [Paper]
huggingface/transformers
Kyubyong/transformer, TF
jadore801120/attention-is-all-you-need-pytorch, Torch
krasserm/fairseq-image-captioning
PyTorch Transformers Tutorials
ictnlp/awesome-transformer
basicv8vc/awesome-transformer
dk-liang/Awesome-Visual-Transformer
yuewang-cuhk/awesome-vision-language-pretraining-papers

Survey

(arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]
(arXiv 2022.01) Video Transformers: A Survey, [Paper]
(arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]
(arXiv 2021.11) A Survey of Visual Transformers, [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]
(arXiv 2021.06) A Survey of Transformers, [Paper]
(arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]
(arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]
(arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]
(arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]
(arXiv 2021.01) A Survey on Visual Transformer, [Paper]
(arXiv 2020.09) Efficient Transformers: A Survey, [Paper]
(arXiv 2020.1) Transformers in Vision: A Survey, [Paper]

Recent Papers

(arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]
(arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]
(arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]
(arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]
(arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]
(arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]
(arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
(arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]
(arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]
(arXiv 2022.02) Visual Acoustic Matching, [Paper]
(arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
(arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
(arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]
(arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]
(arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]
(arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]
(arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]
(arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
(arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]
(arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
(arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]
(arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]
(arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]
(arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]
(arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]
(arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]
(arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]
(arXiv 2022.02) Spherical Transformer, [Paper]
(arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]
(arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]
(arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]
(arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]
(arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]
(arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]
(arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]
(arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]
(arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]
(arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
(arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]
(arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]
(arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]
(arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices LoFTR method adaptation approach, [Paper], [Code]
(arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators?, [Paper]
(arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]
(arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

(arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
(arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]
(arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]
(arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]
(arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]
(arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]
(arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]
(arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]
(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]
(arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]
(arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]
(arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]
(arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
(arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]
(arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
(arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]
(arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]
(arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
(arXiv 2022.01) Patches Are All You Need? [Paper], [Code]
(arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]
(arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]
(arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]
(arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
(arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]
(arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]
(arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]
(arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]
(arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]
(arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
(arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]
(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]
(arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
(arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]
(arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]
(arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]
(arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]
(arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]
(arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]
(arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]
(arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
(arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]
(arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]
(arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]
(arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
(arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]
(arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
(arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]
(arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

(arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
(arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]
(arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]
(arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
(arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]
(arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]
(arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]
(arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
(arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]
(arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]
(arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]
(arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]
(arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]
(arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
(arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]
(arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]
(arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]
(arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]
(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]
(arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
(arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]
(arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]
(arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]
(arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]
(arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]
(arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]
(arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]
(arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
(arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
(arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
(arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]
(arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
(arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]
(arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]
(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]
(arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]
(arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]
(arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]
(arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]
(arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]
(arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]
(arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
(arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
(arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]
(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
(arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]
(arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
(arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]
(arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
(arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]
(arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]
(arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
(arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]
(arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]
(arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]
(arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
(arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]
(arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]
(arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]
(arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]
(arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]
(arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]
(arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]
(arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]
(arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
(arXiv 2021.12) Fast Point Transformer, [Paper]
(arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]
(arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]
(arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]
(arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]
(arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]
(arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]
(arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
(arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]
(arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]
(arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]
(arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]
(arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]
(arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]
(arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]
(arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]
(arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
(arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]
(arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
(arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]
(arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]
(arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]
(arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]
(arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]
(arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
(arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]
(arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
(arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]
(arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]
(arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]
(arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
(arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]
(arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]
(arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]
(arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]
(arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]
(arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]
(arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]
(arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]
(arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]
(arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]
(arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]
(arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]
(arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]
(arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]
(arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]
(arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
(arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
(arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
(arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]
(arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]
(arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]
(arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]
(arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
(arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]
(arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2021.11) , [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
(arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]
(arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]
(arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]
(arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]
(arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]
(arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]
(arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]
(arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]
(arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]
(arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]
(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]
(arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]
(arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]
(arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]
(arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]
(arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]
(arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]
(arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
(arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]
(arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

(arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]
(arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]
(arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
(arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]
(arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]
(arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]
(arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]
(arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]
(arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]
(arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]
(arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
(arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]
(arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]
(arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]
(arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]
(arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]
(arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]
(arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]
(arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]
(arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]
(arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
(arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]
(arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]
(arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]
(arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]
(arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
(arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]
(arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
(arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]
(arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
(arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]
(arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]
(arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]
(arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]
(arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]
(arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]
(arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]
(arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]
(arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]
(arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]
(arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]
(arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]
(arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]
(arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]
(arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

(arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]
(arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]
(arXiv 2021.09) Visually Grounded Concept Composition, [Paper]
(arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]
(arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]
(arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]
(arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]
(arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]
(arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]
(arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]
(arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]
(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]
(arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]
(arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]
(arXiv 2021.09) Panoptic Narrative Grounding, [Paper]
(arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]
(arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]
(arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]
(arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]
(arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]
(arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]
(arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]
(arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]
(arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]
(arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]
(ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]
(arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

(arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
(arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]
(arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]
(arXiv 2021.08) TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment, [Paper]
(arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]
(arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]
(arXiv 2021.08) Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training, [Paper]
(arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper]
(arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]
(arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]
(arXiv 2021.08) ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, [Paper], [Code]
(arXiv 2021.08) Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism, [Paper], [Code]
(arXiv 2021.08) End-to-End Dense Video Captioning with Parallel Decoding, [Paper], [Code]
(arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper]
(arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]
(arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2021.08) ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis, [Paper], [Project]
(arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks? [Paper]
(arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper]
(arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
(arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper], [Code]
(arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
(arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]
(arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper], [Code]
(arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
(arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]
(arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]
(arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [Paper]
(arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
(arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]
(arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]
(arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper], [Code]
(arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]
(arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]
(arXiv 2021.08) Joint Inductive and Transductive Learning for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.08) OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning, [Paper]
(arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code-1], [Code-2]
(arXiv 2021.08) TransForensics: Image Forgery Localization with Dense Self-Attention, [Paper]
(arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper]
(arXiv 2021.08) Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models, [Paper], [Code]
(arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper]
(arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]
(arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
(arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]
(arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]
(arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]
(arXiv 2021.08) S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision, [Paper]
(arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]
(arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]
(arXiv 2021.08) CrossFormer: A Versatile Vision Transformer Based on Cross-Scale Attention, [Paper], [Code]
(arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]
(arXiv 2021.08) Transformer-based deep imitation learning for dual-arm robot manipulation, [Paper]
(arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]

2021.07

(arXiv 2021.07) Perceiver IO: A General Architecture for Structured Inputs & Outputs, [Paper], [Code]
(arXiv 2021.07) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining, [Paper]
(arXiv 2021.07) Exceeding the Limits of Visual-Linguistic Multi-Task Learning, [Paper]
(arXiv 2021.07) UIBert: Learning Generic Multimodal Representations for UI Understanding, [Paper]
(arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]
(arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]
(arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]
(arXiv 2021.07) ReFormer: The Relational Transformer for Image Captioning, [Paper]
(arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
(arXiv 2021.07) Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers, [Paper]
(arXiv 2021.07) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]
(arXiv 2021.07) Is Object Detection Necessary for Human-Object Interaction Recognition? [Paper]
(arXiv 2021.07) Don’t Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers, [Paper]
(arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]
(arXiv 2021.07) Go Wider Instead of Deeper, [Paper]
(arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) Mixed SIGNals: Sign Language Production via a Mixture of Motion Primitives, [Paper]
(arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]
(arXiv 2021.07) EAN: Event Adaptive Network for Enhanced Action Recognition, [Paper], [Code]
(arXiv 2021.07) CycleMLP: A MLP-like Architecture for Dense Prediction, [Paper], [Code]
(arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]
(arXiv 2021.07) QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries, [Paper], [Code]
(arXiv 2021.07) PICASO: Permutation-Invariant Cascaded Attentional Set Operator, [Paper], [Code]
(arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
(arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper], [Code]
(arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]
(arXiv 2021.07) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]
(arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
(arXiv 2021.07) How Much Can CLIP Benefit Vision-and-Language Tasks? [Paper]
(arXiv 2021.07) Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms, [Paper], [Code]
(arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]
(arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]
(arXiv 2021.07) Per-Pixel Classification is Not All You Need for Semantic Segmentation, [Paper], [Project]
(arXiv 2021.07) Learning Multi-Scene Absolute Pose Regression with Transformers, [Paper]
(arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]
(arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper], [Code]
(arXiv 2021.07) The Brownian Motion in the Transformer Model, [Paper]
(arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
(arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]
(arXiv 2021.07) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]
(arXiv 2021.07) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]
(arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
(arXiv 2021.07) LanguageRefer: Spatial-Language Model for 3D Visual Grounding, [Paper]
(arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG Based Visual Stimuli Classification, [Paper]
(arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]
(arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]
(arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]
(arXiv 2021.07) VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer, [Paper], [Code]
(arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
(arXiv 2021.07) Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition, [Paper]
(arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
(arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]
(arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]
(arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper], [Code]
(arXiv 2021.07) What Makes for Hierarchical Vision Transformer? [Paper]
(arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]
(arXiv 2021.07) Visual Relationship Forecasting in Videos, [Paper]
(arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
(arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
(arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]
(arXiv 2021.07) CLIP-It! Language-Guided Video Summarization, [Paper], [Code]
(arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]
(arXiv 2021.07) Global Filter Networks for Image Classification, [Paper], [Code]
(arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]
(arXiv 2021.07) OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation, [Paper]
(arXiv 2021.07) TransSC: Transformer-based Shape Completion for Grasp Evaluation, [Paper]
(arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]

2021.06

(arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]
(arXiv 2021.06) Thinking Like Transformers, [Paper]
(arXiv 2021.06) Kernel Identification Through Transformers, [Paper]
(arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper]
(arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
(arXiv 2021.06) Probing Image–Language Transformers for Verb Understanding, [Paper]
(arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code], [Model]
(arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]
(arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
(arXiv 2021.06) CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, [Paper], [Code]
(arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper], [Code]
(arXiv 2021.06) Transformed ROIs for Capturing Visual Transformations in Videos, [Paper]
(arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]
(arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]
(arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]
(arXiv 2021.06) CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings, [Paper]
(arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]
(arXiv 2021.06) Motion Planning Transformers: One Model to Plan Them All, [Paper]
(arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]
(arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]
(arXiv 2021.06) Grounding inductive biases in natural images: invariance stems from variations in data, [Paper]
(arXiv 2021.06) CoAtNet: Marrying Convolution and Attention for All Data Sizes, [Paper]
(arXiv 2021.06) Scaling Vision Transformers, [Paper]
(arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]
(arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]
(arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Code]
(arXiv 2021.06) MVT: Mask Vision Transformer for Facial Expression Recognition in the Wild, [Paper]
(arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]
(arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
(arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]
(arXiv 2021.06) Going Beyond Linear Transformers with Recurrent Fast Weight Programmers, [Paper], [Code]
(arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]
(arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]
(arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
(arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]
(arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]
(arXiv 2021.06) Hybrid Generative-Contrastive Representation Learning, [Paper], [Code]
(arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]
(arXiv 2021.06) VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning, [Paper], [Code]
(arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper], [Code]
(arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]
(arXiv 2021.06) Towards Long-Form Video Understanding, [Paper], [Code]
(arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [Paper]
(arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
(arXiv 2021.06) A Picture May Be Worth a Hundred Words for Visual Question Answering, [Paper]
(arXiv 2021.06) Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, [Paper]
(arXiv 2021.06) Shape registration in the time of transformers, [Paper]
(arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
(arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]
(arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]
(arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]
(arXiv 2021.06) Rethinking Token-Mixing MLP for MLP-based Vision Backbone, [Paper]
(arXiv 2021.06) Augmented Shortcuts for Vision Transformers, [Paper]
(arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]
(arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
(arXiv 2021.06) Attention Bottlenecks for Multimodal Fusion, [Paper]
(arXiv 2021.06) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]
(arXiv 2021.06) Multimodal Few-Shot Learning with Frozen Language Models, [Paper]
(arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
(arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]
(arXiv 2021.06) S^2-MLP: Spatial-Shift MLP Architecture for Vision, [Paper]
(arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]
(arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]
(arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]
(arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper], [Code]
(arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
(arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]
(arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]
(arXiv 2021.06) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]
(arXiv 2021.06) Semantic Correspondence with Transformers, [Paper], [Code]
(arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]
(arXiv 2021.06) MERLOT: Multimodal Neural Script Knowledge Models, [Paper], [Project]
(arXiv 2021.06) SOLQ: Segmenting Objects by Learning Queries, [Paper], [Code]
(arXiv 2021.06) Personalizing Pre-trained Models, [Paper], [Code]
(arXiv 2021.06) E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning, [Paper]
(arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) Container: Context Aggregation Network, [Paper]
(arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper]
(arXiv 2021.06) Video Swin Transformer, [Paper], [Code]
(arXiv 2021.06) IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Code]
(arXiv 2021.06) AudioCLIP: Extending CLIP to Image, Text and Audio, [Paper]
(arXiv 2021.06) Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) Co-advise: Cross Inductive Bias Distillation, [Paper]
(arXiv 2021.06) Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition, [Paper]
(arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
(arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
(arXiv 2021.06) Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding, [Paper]
(arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]
(arXiv 2021.06) ResMLP: Feedforward networks for image classification with data-efficient training, [Paper]
(arXiv 2021.06) Multi-head or Single-head? An Empirical Comparison for Transformer Training, [Paper]
(arXiv 2021.06) Dynamic Head: Unifying Object Detection Heads with Attentions, [Paper], [Code]
(arXiv 2021.06) MLP-Mixer: An all-MLP Architecture for Vision, [Paper], [Code]
(arXiv 2021.06) BEiT: BERT Pre-Training of Image Transformers, [Paper], [Code]
(arXiv 2021.06) Scaling Vision with Sparse Mixture of Experts, [Paper]
(arXiv 2021.06) Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition, [Paper]
(arXiv 2021.06) Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time, [Paper], [Code]
(arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]
(arXiv 2021.06) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
(arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]
(arXiv 2021.06) Pay Attention to MLPs, [Paper]
(arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]
(arXiv 2021.06) StyTr^2: Unbiased Image Style Transfer with Transformers, [Paper]
(arXiv 2021.06) THG: Transformer with Hyperbolic Geometry, [Paper]
(arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper], [Code]
(arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]
(arXiv 2021.06) Reinforcement Learning as One Big Sequence Modeling Problem, [Paper], [Project]
(arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]

2021.05

(arXiv 2021.05) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]
(arXiv 2021.05) Memory-Efficient Differentiable Transformer Architecture Search, [Paper]
(arXiv 2021.05) An Attention Free Transformer, [Paper]
(arXiv 2021.05) On the Bias Against Inductive Biases, [Paper]
(arXiv 2021.05) MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation, [Paper]
(arXiv 2021.05) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]
(arXiv 2021.05) FoveaTer: Foveated Transformer for Image Classification, [Paper]
(arXiv 2021.05) UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis, [Paper]
(arXiv 2021.05) Gaze Estimation using Transformer, [Paper], [Code]
(arXiv 2021.05) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper], [Project]
(arXiv 2021.05) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]
(arXiv 2021.05) Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model, [Paper]
(arXiv 2021.05) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]
(arXiv 2021.05) Sequence Parallelism: Making 4D Parallelism Possible, [Paper]
(arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper], [Code]
(arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]
(arXiv 2021.05) Conformer: Local Features Coupling Global Representations for Visual Recognition, [Paper], [Code]
(arXiv 2021.05) Visual Grounding with Transformers, [Paper]
(arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
(arXiv 2021.05) Are Pre-trained Convolutions Better than Pre-trained Transformers? [Paper]
(arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]
(arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper], [Code]
(arXiv 2021.05) Exploring Explicit and Implicit Visual Relationships for Image Captioning, [Paper]
(arXiv 2021.05) Computer-Aided Design as Language, [Paper]
(arXiv 2021.05) FLEX: Parameter-free Multi-view 3D Human Motion Reconstruction, [Paper], [Project]
(arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper]
(arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]
(arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.05) Towards Robust Vision Transformer, [Paper], [Code]
(arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]
(arXiv 2021.05) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.05) SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition, [Paper]
(arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper]
(arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]
(arXiv 2021.05) Parallel Attention Network with Sequence Matching for Video Grounding, [Paper], [Code]
(arXiv 2021.05) Relative Positional Encoding for Transformers with Linear Complexity, [Paper]
(arXiv 2021.05) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
(arXiv 2021.05) DeepCAD: A Deep Generative Network for Computer-Aided Design Models, [Paper]
(arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]
(arXiv 2021.05) Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks, [Paper], [Code]
(arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper]
(arXiv 2021.05) VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding, [Paper]
(arXiv 2021.05) Improving Generation and Evaluation of Visual Stories via Semantic Consistency, [Paper], [Code]
(arXiv 2021.05) BELT: Blockwise Missing Embedding Learning Transformer, [Paper]
(arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]
(arXiv 2021.05) SAT: 2D Semantics Assisted Training for 3D Visual Grounding, [Paper]
(arXiv 2021.05) Aggregating Nested Transformers, [Paper]
(arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]
(arXiv 2021.05) Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation, [Paper], [Code]
(arXiv 2021.05) Perceptual Image Quality Assessment with Transformers, [Paper]
(arXiv 2021.05) Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, [Paper], [Code]
(arXiv 2021.05) Pay Attention to MLPs, [Paper]
(arXiv 2021.05) ResMLP: Feedforward networks for image classification with data-efficient training, [Paper]
(arXiv 2021.05) RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition, [Paper], [Code]
(arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision? [Paper]
(arXiv 2021.05) FNet: Mixing Tokens with Fourier Transforms, [Paper]
(arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]

2021.04

(arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
(arXiv 2021.04) Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads, [Paper]
(arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
(arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]
(arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
(arXiv 2021.04) Playing Lottery Tickets with Vision and Language, [Paper]
(arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
(arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper], [Code]
(arXiv 2021.04) MDETR: Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
(arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]
(arXiv 2021.04) Effect of Vision-and-Language Extensions on Natural Language Understanding in Vision-and-Language Models, [Paper]
(arXiv 2021.04) Token Labeling: Training an 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]
(arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]
(arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]
(arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]
(arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
(arXiv 2021.04) T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, [Paper]
(arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]
(arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper], [Code]
(arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]
(arXiv 2021.04) Visual Transformer Pruning, [Paper]
(arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]
(arXiv 2021.04) CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, [Paper], [Code]
(arXiv 2021.04) Lessons on Parameter Sharing across Layers in Transformers, [Paper]
(arXiv 2021.04) Disentangled Motif-aware Graph Learning for Phrase Grounding, [Paper]
(arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]
(arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]
(arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
(arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]
(arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time? [Paper], [Code]
(arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
(arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]
(arXiv 2021.04) Shot Contrastive Self-Supervised Learning for Scene Boundary Detection, [Paper]
(arXiv 2021.04) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper]
(arXiv 2021.04) Visual Saliency Transformer, [Paper]
(arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]
(arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]
(arXiv 2021.04) TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]
(arXiv 2021.04) Mesh Graphormer, [Paper], [Code]
(arXiv 2021.04) TrajeVAE: Controllable Human Motion Generation from Trajectories, [Paper]
(arXiv 2021.04) UC^2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training, [Paper]
(arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]
(arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
(arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
(arXiv 2021.04) Going deeper with Image Transformers, [Paper]
(arXiv 2021.04) Efficient Pre-Training Objectives for Transformers, [Paper], [Code]
(arXiv 2021.04) RoFormer: Enhanced Transformer with Rotary Position Embedding, [Paper]
(arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]
(arXiv 2021.04) DODRIO: Exploring Transformer Models with Interactive Visualization, [Paper], [Code]
(arXiv 2021.04) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2021.04) Demystifying the Better Performance of Position Encoding Variants for Transformer, [Paper]
(arXiv 2021.04) Consistent Accelerated Inference via Confident Adaptive Transformers, [Paper], [Code]
(arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Code]
(arXiv 2021.04) Face Transformer for Recognition, [Paper], [Code]
(arXiv 2021.04) VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks, [Paper]
(arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
(arXiv 2021.04) Cross-Modal Retrieval Augmentation for Multi-Modal Classification, [Paper]
(arXiv 2021.04) Point-Based Modeling of Human Clothing, [Paper]
(arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]
(arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper], [Code]
(arXiv 2021.04) Self-supervised Video Object Segmentation by Motion Grouping, [Paper], [Project]
(arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]
(arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
(arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper], [Project]
(arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]
(arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
(arXiv 2021.04) Handwriting Transformers, [Paper]
(arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper]
(arXiv 2021.04) Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation, [Paper]
(arXiv 2021.04) Compressing Visual-linguistic Model via Knowledge Distillation, [Paper]
(arXiv 2021.04) When Pigs Fly: Contextual Reasoning in Synthetic and Natural Scenes, [Paper]
(arXiv 2021.04) Variational Transformer Networks for Layout Generation, [Paper]
(arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]
(arXiv 2021.04) Fourier Image Transformer, [Paper]
(arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]
(arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
(arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
(arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]
(arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]
(arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
(arXiv 2021.04) VisQA: X-raying Vision and Language Reasoning in Transformers, [Paper]
(arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]
(arXiv 2021.04) Language-based Video Editing via Multi-Modal Multi-Level Transformer, [Paper]
(arXiv 2021.04) LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, [Paper]
(arXiv 2021.04) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]
(arXiv 2021.04) Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis, [Paper], [Project]
(arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
(arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
(arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]

2021.03

(arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]
(arXiv 2021.03) PixelTransformer: Sample Conditioned Signal Generation, [Paper], [Code]
(arXiv 2021.03) Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation, [Paper]
(arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]
(arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]
(arXiv 2021.03) StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, [Paper], [Code]
(arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
(arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]
(arXiv 2021.03) Describing and Localizing Multiple Changes with Transformers, [Paper], [Project]
(arXiv 2021.03) COTR: Correspondence Transformer for Matching Across Images, [Paper]
(arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]
(arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]
(arXiv 2021.03) Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers, [Paper]
(arXiv 2021.03) HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, [Paper]
(arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper], [Code]
(arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]
(arXiv 2021.03) Transformer Tracking, [Paper], [Code]
(arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]
(arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, [Paper], [Code]
(arXiv 2021.03) On the Adversarial Robustness of Visual Transformers, [Paper]
(arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
(arXiv 2021.03) Read and Attend: Temporal Localisation in Sign Language Videos, [Paper], [Benchmark]
(arXiv 2021.03) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
(arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]
(arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]
(arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]
(arXiv 2021.03) Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning, [Paper], [Code]
(arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
(arXiv 2021.03) Scene-Intuitive Agent for Remote Embodied Visual Grounding, [Paper]
(arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper]
(arXiv 2021.03) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]
(arXiv 2021.03) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, [Paper], [Code]
(arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
(arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]
(arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]
(arXiv 2021.03) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]
(arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]
(arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]
(arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
(arXiv 2021.03) Paying Attention to Multiscale Feature Maps in Multimodal Image Matching, [Paper]
(arXiv 2021.03) Hopper: Multi-hop Transformer for Spatiotemporal Reasoning, [Paper], [Code]
(arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]
(arXiv 2021.03) AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting, [Paper], [Code]
(arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]
(arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
(arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]
(arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]
(arXiv 2021.03) On the Sentence Embeddings from Pre-trained Language Models, [Paper]
(arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]
(arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
(arXiv 2021.03) Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [Paper]
(arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]
(arXiv 2021.03) Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models, [Paper], [Code]
(arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
(arXiv 2021.03) Causal Attention for Vision-Language Tasks, [Paper], [Code]
(arXiv 2021.03) Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks, [Paper]
(arXiv 2021.03) WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training, [Paper]
(arXiv 2021.03) Attention is not all you need: pure attention loses rank doubly exponentially with depth, [Paper]
(arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
(arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
(arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
(arXiv 2021.03) Perceiver: General Perception with Iterative Attention, [Paper]
(arXiv 2021.03) Transformer in Transformer, [Paper], [Code]
(arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]
(arXiv 2021.03) OmniNet: Omnidirectional Representations from Transformers, [Paper]
(arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]

2021.02

(arXiv 2021.02) Evolving Attention with Residual Convolutions, [Paper]
(arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]
(arXiv 2021.02) SparseBERT: Rethinking the Importance Analysis in Self-attention, [Paper]
(arXiv 2021.02) Investigating the Limitations of Transformers with Simple Arithmetic Tasks, [Paper], [Code]
(arXiv 2021.02) Do Transformer Modifications Transfer Across Implementations and Applications? [Paper]
(arXiv 2021.02) Do We Really Need Explicit Position Encodings for Vision Transformers? [Paper], [Code]
(arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]
(arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]
(arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]
(arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
(arXiv 2021.02) Linear Transformers Are Secretly Fast Weight Memory Systems, [Paper]
(arXiv 2021.02) Position Information in Transformers: An Overview, [Paper]
(arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Project], [Code]
(arXiv 2021.02) Centroid Transformer: Learning to Abstract with Attention, [Paper]
(arXiv 2021.02) Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, [Paper]
(arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
(arXiv 2021.02) End-to-End Audio-Visual Speech Recognition with Conformers, [Paper]
(arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]
(arXiv 2021.02) Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, [Paper], [Code]
(arXiv 2021.02) Video Transformer Network, [Paper]
(arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
(arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
(arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
(arXiv 2021.02) Improving Visual Reasoning by Exploiting The Knowledge in Texts, [Paper]

2021.01

(arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
(arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
(arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
(arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
(arXiv 2021.01) Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, [Paper], [Code]
(arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
(arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Page]
(arXiv 2021.01) Spherical Transformer: Adapting Spherical Signal to CNNs, [Paper]
(arXiv 2021.01) Are We There Yet? Learning to Localize in Embodied Instruction Following, [Paper]
(arXiv 2021.01) VinVL: Making Visual Representations Matter in Vision-Language Models, [Paper]
(arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper]
(arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
(arXiv 2021.01) Addressing Some Limitations of Transformers with Feedback Memory, [Paper]
(arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
(arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
(arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
(arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
(arXiv 2021.01) Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, [Paper]

2020.12

(arXiv 2020.12) Cloud Transformers, [Paper]
(arXiv 2020.12) Accurate Word Representations with Universal Visual Guidance, [Paper]
(arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
(arXiv 2020.12) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
(arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
(arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
(arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2020.12) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
(arXiv 2020.12) Transformer for Image Quality Assessment, [Paper], [Code]
(arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
(arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
(arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper]
(arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
(arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
(arXiv 2020.12) Point Transformer, [Paper]
(arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
(arXiv 2020.12) Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting, [Paper]
(arXiv 2020.12) Pre-Trained Image Processing Transformer, [Paper]
(arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]

2020.11

(arXiv 2020.11) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]
(arXiv 2020.11) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper]
(arXiv 2020.11) End-to-End Video Instance Segmentation with Transformers, [Paper]
(arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
(arXiv 2020.11) General Multi-label Image Classification with Transformers, [[Paper]](https://arxiv.org/pdf/2011.14027)
(arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]

before 2020.11

(arXiv 2020.10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
(arXiv 2020.07) Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2020.07) Feature Pyramid Transformer, [Paper], [Code]
(arXiv 2020.06) Linformer: Self-Attention with Linear Complexity, [Paper]
(arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
(arXiv 2019.08) LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [Paper], [Code]
(ICLR'21) IOT: Instance-wise Layer Reordering for Transformer Structures, [Paper], [Code]
(ICLR'21) UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers, [Paper], [Code]
(ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
(ICLR'21) LambdaNetworks: Modeling Long-Range Interactions Without Attention, [Paper], [Code]
(ICLR'21) Support-set Bottlenecks for Video-text Representation Learning, [Paper]
(ICLR'21) Colorization Transformer, [Paper], [Code]
(ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
(ECCV'20) Connecting Vision and Language with Localized Narratives, [Paper]
(ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
(CVPR'20) PaStaNet: Toward Human Activity Knowledge Engine, [Paper], [Code]
(CVPR'20) Multi-Modality Cross Attention Network for Image and Sentence Matching, [Paper], [Page]
(CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
(CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
(ICPR'20) Transformer Encoder Reasoning Network, [Paper], [Code]
(EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]

TODO

V-L representation learning (https://arxiv.org/pdf/2103.16110.pdf provides a detailed table)
