huiserwang / Transformer-in-Vision

Transformer-in-Vision

Recent Transformer-based CV and related works. Welcome to comment/contribute!

Updated regularly.

Resources

Survey

  • (arXiv 2022.06) Multimodal Learning with Transformers: A Survey, [Paper]

  • (arXiv 2022.05) Vision Transformer: Vit and its Derivatives, [Paper]

  • (arXiv 2022.05) Transformers in 3D Point Clouds: A Survey, [Paper]

  • (arXiv 2022.04) Visual Attention Methods in Deep Learning: An In-Depth Survey, [Paper]

  • (arXiv 2022.04) Vision-and-Language Pretrained Models: A Survey, [Paper]

  • (arXiv 2022.03) A Roadmap for Big Model, [Paper]

  • (arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper]

  • (arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]

  • (arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]

  • (arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]

  • (arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]

  • (arXiv 2022.01) Video Transformers: A Survey, [Paper]

  • (arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]

  • (arXiv 2021.11) A Survey of Visual Transformers, [Paper]

  • (arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]

  • (arXiv 2021.06) A Survey of Transformers, [Paper]

  • (arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]

  • (arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]

  • (arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]

  • (arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]

  • (arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]

  • (arXiv 2021.01) A Survey on Visual Transformer, [Paper]

  • (arXiv 2020.09) Efficient Transformers: A Survey, [Paper]

  • (arXiv 2020.01) Transformers in Vision: A Survey, [Paper]

Recent Papers

2022.07

  • (arXiv 2022.07) Distance Matters in Human-Object Interaction Detection, [Paper]

2022.06

  • (arXiv 2022.06) Rectify ViT Shortcut Learning by Visual Saliency, [Paper]

  • (arXiv 2022.06) Learning Using Privileged Information for Zero-Shot Action Recognition, [Paper]

  • (arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning, [Paper], [Code]

  • (arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer, [Paper], [Project]

  • (arXiv 2022.06) SimA: Simple Softmax-free Attention for Vision Transformers, [Paper], [Code]

  • (arXiv 2022.06) UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS, [Paper], [Project]

  • (arXiv 2022.06) VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix, [Paper], [Code]

  • (arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022, [Paper]

  • (arXiv 2022.06) Video + CLIP Baseline for Ego4D Long-term Action Anticipation, [Paper], [Code]

  • (arXiv 2022.06) What makes domain generalization hard?, [Paper]

  • (arXiv 2022.06) SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, [Paper], [Code]

  • (arXiv 2022.06) Disentangling visual and written concepts in CLIP, [Paper], [Project]

  • (arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos, [Paper]

  • (arXiv 2022.06) Patch-level Representation Learning for Self-supervised Vision Transformers, [Paper]

  • (arXiv 2022.06) Zero-Shot Video Question Answering via Frozen Bidirectional Language Models, [Paper], [Code]

  • (arXiv 2022.06) OmniMAE: Single Model Masked Pretraining on Images and Videos, [Paper], [Code]

  • (arXiv 2022.06) Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, [Paper], [Code]

  • (arXiv 2022.06) LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling, [Paper], [Code]

  • (arXiv 2022.06) Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World, [Paper]

  • (arXiv 2022.06) Rethinking Generalization in Few-Shot Classification, [Paper], [Code]

  • (arXiv 2022.06) VCT: A Video Compression Transformer, [Paper]

  • (arXiv 2022.06) Forecasting of depth and ego-motion with transformers and self-supervision, [Paper]

  • (arXiv 2022.06) Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone, [Paper], [Code]

  • (arXiv 2022.06) SP-ViT: Learning 2D Spatial Priors for Vision Transformers, [Paper]

  • (arXiv 2022.06) A Simple Data Mixing Prior for Improving Self-Supervised Learning, [Paper], [Code]

  • (arXiv 2022.06) Prefix Language Models are Unified Modal Learners, [Paper], [Code]

  • (arXiv 2022.06) Masked Frequency Modeling for Self-Supervised Visual Pre-Training, [Paper], [Code]

  • (arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [Paper]

  • (arXiv 2022.06) A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training, [Paper]

  • (arXiv 2022.06) Learning to Estimate Shapley Values with Vision Transformers, [Paper], [Code]

  • (arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction, [Paper], [Code]

  • (arXiv 2022.06) GLIPv2: Unifying Localization and VL Understanding, [Paper], [Code]

  • (arXiv 2022.06) INDIGO: Intrinsic Multimodality for Domain Generalization, [Paper]

  • (arXiv 2022.06) TRANSDUCTIVE CLIP WITH CLASS-CONDITIONAL CONTRASTIVE LEARNING, [Paper]

  • (arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR OBJECT MANIPULATION, [Paper], [Code]

  • (arXiv 2022.06) MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing, [Paper], [Code]

  • (arXiv 2022.06) Visual Transformer for Object Detection, [Paper]

  • (arXiv 2022.06) Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens, [Paper], [Code]

  • (arXiv 2022.06) TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer, [Paper]

  • (arXiv 2022.06) ReCo: Retrieve and Co-segment for Zero-shot Transfer, [Paper], [Project]

  • (arXiv 2022.06) MAREO: MEMORY- AND ATTENTION-BASED VISUAL REASONING, [Paper]

  • (arXiv 2022.06) Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis, [Paper]

  • (arXiv 2022.06) Object Scene Representation Transformer, [Paper]

  • (arXiv 2022.06) Comprehending and Ordering Semantics for Image Captioning, [Paper], [Code]

  • (arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [Paper]

  • (arXiv 2022.06) Peripheral Vision Transformer, [Paper], [Code]

  • (arXiv 2022.06) Efficient Decoder-free Object Detection with Transformers, [Paper], [Code]

  • (arXiv 2022.06) Prototypical Contrastive Language Image Pretraining, [Paper], [Code]

  • (arXiv 2022.06) SpA-Former: Transformer image shadow detection and removal via spatial attention, [Paper], [Code]

  • (arXiv 2022.06) A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers, [Paper]

  • (arXiv 2022.06) Can Foundation Models Talk Causality? [Paper]

  • (arXiv 2022.06) Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space, [Paper], [Code]

  • (arXiv 2022.06) MaskViT: Masked Visual Pre-Training for Video Prediction, [Paper]

  • (arXiv 2022.06) PromptPose: Language Prompt Helps Animal Pose Estimation, [Paper]

  • (arXiv 2022.06) Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, [Paper]

  • (arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]

  • (arXiv 2022.06) Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation, [Paper]

  • (arXiv 2022.06) Position Labels for Self-Supervised Vision Transformer, [Paper]

  • (arXiv 2022.06) Exploring Feature Self-relation for Self-supervised Transformer, [Paper]

  • (arXiv 2022.06) Patch-based Object-centric Transformers for Efficient Video Generation, [Paper], [Code]

  • (arXiv 2022.06) Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners, [Paper], [Code]

  • (arXiv 2022.06) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons, [Paper]

  • (arXiv 2022.06) CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, [Paper], [Code]

  • (arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [Paper]

  • (arXiv 2022.06) Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer, [Paper]

  • (arXiv 2022.06) cycle text2face: cycle text-to-face gan via transformers, [Paper]

  • (arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [Paper], [Code]

  • (arXiv 2022.06) Transformer based Urdu Handwritten Text Optical Character Reader, [Paper]

  • (arXiv 2022.06) Spatial Entropy Regularization for Vision Transformers, [Paper]

  • (arXiv 2022.06) On Data Scaling in Masked Image Modeling, [Paper]

  • (arXiv 2022.06) Extreme Masking for Learning Instance and Distributed Visual Representations, [Paper]

  • (arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for Online Action Detection, [Paper]

  • (arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [Paper], [Code]

  • (arXiv 2022.06) ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences, [Paper], [Code]

  • (arXiv 2022.06) EAANet: Efficient Attention Augmented Convolutional Networks, [Paper]

  • (arXiv 2022.06) Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning, [Paper]

  • (arXiv 2022.06) Recurrent Video Restoration Transformer with Guided Deformable Attention, [Paper], [Code]

  • (arXiv 2022.06) Rethinking the Openness of CLIP, [Paper]

  • (arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression, [Paper]

  • (arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval, [Paper]

  • (arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR TEXT CLASSIFICATION IN VIDEOS, [Paper]

  • (arXiv 2022.06) Separable Self-attention for Mobile Vision Transformers, [Paper], [Code]

  • (arXiv 2022.06) Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation, [Paper], [Code]

  • (arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, [Paper]

  • (arXiv 2022.06) cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation, [Paper]

  • (arXiv 2022.06) Masked Unsupervised Self-training for Zero-shot Image Classification, [Paper], [Code]

  • (arXiv 2022.06) DETR++: Taming Your Multi-Scale Detection Transformer, [Paper]

  • (arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [Paper]

  • (arXiv 2022.06) Revealing Single Frame Bias for Video-and-Language Learning, [Paper], [Code]

  • (arXiv 2022.06) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]

  • (arXiv 2022.06) Can CNNs Be More Robust Than Transformers? [Paper], [Code]

  • (arXiv 2022.06) Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding, [Paper]

  • (CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation, [Paper]

  • (arXiv 2022.06) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge, [Paper], [Project]

  • (arXiv 2022.06) Revisiting the “Video” in Video-Language Understanding, [Paper], [Project]

  • (arXiv 2022.06) Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction, [Paper]

  • (arXiv 2022.06) Modeling Image Composition for Complex Scene Generation, [Paper], [Code]

  • (arXiv 2022.06) Unified Recurrence Modeling for Video Action Anticipation, [Paper]

  • (arXiv 2022.06) Prefix Conditioning Unifies Language and Label Supervision, [Paper]

  • (arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves Robustness, [Paper], [Code]

  • (arXiv 2022.06) VL-BEIT: Generative Vision-Language Pretraining, [Paper], [Code]

  • (arXiv 2022.06) EfficientFormer: Vision Transformers at MobileNet Speed, [Paper], [Code]

  • (arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering, [Paper]

  • (arXiv 2022.06) Siamese Image Modeling for Self-Supervised Vision Representation Learning, [Paper]

  • (CVPR 2022) Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection, [Paper], [Code]

  • (CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [Paper], [Code]

  • (CVPR 2022) Human Trajectory Prediction with Momentary Observation, [Paper]

  • (arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer, [Paper]

  • (arXiv 2022.06) Unifying Voxel-based Representation with Transformer for 3D Object Detection, [Paper], [Code]

  • (arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [Paper]

  • (arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training, [Paper]

  • (arXiv 2022.06) VALHALLA: Visual Hallucination for Machine Translation, [Paper], [Code]

  • (arXiv 2022.06) Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation, [Paper]

  • (arXiv 2022.06) CLIP4IDC: CLIP for Image Difference Captioning, [Paper], [Code]

  • (arXiv 2022.06) Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment, [Paper]

  • (arXiv 2022.06) Vision GNN: An Image is Worth Graph of Nodes, [Paper], [Code]

  • (arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction, [Paper], [Code]

  • (arXiv 2022.06) TubeFormer-DeepLab: Video Mask Transformer, [Paper]

  • (arXiv 2022.06) Video-based Human-Object Interaction Detection from Tubelet Tokens, [Paper]

2022.05

  • (arXiv 2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [Paper]

  • (arXiv 2022.05) Robotic grasp detection based on Transformer, [Paper]

  • (arXiv 2022.05) Multimodal Masked Autoencoders Learn Transferable Representations, [Paper]

  • (arXiv 2022.05) Multimodal Fake News Detection via CLIP-Guided Learning, [Paper]

  • (arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [Paper]

  • (arXiv 2022.05) Object-wise Masked Autoencoders for Fast Pre-training, [Paper]

  • (arXiv 2022.05) A Closer Look at Self-supervised Lightweight Vision Transformers, [Paper]

  • (arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning, [Paper]

  • (arXiv 2022.05) CYCLIP: Cyclic Contrastive Language-Image Pretraining, [Paper], [Code]

  • (arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with MLP, [Paper], [Code]

  • (arXiv 2022.05) SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners, [Paper], [Code]

  • (arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [Paper]

  • (arXiv 2022.05) Prompt-aligned Gradient for Prompt Tuning, [Paper], [Code]

  • (arXiv 2022.05) Illumination Adaptive Transformer, [Paper], [Code]

  • (arXiv 2022.05) HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling, [Paper]

  • (arXiv 2022.05) GMML is All you Need, [Paper], [Code]

  • (arXiv 2022.05) COMPLETEDT: POINT CLOUD COMPLETION WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [Paper]

  • (arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks, [Paper]

  • (arXiv 2022.05) VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models, [Paper], [Benchmark], [Code]

  • (arXiv 2022.05) Architecture-Agnostic Masked Image Modeling – From ViT back to CNN, [Paper]

  • (arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]

  • (arXiv 2022.05) GIT: A Generative Image-to-text Transformer for Vision and Language, [Paper]

  • (arXiv 2022.05) 3DILG: Irregular Latent Grids for 3D Generative Modeling, [Paper]

  • (arXiv 2022.05) Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos, [Paper], [Code]

  • (arXiv 2022.05) Future Transformer for Long-term Action Anticipation, [Paper], [Project]

  • (arXiv 2022.05) X-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]

  • (arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]

  • (arXiv 2022.05) Dynamic Query Selection for Fast Visual Perceiver, [Paper]

  • (arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers, [Paper]

  • (arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models, [Paper], [Code]

  • (arXiv 2022.05) Supporting Vision-Language Model Inference with Causality-pruning Knowledge Prompt, [Paper]

  • (arXiv 2022.05) Super Vision Transformer, [Paper], [Code]

  • (arXiv 2022.05) mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections, [Paper]

  • (arXiv 2022.05) VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering, [Paper]

  • (arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human Activity Recognition, [Paper]

  • (arXiv 2022.05) Privacy-Preserving Image Classification Using Vision Transformer, [Paper]

  • (arXiv 2022.05) HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval, [Paper]

  • (arXiv 2022.05) ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions, [Paper], [Code]

  • (arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding, [Paper]

  • (arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [Paper]

  • (arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [Paper]

  • (arXiv 2022.05) Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality, [Paper], [Code]

  • (arXiv 2022.05) Visual Concepts Tokenization, [Paper]

  • (arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [Paper]

  • (arXiv 2022.05) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, [Paper], [Code]

  • (arXiv 2022.05) Evidence for Hypodescent in Visual Semantic AI, [Paper]

  • (arXiv 2022.05) Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer, [Paper], [Code]

  • (arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems, [Paper]

  • (arXiv 2022.05) Large Language Models are Zero-Shot Reasoners, [Paper]

  • (arXiv 2022.05) AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, [Paper], [Code]

  • (arXiv 2022.05) Green Hierarchical Vision Transformer for Masked Image Modeling, [Paper], [Code]

  • (arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation, [Paper]

  • (arXiv 2022.05) Cross-Architecture Self-supervised Video Representation Learning, [Paper], [Code]

  • (arXiv 2022.05) Prompt-based Learning for Unpaired Image Captioning, [Paper]

  • (arXiv 2022.05) MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning, [Paper], [Code]

  • (arXiv 2022.05) Fast Vision Transformers with HiLo Attention, [Paper], [Code]

  • (arXiv 2022.05) Fine-grained Image Captioning with CLIP Reward, [Paper], [Code]

  • (arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal Generative Models, [Paper]

  • (arXiv 2022.05) MoCoViT: Mobile Convolutional Vision Transformer, [Paper]

  • (arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object Detection Transformer, [Paper]

  • (arXiv 2022.05) Inception Transformer, [Paper], [Code]

  • (arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation, [Paper]

  • (arXiv 2022.05) UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, [Paper]

  • (arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, [Paper], [Code]

  • (arXiv 2022.05) Training Vision-Language Transformers from Captions Alone, [Paper], [Code]

  • (arXiv 2022.05) Voxel-informed Language Grounding, [Paper], [Code]

  • (arXiv 2022.05) Cross-Enhancement Transformer for Action Segmentation, [Paper]

  • (arXiv 2022.05) TRT-ViT: TensorRT-oriented Vision Transformer, [Paper]

  • (arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection, [Paper]

  • (arXiv 2022.05) A graph-transformer for whole slide image classification, [Paper]

  • (arXiv 2022.05) VNT-Net: Rotational Invariant Vector Neuron Transformers, [Paper]

  • (arXiv 2022.05) Masked Image Modeling with Denoising Contrast, [Paper]

  • (arXiv 2022.05) Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling, [Paper]

  • (arXiv 2022.05) Masked Autoencoders As Spatiotemporal Learners, [Paper]

  • (arXiv 2022.05) BodyMap: Learning Full-Body Dense Correspondence Map, [Paper], [Code]

  • (arXiv 2022.05) Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers, [Paper]

  • (arXiv 2022.05) AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, [Paper]

  • (arXiv 2022.05) Vision Transformer Adapter for Dense Predictions, [Paper], [Code]

  • (arXiv 2022.05) Demo: Real-Time Semantic Communications with a Vision Transformer, [Paper]

  • (arXiv 2022.05) MulT: An End-to-End Multitask Learning Transformer, [Paper], [Code]

  • (arXiv 2022.05) A CLIP-Hitchhiker’s Guide to Long Video Retrieval, [Paper]

  • (arXiv 2022.05) Video Frame Interpolation with Transformer, [Paper], [Code]

  • (arXiv 2022.05) Dense residual Transformer for Image Denoising, [Paper]

  • (arXiv 2022.05) Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT, [Paper]

  • (arXiv 2022.05) Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [Paper]

  • (arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos, [Paper], [Code]

  • (arXiv 2022.05) Learning to Retrieve Videos by Asking Questions, [Paper]

  • (arXiv 2022.05) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [Paper]

  • (arXiv 2022.05) Simple Open-Vocabulary Object Detection with Vision Transformers, [Paper], [Code]

  • (arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation, [Paper], [Code]

  • (arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object Detection with Transformers, [Paper], [Code-DETR], [Code-Deform-DETR]

  • (arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [Paper], [Code]

  • (arXiv 2022.05) Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training, [Paper]

  • (arXiv 2022.05) Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild, [Paper]

  • (arXiv 2022.05) Generalizable Task Planning through Representation Pretraining, [Paper], [Project]

  • (arXiv 2022.05) EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers, [Paper]

  • (arXiv 2022.05) Activating More Pixels in Image Super-Resolution Transformer, [Paper], [Code]

  • (arXiv 2022.05) Row-wise Accelerator for Vision Transformer, [Paper]

  • (arXiv 2022.05) SparseTT: Visual Tracking with Sparse Transformers, [Paper], [Code]

  • (arXiv 2022.05) RoViST: Learning Robust Metrics for Visual Storytelling, [Paper], [Code]

  • (arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection, [Paper]

  • (arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering, [Paper]

  • (arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning, [Paper]

  • (arXiv 2022.05) ConvMAE: Masked Convolution Meets Masked Autoencoders, [Paper], [Code]

  • (arXiv 2022.05) Cross-lingual Adaptation for Recipe Retrieval with Mixup, [Paper]

  • (arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, [Paper]

  • (arXiv 2022.05) Transformer Tracking with Cyclic Shifting Window Attention, [Paper], [Code]

  • (arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning, [Paper]

  • (arXiv 2022.05) Prompt Distribution Learning, [Paper]

  • (arXiv 2022.05) CLIP-CLOP: CLIP-Guided Collage and Photomontage, [Paper]

  • (arXiv 2022.05) Dual-Level Decoupled Transformer for Video Captioning, [Paper]

  • (arXiv 2022.05) Declaration-based Prompt Tuning for Visual Question Answering, [Paper], [Code]

  • (arXiv 2022.05) P^3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision, [Paper]

  • (arXiv 2022.05) Language Models Can See: Plugging Visual Controls in Text Generation, [Paper], [Code]

  • (arXiv 2022.05) YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression, [Paper]

  • (arXiv 2022.05) Cross-view Transformers for real-time Map-view Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.05) i-Code: An Integrative and Composable Multimodal Learning Framework, [Paper]

  • (arXiv 2022.05) Visual Commonsense in Pretrained Unimodal and Multimodal Models, [Paper], [Project]

  • (arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification, [Paper]

  • (arXiv 2022.05) RecipeSnap - a lightweight image to recipe model, [Paper], [Code]

  • (arXiv 2022.05) CoCa: Contrastive Captioners are Image-Text Foundation Models, [Paper]

  • (arXiv 2022.05) Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP), [Paper]

  • (arXiv 2022.05) Cross-modal Representation Learning for Zero-shot Action Recognition, [Paper], [Code]

  • (arXiv 2022.05) Cross-Domain Object Detection with Mean-Teacher Transformer, [Paper]

  • (arXiv 2022.05) Better plain ViT baselines for ImageNet-1k, [Paper], [Code]

  • (arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [Paper]

  • (arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, [Paper]

  • (arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering, [Paper]

  • (arXiv 2022.05) CenterCLIP: Token Clustering for Efficient Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2022.05) Arbitrary Shape Text Detection via Boundary Transformer, [Paper], [Code]

  • (arXiv 2022.05) HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance, [Paper], [Project]

2022.04

  • (arXiv 2022.04) Learn to Understand Negation in Video Retrieval, [Paper]

  • (arXiv 2022.04) LayoutBERT: Masked Language Layout Model for Object Insertion, [Paper]

  • (arXiv 2022.04) Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, [Paper], [Code]

  • (arXiv 2022.04) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [Paper]

  • (arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation, [Paper]

  • (arXiv 2022.04) Where in the World is this Image? Transformer-based Geo-localization in the Wild, [Paper]

  • (arXiv 2022.04) Depth Estimation with Simplified Transformer, [Paper]

  • (arXiv 2022.04) A very preliminary analysis of DALL-E 2, [Paper]

  • (arXiv 2022.04) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, [Paper], [Code]

  • (arXiv 2022.04) CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification, [Paper], [Code]

  • (arXiv 2022.04) TEMOS: Generating diverse human motions from textual descriptions, [Paper], [Project]

  • (arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining, [Paper]

  • (arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [Paper], [Code]

  • (arXiv 2022.04) Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos, [Paper], [Code]

  • (arXiv 2022.04) CapOnImage: Context-driven Dense-Captioning on Image, [Paper]

  • (arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]

  • (arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [Paper]

  • (arXiv 2022.04) Self-Driving Car Steering Angle Prediction: Let Transformer Be a Car Again, [Paper], [Code]

  • (arXiv 2022.04) ClothFormer: Taming Video Virtual Try-on in All Module, [Paper]

  • (arXiv 2022.04) Deeper Insights into ViTs Robustness towards Common Corruptions, [Paper]

  • (arXiv 2022.04) VITPOSE: SIMPLE VISION TRANSFORMER BASELINES FOR HUMAN POSE ESTIMATION, [Paper], [Code]

  • (arXiv 2022.04) Understanding The Robustness in Vision Transformers, [Paper], [Code]

  • (arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval, [Paper]

  • (arXiv 2022.04) Contrastive Language-Action Pre-training for Temporal Localization, [Paper]

  • (arXiv 2022.04) Boosting Adversarial Transferability of MLP-Mixer, [Paper]

  • (arXiv 2022.04) Adaptive Split-Fusion Transformer, [Paper], [Code]

  • (arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? [Paper], [Project]

  • (arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR VISUAL RELATIONAL REASONING, [Paper]

  • (arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [Paper], [Code]

  • (arXiv 2022.04) CLIP-DISSECT: AUTOMATIC DESCRIPTION OF NEURON REPRESENTATIONS IN DEEP VISION NETWORKS, [Paper]

  • (arXiv 2022.04) Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers, [Paper]

  • (arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [Paper], [Code]

  • (arXiv 2022.04) OCFormer: One-Class Transformer Network for Image Classification, [Paper]

  • (arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [Paper], [Code]

  • (arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering, [Paper], [Code]

  • (arXiv 2022.04) ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer, [Paper]

  • (arXiv 2022.04) iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition, [Paper], [Code]

  • (arXiv 2022.04) DIVERSE INSTANCE DISCOVERY: VISION-TRANSFORMER FOR INSTANCE-AWARE MULTI-LABEL IMAGE RECOGNITION, [Paper]

  • (arXiv 2022.04) Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, [Paper], [Code]

  • (arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object detection, [Paper]

  • (arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [Paper], [Code]

  • (arXiv 2022.04) Video Moment Retrieval from Text Queries via Single Frame Annotation, [Paper]

  • (arXiv 2022.04) GIMO: Gaze-Informed Human Motion Prediction in Context, [Paper]

  • (arXiv 2022.04) VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, [Paper]

  • (arXiv 2022.04) Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments, [Paper]

  • (arXiv 2022.04) Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer, [Paper], [Code]

  • (arXiv 2022.04) Multimodal Token Fusion for Vision Transformers, [Paper]

  • (arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [Paper]

  • (arXiv 2022.04) Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks, [Paper]

  • (arXiv 2022.04) Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting, [Paper]

  • (arXiv 2022.04) Multi-Frame Self-Supervised Depth with Transformers, [Paper], [Code]

  • (arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction, [Paper], [Code]

  • (arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis, [Paper], [Code]

  • (arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object Detector, [Paper], [Code]

  • (arXiv 2022.04) VDTR: Video Deblurring with Transformer, [Paper], [Code]

  • (arXiv 2022.04) BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment, [Paper], [Code]

  • (arXiv 2022.04) Temporally Efficient Vision Transformer for Video Instance Segmentation, [Paper], [Code]

  • (arXiv 2022.04) VSA: Learning Varied-Size Window Attention in Vision Transformers, [Paper], [Code]

  • (arXiv 2022.04) XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding, [Paper]

  • (arXiv 2022.04) IMPROVING CROSS-MODAL UNDERSTANDING IN VISUAL DIALOG VIA CONTRASTIVE LEARNING, [Paper]

  • (arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [Paper], [Code]

  • (arXiv 2022.04) UNCONDITIONAL IMAGE-TEXT PAIR GENERATION WITH MULTIMODAL CROSS QUANTIZER, [Paper]

  • (arXiv 2022.04) Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference, [Paper]

  • (arXiv 2022.04) COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval, [Paper]

  • (arXiv 2022.04) Image Captioning In the Transformer Age, [Paper], [Code]

  • (arXiv 2022.04) ResT V2: Simpler, Faster and Stronger, [Paper], [Code]

  • (arXiv 2022.04) Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer, [Paper], [Code]

  • (arXiv 2022.04) Temporal Progressive Attention for Early Action Prediction, [Paper], [Code]

  • (arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval, [Paper]

  • (arXiv 2022.04) Flamingo: a Visual Language Model for Few-Shot Learning, [Paper]

  • (arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]

  • (arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]

  • (arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]

  • (arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]

  • (arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]

  • (arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]

  • (arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]

  • (arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]

  • (arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]

  • (arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]

  • (arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]

  • (arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]

  • (arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]

  • (arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]

  • (arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]

  • (arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]

  • (arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]

  • (arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality? [Paper]

  • (arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]

  • (arXiv 2022.04) Event Transformer, [Paper]

  • (arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]

  • (arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]

  • (arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]

  • (arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]

  • (arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]

  • (arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]

  • (arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]

  • (arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]

  • (arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]

  • (arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.04) Learning to Induce Causal Structure, [Paper]

  • (arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]

  • (arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks?, [Paper]

  • (arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]

  • (arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]

  • (arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]

  • (arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]

  • (arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]

  • (arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]

  • (arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]

  • (arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]

  • (arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]

  • (arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]

  • (arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING, [Paper]

  • (arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]

  • (arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]

  • (arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]

  • (arXiv 2022.04) CM3: A CAUSAL MASKED MULTIMODAL MODEL OF THE INTERNET, [Paper]

  • (arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN ROBOTIC AFFORDANCES, [Paper], [Project]

  • (arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]

  • (arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]

  • (arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]

  • (arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]

  • (arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]

  • (arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]

  • (arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]

  • (arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]

  • (arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]

  • (arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]

  • (arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]

  • (arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]

  • (arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]

  • (arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]

  • (arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]

  • (arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON FACIAL EXPRESSION RECOGNITION TASK, [Paper]

  • (arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]

  • (arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]

  • (arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]

  • (arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]

  • (arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

  • (arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]

  • (arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]

  • (arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]

  • (arXiv 2022.03) Deformable Video Transformer, [Paper]

  • (arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]

  • (arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]

  • (arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]

  • (arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]

  • (arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]

  • (arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]

  • (arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]

  • (arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]

  • (arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]

  • (arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]

  • (arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]

  • (arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]

  • (arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]

  • (arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]

  • (arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]

  • (arXiv 2022.03) PROMPTDET: EXPAND YOUR DETECTOR VOCABULARY WITH UNCURATED IMAGES, [Paper], [Code]

  • (arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]

  • (arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]

  • (arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]

  • (arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]

  • (arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]

  • (arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]

  • (arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]

  • (arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR NOISY IMAGE CLASSIFICATION, [Paper]

  • (arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts?, [Paper]

  • (arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]

  • (arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]

  • (arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]

  • (arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]

  • (arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]

  • (arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]

  • (arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]

  • (arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]

  • (arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]

  • (arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]

  • (arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]

  • (arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]

  • (arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]

  • (arXiv 2022.03) FACIAL EXPRESSION RECOGNITION WITH SWIN TRANSFORMER, [Paper]

  • (arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]

  • (arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]

  • (arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]

  • (arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]

  • (arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]

  • (arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]

  • (arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]

  • (arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]

  • (arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]

  • (arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]

  • (arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]

  • (arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]

  • (arXiv 2022.03) Visual Prompt Tuning, [Paper]

  • (arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]

  • (arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]

  • (arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]

  • (arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]

  • (arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]

  • (arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [Project]

  • (arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]

  • (arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]

  • (arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]

  • (arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]

  • (arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]

  • (arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]

  • (arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]

  • (arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]

  • (arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]

  • (arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]

  • (arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]

  • (arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]

  • (arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]

  • (arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]

  • (arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]

  • (arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]

  • (arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]

  • (arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]

  • (arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]

  • (arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]

  • (arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]

  • (arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]

  • (arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]

  • (arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]

  • (arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]

  • (arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]

  • (arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]

  • (arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]

  • (arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]

  • (arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]

  • (arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]

  • (arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]

  • (arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]

  • (arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]

  • (arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]

  • (arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]

  • (arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]

  • (arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]

  • (arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]

  • (arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]

  • (arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]

  • (arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]

  • (arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]

  • (arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]

  • (arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]

  • (arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]

  • (arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]

  • (arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]

  • (arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]

  • (arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]

  • (arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations? [Paper], [Code]

  • (arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]

  • (arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]

  • (arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]

  • (arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]

  • (arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]

  • (arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]

  • (arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]

  • (arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]

  • (arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]

  • (arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]

  • (arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]

  • (arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]

  • (arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS? [Paper], [Code]

  • (arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]

  • (arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]

  • (arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]

  • (arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]

  • (arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]

  • (arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]

  • (arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]

  • (arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]

  • (arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]

  • (arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]

  • (arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]

  • (arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]

  • (arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]

  • (arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]

  • (arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]

  • (arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]

  • (arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]

  • (arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]

  • (arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]

  • (arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]

  • (arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]

  • (arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]

  • (arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]

  • (arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]

  • (arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]

  • (arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]

  • (arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]

  • (arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]

  • (arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]

  • (arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]

  • (arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]

  • (arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]

  • (arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]

  • (arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]

  • (arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]

  • (arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]

  • (arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]

  • (arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]

  • (arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]

  • (arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]

  • (arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]

  • (arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]

  • (arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]

  • (arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]

  • (arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]

  • (arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]

  • (arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]

  • (arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]

  • (arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]

  • (arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]

  • (arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]

  • (arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]

  • (arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]

  • (arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]

  • (arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]

  • (arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]

  • (arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]

  • (arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]

  • (arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]

  • (arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]

  • (arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]

  • (arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]

  • (arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]

  • (arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]

  • (arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]

  • (arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]

  • (arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]

  • (arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]

  • (arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]

  • (arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]

  • (arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]

  • (arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]

  • (arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]

  • (arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]

  • (arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]

  • (arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]

  • (arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]

  • (arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]

  • (arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]

  • (arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]

  • (arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]

  • (arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]

  • (arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]

  • (arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]

  • (arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]

  • (arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]

  • (arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]

  • (arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]

  • (arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]

  • (arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]

  • (arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

  • (arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]

  • (arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]

  • (arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]

  • (arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]

  • (arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]

  • (arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]

  • (arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]

  • (arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]

  • (arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]

  • (arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]

  • (arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]

  • (arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]

  • (arXiv 2022.02) Hierarchical Perceiver, [Paper]

  • (arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]

  • (arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]

  • (arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]

  • (arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]

  • (arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]

  • (arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]

  • (arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]

  • (arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]

  • (arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]

  • (arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]

  • (arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]

  • (arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]

  • (arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]

  • (arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]

  • (arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]

  • (arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]

  • (arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]

  • (arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]

  • (arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]

  • (arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]

  • (arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]

  • (arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]

  • (arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]

  • (arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]

  • (arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]

  • (arXiv 2022.02) Visual Acoustic Matching, [Paper]

  • (arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]

  • (arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]

  • (arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]

  • (arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]

  • (arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]

  • (arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]

  • (arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]

  • (arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]

  • (arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]

  • (arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]

  • (arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]

  • (arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]

  • (arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]

  • (arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]

  • (arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]

  • (arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]

  • (arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]

  • (arXiv 2022.02) Spherical Transformer, [Paper]

  • (arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]

  • (arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]

  • (arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]

  • (arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]

  • (arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]

  • (arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]

  • (arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]

  • (arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]

  • (arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]

  • (arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]

  • (arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]

  • (arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]

  • (arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]

  • (arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]

  • (arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]

  • (arXiv 2022.02) Local Feature Matching with Transformers for low-end devices LoFTR method adaptation approach, [Paper], [Code]

  • (arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators? [Paper]

  • (arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]

  • (arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

  • (arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]

  • (arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]

  • (arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]

  • (arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]

  • (arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]

  • (arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]

  • (arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]

  • (arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]

  • (arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]

  • (arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]

  • (arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]

  • (arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]

  • (arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]

  • (arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]

  • (arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]

  • (arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]

  • (arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]

  • (arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]

  • (arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]

  • (arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]

  • (arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]

  • (arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]

  • (arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]

  • (arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]

  • (arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]

  • (arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]

  • (arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]

  • (arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]

  • (arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]

  • (arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]

  • (arXiv 2022.01) Patches Are All You Need? [Paper], [Code]

  • (arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]

  • (arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]

  • (arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]

  • (arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]

  • (arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]

  • (arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]

  • (arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]

  • (arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]

  • (arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]

  • (arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]

  • (arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]

  • (arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]

  • (arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]

  • (arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]

  • (arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]

  • (arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]

  • (arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]

  • (arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]

  • (arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]

  • (arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]

  • (arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]

  • (arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]

  • (arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]

  • (arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]

  • (arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]

  • (arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]

  • (arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]

  • (arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]

  • (arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]

  • (arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]

  • (arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]

  • (arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]

  • (arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]

  • (arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]

  • (arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]

  • (arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]

  • (arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]

  • (arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]

  • (arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]

  • (arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]

  • (arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]

  • (arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]

  • (arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]

  • (arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]

  • (arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]

  • (arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]

  • (arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]

  • (arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]

  • (arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]

  • (arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]

  • (arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]

  • (arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]

  • (arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]

  • (arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

  • (arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]

  • (arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]

  • (arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]

  • (arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]

  • (arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]

  • (arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

  • (arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]

  • (arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]

  • (arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]

  • (arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]

  • (arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain? [Paper]

  • (arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]

  • (arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]

  • (arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]

  • (arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]

  • (arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]

  • (arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]

  • (arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]

  • (arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]

  • (arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]

  • (arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

  • (arXiv 2021.12) ViR: the Vision Reservoir, [Paper]

  • (arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]

  • (arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]

  • (arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]

  • (arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]

  • (arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]

  • (arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]

  • (arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]

  • (arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]

  • (arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]

  • (arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]

  • (arXiv 2021.12) MPViT : Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

  • (arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]

  • (arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]

  • (arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]

  • (arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]

  • (arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]

  • (arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]

  • (arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]

  • (arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]

  • (arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]

  • (arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

  • (arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]

  • (arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]

  • (arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]

  • (arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

  • (arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]

  • (arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]

  • (arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]

  • (arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]

  • (arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]

  • (arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]

  • (arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]

  • (arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]

  • (arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]

  • (arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]

  • (arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]

  • (arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]

  • (arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]

  • (arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]

  • (arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]

  • (arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]

  • (arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]

  • (arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]

  • (arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]

  • (arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]

  • (arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]

  • (arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]

  • (arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]

  • (arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]

  • (arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]

  • (arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

  • (arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]

  • (arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]

  • (arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]

  • (arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]

  • (arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]

  • (arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]

  • (arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]

  • (arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]

  • (arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]

  • (arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]

  • (arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]

  • (arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]

  • (arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]

  • (arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]

  • (arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]

  • (arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]

  • (arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]

  • (arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]

  • (arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]

  • (arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]

  • (arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]

  • (arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]

  • (arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]

  • (arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]

  • (arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]

  • (arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]

  • (arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]

  • (arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]

  • (arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]

  • (arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]

  • (arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]

  • (arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]

  • (arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]

  • (arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]

  • (arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]

  • (arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]

  • (arXiv 2021.12) Transformer based trajectory prediction, [Paper]

  • (arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]

  • (arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]

  • (arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]

  • (arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]

  • (arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]

  • (arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]

  • (arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]

  • (arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]

  • (arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]

  • (arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]

  • (arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]

  • (arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]

  • (arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]

  • (arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]

  • (arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]

  • (arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]

  • (arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]

  • (arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]

  • (arXiv 2021.12) Fast Point Transformer, [Paper]

  • (arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]

  • (arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]

  • (arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]

  • (arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]

  • (arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]

  • (arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]

  • (arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]

  • (arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]

  • (arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]

  • (arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]

  • (arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]

  • (arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]

  • (arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]

  • (arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]

  • (arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]

  • (arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]

  • (arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]

  • (arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]

  • (arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]

  • (arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]

  • (arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]

  • (arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]

  • (arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]

  • (arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]

  • (arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]

  • (arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

  • (arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]

  • (arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]

  • (arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]

  • (arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]

  • (arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

  • (arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]

  • (arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

  • (arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

  • (arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

  • (arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]

  • (arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]

  • (arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]

  • (arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]

  • (arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]

  • (arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]

  • (arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]

  • (arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]

  • (arXiv 2021.11) Ice hockey player identification via transformers, [Paper]

  • (arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]

  • (arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]

  • (arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]

  • (arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]

  • (arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]

  • (arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]

  • (arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]

  • (arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]

  • (arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]

  • (arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]

  • (arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]

  • (arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]

  • (arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]

  • (arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]

  • (arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]

  • (arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]

  • (arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]

  • (arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]

  • (arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]

  • (arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]

  • (arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]

  • (arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]

  • (arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]

  • (arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]

  • (arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]

  • (arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]

  • (arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]

  • (arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]

  • (arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]

  • (arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]

  • (arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]

  • (arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]

  • (arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]

  • (arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]

  • (arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]

  • (arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]

  • (arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]

  • (arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]

  • (arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]

  • (arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]

  • (arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]

  • (arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]

  • (arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]

  • (arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

  • (arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]

  • (arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]

  • (arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]

  • (arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]

  • (arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]

  • (arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]

  • (arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]

  • (arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]

  • (arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]

  • (arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]

  • (arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]

  • (arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]

  • (arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]

  • (arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]

  • (arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]

  • (arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]

  • (arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]

  • (arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]

  • (arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]

  • (arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]

  • (arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

  • (arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]

  • (arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]

  • (arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]

  • (arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]

  • (arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]

  • (arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]

  • (arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]

  • (arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]

  • (arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]

  • (arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]

  • (arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]

  • (arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]

  • (arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]

  • (arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]

  • (arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]

  • (arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]

  • (arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

  • (arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]

  • (arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]

  • (arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]

  • (arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]

  • (arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]

  • (arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]

  • (arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]

  • (arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]

  • (arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]

  • (arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]

  • (arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]

  • (arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]

  • (arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]

  • (arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]

  • (arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]

  • (arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]

  • (arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]

  • (arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]

  • (arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]

  • (arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]

  • (arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]

  • (arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]

  • (arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]

  • (arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]

  • (arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]

  • (arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]

  • (arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

  • (arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]

  • (arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]

  • (arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]

  • (arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]

  • (arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]

  • (arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]

  • (arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]

  • (arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]

  • (arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]

  • (arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]

  • (arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]

  • (arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]

  • (arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]

  • (arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]

  • (arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]

  • (arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]

  • (arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]

  • (arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]

  • (arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]

  • (arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]

  • (arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]

  • (arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]

  • (arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]

  • (arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]

  • (arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]

  • (arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]

  • (arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]

  • (arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]

  • (arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]

  • (arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]

  • (arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]

  • (arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]

  • (arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]

  • (arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]

  • (arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]

  • (arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]

  • (arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]

  • (arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]

  • (arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]

  • (arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]

  • (arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]

  • (arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]

  • (arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]

  • (arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]

  • (arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]

  • (arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]

  • (arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]

  • (arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]

  • (arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]

  • (arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]

  • (arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]

  • (arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]

  • (arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]

  • (arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]

  • (arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]

  • (arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]

  • (arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]

  • (arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]

  • (arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]

  • (arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]

  • (arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]

  • (arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]

  • (arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]

  • (arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]

  • (arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]

  • (arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]

  • (arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]

  • (arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]

  • (arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]

  • (arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]

  • (arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]

  • (arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]

  • (arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]

  • (arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

  • (arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]

  • (arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]

  • (arXiv 2021.09) Visually Grounded Concept Composition, [Paper]

  • (arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]

  • (arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]

  • (arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]

  • (arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]

  • (arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]

  • (arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]

  • (arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]

  • (arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]

  • (arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]

  • (arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]

  • (arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]

  • (arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]

  • (arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]

  • (arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]

  • (arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]

  • (arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]

  • (arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]

  • (arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]

  • (arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]

  • (arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]

  • (arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]

  • (arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]

  • (arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]

  • (arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]

  • (arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]

  • (arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]

  • (arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]

  • (arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]

  • (arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]

  • (arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]

  • (arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]

  • (arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]

  • (arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

  • (arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]

  • (arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]

  • (arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]

  • (arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]

  • (arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]

  • (arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]

  • (arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]

  • (arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]

  • (arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]

  • (arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]

  • (arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]

  • (arXiv 2021.09) Panoptic Narrative Grounding, [Paper]

  • (arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]

  • (arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]

  • (arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]

  • (arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]

  • (arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]

  • (arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]

  • (arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]

  • (arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]

  • (arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]

  • (arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]

  • (arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]

  • (arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]

  • (arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]

  • (arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]

  • (arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]

  • (arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]

  • (arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]

  • (arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]

  • (arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]

  • (arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]

  • (arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]

  • (arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]

  • (arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]

  • (arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

  • (ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]

  • (arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]

  • (arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]

  • (arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

  • (arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]

  • (arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]

  • (arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]

  • (arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]

  • (arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]

  • (arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]

  • (arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]

  • (arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]

  • (arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]

  • (arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]

  • (arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
