The following repositories are listed under the vision-transformer topic.
OpenMMLab Detection Toolbox and Benchmark
pix2tex: Using a ViT to convert images of equations into LaTeX code (a usage sketch follows this list).
This repository contains demos I made with the Transformers library by HuggingFace (a classification sketch follows this list).
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
SwinIR: Image Restoration Using Swin Transformer (official repository)
A comprehensive paper list on Vision Transformers/Attention, including papers, code, and related websites
Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.
OpenMMLab Pre-training Toolbox and Benchmark
Scenic: A Jax Library for Computer Vision Research and Beyond
Towhee is a framework dedicated to making neural data processing pipelines simple and fast.
Efficient vision foundation models for high-resolution generation and perception.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
EVA Series: Visual Representation Fantasies from BAAI
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[CVPR 2021] Official PyTorch implementation for Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications by Transformer-based networks.
The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
[CVPR 2025] Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
VRT: A Video Restoration Transformer (official repository)
[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
[ICCV 2021] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Awesome List of Attention Modules and Plug&Play Modules in Computer Vision
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Explainability for Vision Transformers
A curated list of foundation models for vision and language tasks
UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery (ISPRS). Also includes other vision transformers and CNNs for satellite, aerial, and UAV image segmentation.
SOTA Semantic Segmentation Models in PyTorch
[ICLR 2024] Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention
Repository of Vision Transformer with Deformable Attention (CVPR 2022) and DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
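The pix2tex entry above converts images of equations into LaTeX. A minimal usage sketch, assuming the package exposes `LatexOCR` as described in its README; the image path is a placeholder:

```python
from PIL import Image
from pix2tex.cli import LatexOCR  # assumes pix2tex is installed (pip install pix2tex)

# Load the pretrained ViT-based OCR model (weights are downloaded on first use).
model = LatexOCR()

# Placeholder path: point this at a cropped screenshot of a single equation.
img = Image.open("equation.png")

# Returns the predicted LaTeX source for the equation as a string.
print(model(img))
```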
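The HuggingFace Transformers demos above cover ViT models, among others. A minimal sketch of ViT image classification via the `pipeline` API; the checkpoint id and image path are illustrative and can be swapped for any ViT image-classification checkpoint on the Hub:

```python
from transformers import pipeline

# Image-classification pipeline backed by a ViT checkpoint from the Hub.
# "google/vit-base-patch16-224" is used here for illustration.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Accepts a local file path, a URL, or a PIL image; path below is a placeholder.
predictions = classifier("cat.png")

# Each prediction is a dict with a class label and a confidence score.
for p in predictions:
    print(p["label"], round(p["score"], 3))
```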