Repositories under the vision-transformer topic:
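For orientation before the list: a Vision Transformer (ViT) splits an image into fixed-size patches, linearly embeds each patch, prepends a class token, adds position embeddings, and runs the sequence through a standard Transformer encoder. A minimal PyTorch sketch of that idea (module and hyperparameter names are illustrative, not tied to any repository below):

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal Vision Transformer: patchify -> embed -> encode -> classify."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution (one kernel per patch).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.patch_embed(x)               # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])             # classify from the class token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```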
OpenMMLab Detection Toolbox and Benchmark
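A minimal inference sketch for MMDetection, assuming the high-level API from the 2.x releases (3.x replaced it with a `DetInferencer` class); the config and checkpoint paths are placeholders:

```python
from mmdet.apis import init_detector, inference_detector

config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'  # placeholder
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'     # placeholder

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo.jpg')  # per-class bounding boxes
```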
pix2tex: Using a ViT to convert images of equations into LaTeX code.
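A minimal usage sketch for pix2tex, assuming the `LatexOCR` entry point its README documents; the image path is a placeholder:

```python
from PIL import Image
from pix2tex.cli import LatexOCR

model = LatexOCR()                  # downloads model weights on first use
img = Image.open('equation.png')    # placeholder: an image of an equation
print(model(img))                   # prints the predicted LaTeX string
```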
This repository contains demos I made with the Transformers library by HuggingFace.
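In the spirit of those demos, a minimal ViT image-classification sketch with the Transformers library (the checkpoint is one published by Google on the Hugging Face Hub; the input image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

image = Image.open('cat.jpg')       # placeholder input image
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```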
A comprehensive paper list on Vision Transformers and attention, including papers, code, and related websites
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
SwinIR: Image Restoration Using Swin Transformer (official repository)
Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.
[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
OpenMMLab Pre-training Toolbox and Benchmark
Scenic: A Jax Library for Computer Vision Research and Beyond
Towhee is a framework dedicated to making neural data processing pipelines simple and fast.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
EVA Series: Visual Representation Fantasies from BAAI
[CVPR 2021] Official PyTorch implementation for Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications by Transformer-based networks.
EfficientViT is a new family of vision models for efficient high-resolution vision.
VRT: A Video Restoration Transformer (official repository)
The official repo for [NeurIPS'22] "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" and [TPAMI'23] "ViTPose++: Vision Transformer for Generic Body Pose Estimation"
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
[ICCV 2021] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Awesome List of Attention Modules and Plug&Play Modules in Computer Vision
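As one concrete example of the plug-and-play modules such lists catalog, here is a squeeze-and-excitation (SE) channel-attention block, sketched in plain PyTorch (a generic baseline, not code from the list itself):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excite: per-channel gates
        return x * w                                # rescale the feature map

y = SEBlock(64)(torch.randn(2, 64, 32, 32))  # drop-in after any conv block
```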
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Extract markdown and images from URLs, PDFs, docs, slides, and more, ready for multimodal LLMs. ⚡
(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
SOTA Semantic Segmentation Models in PyTorch
Explainability for Vision Transformers
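A common baseline behind such explainability tools is attention rollout: average each layer's attention over heads, add the identity to account for residual connections, renormalize, and compose across layers. A self-contained sketch (a generic baseline, not the specific method of either repository above):

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (heads, tokens, tokens) attention maps, one per layer.
    Returns a (tokens, tokens) map of output-to-input attention flow."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for attn in attentions:
        a = attn.mean(axis=0)                  # average over heads
        a = a + np.eye(n)                      # account for residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = a @ rollout                  # compose with earlier layers
    return rollout

# rollout[0, 1:] scores each patch's influence on the class token.
```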
[ICLR 2024] Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention
Repository of Vision Transformer with Deformable Attention (CVPR 2022) and DAT++: Spatially Dynamic Vision Transformer with Deformable Attention
Official PyTorch implementation of the NeurIPS 2021 paper "ImageNet-21K Pretraining for the Masses"
A curated list of foundation models for vision and language tasks
Vision-Centric BEV Perception: A Survey