OpenGVLab

Organization data from GitHub: https://github.com/OpenGVLab

General Vision Team of Shanghai AI Laboratory

Home Page: https://opengvlab.shlab.org.cn

GitHub: @OpenGVLab

Twitter: @opengvlab

OpenGVLab's repositories

InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o-level performance (a minimal loading sketch follows this entry).

Language: Python · License: MIT · Stargazers: 9,444 · Issues: 1,079
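
For reference, here is a minimal sketch of loading an InternVL chat checkpoint through Hugging Face Transformers. The checkpoint id OpenGVLab/InternVL2-8B and the model.chat() call are assumptions to be checked against the InternVL README, not a definitive interface.

    # Minimal text-only chat sketch for an InternVL checkpoint.
    # Assumed checkpoint id and chat() signature; verify against the repo's README.
    import torch
    from transformers import AutoModel, AutoTokenizer

    path = "OpenGVLab/InternVL2-8B"  # assumed checkpoint id
    model = AutoModel.from_pretrained(
        path, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).eval()  # device placement (e.g. .cuda()) omitted for brevity
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

    # Pure-text conversation: passing pixel_values=None skips the vision branch.
    question = "Hello, who are you?"
    response = model.chat(tokenizer, None, question, dict(max_new_tokens=64))
    print(response)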

InternVideo

[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding

Language: Python · License: Apache-2.0 · Stargazers: 2,099 · Issues: 281

OmniQuant

[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs (a baseline quantization sketch follows this entry).

Language: Python · License: MIT · Stargazers: 867 · Issues: 91
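
To make the setting concrete, below is a generic per-channel, round-to-nearest weight "fake quantization" sketch in NumPy. It only illustrates the baseline operation that weight-only LLM quantizers start from; OmniQuant's actual contributions (learnable weight clipping and equivalent transformations) are not shown, and none of these names come from the repo's API.

    # Generic weight-only fake quantization: quantize to low-bit integers per
    # output channel, then dequantize. Illustrative only; not OmniQuant's API.
    import numpy as np

    def fake_quantize_per_channel(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
        qmax = 2 ** (n_bits - 1) - 1                       # e.g. 7 for int4
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # one scale per row
        scale = np.maximum(scale, 1e-8)                    # avoid division by zero
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
        return q * scale                                   # dequantized weights

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 16)).astype(np.float32)    # (out_features, in_features)
    w_q = fake_quantize_per_channel(w, n_bits=4)
    print("max abs quantization error:", np.abs(w - w_q).max())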

ScaleCUA

ScaleCUA is an open-source family of computer-use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).

Language: Python · License: Apache-2.0 · Stargazers: 804 · Issues: 0

VideoChat-Flash

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

Language: Python · License: MIT · Stargazers: 478 · Issues: 72

OmniCorpus

[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

PonderV2

[T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

Language: Python · License: MIT · Stargazers: 363 · Issues: 30

EfficientQAT

[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

VideoChat-R1

[NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning

EgoVideo

[CVPR 2024 Champions][ICLR 2025] Solutions for the EgoVis Challenges at CVPR 2024

Language: Jupyter Notebook · Stargazers: 131 · Issues: 20

GUI-Odyssey

[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. It consists of 8,834 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 212 apps, and 1.4K app combinations.

PIIP

[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)

Language: Python · License: MIT · Stargazers: 105 · Issues: 5

ZeroGUI

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Language: Python · License: Apache-2.0 · Stargazers: 100 · Issues: 14

Mono-InternVL

[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Language: Python · License: MIT · Stargazers: 91 · Issues: 8

VeBrain

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Language: Python · License: MIT · Stargazers: 83 · Issues: 0

MUTR

[AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Language: Python · License: MIT · Stargazers: 82 · Issues: 8

EgoExoLearn

[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset

Language: Python · License: MIT · Stargazers: 70 · Issues: 10

SDLM

Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high efficiency and throughput.

Language: Python · License: MIT · Stargazers: 68 · Issues: 4

LORIS

[ICML 2023] Long-Term Rhythmic Video Soundtracker

Language: Python · License: MIT · Stargazers: 60 · Issues: 7

TPO

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Language: Jupyter Notebook · Stargazers: 60 · Issues: 3

PVC

[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Language: Python · License: MIT · Stargazers: 50 · Issues: 4

GenExam

GenExam: A Multidisciplinary Text-to-Image Exam

Language: Python · License: MIT · Stargazers: 39 · Issues: 0

Docopilot

[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

Language: Python · License: MIT · Stargazers: 35 · Issues: 4

FluxViT

Make Your Training Flexible: Towards Deployment-Efficient Video Models

Language: Python · License: MIT · Stargazers: 33 · Issues: 1

Vlaser

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Language: Python · License: MIT · Stargazers: 26 · Issues: 0

VRBench

[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Language: Python · License: Apache-2.0 · Stargazers: 21 · Issues: 1

SID-VLN

Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale

Language: Python · License: MIT · Stargazers: 8 · Issues: 0