Repositories under the mllm topic:
Agent S: an open agentic framework that uses computers like a human
Mobile-Agent: The Powerful GUI Agent Family
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
🚀🚀🚀 A collection of awesome public YOLO object detection series projects and related object detection datasets.
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model.
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
🚀🚀🚀 A collection of awesome public projects on Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA) models, AI-Generated Content (AIGC), and related datasets and applications.
[NeurIPS 2025] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[NeurIPS 2025] 4KAgent: Agentic Any Image to 4K Super-Resolution. An intelligent computer vision agent that can magically restore any image to pristine 4K!
Fully Open Framework for Democratized Multimodal Training
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
A repository listing papers on scene graph generation and its applications.
Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24 GB.
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV 2024 Oral
EVE Series: Encoder-Free Vision-Language Models from BAAI
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks