There are 24 repositories under the mllm topic.
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Reasoning in Large Language Models: Papers and Resources, including Chain-of-Thought and OpenAI o1 🍓
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models. The first work to correct hallucinations in MLLMs.
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant' (ECCV2024 Oral)
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLMs). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundation models, and more. Stay updated with the latest advancements.
EVE: Encoder-Free Vision-Language Models
Official code for the paper "Mantis: Multi-Image Instruction Tuning"
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating GUI interaction and function verification.
[CVPR2024] Generative Region-Language Pretraining for Open-Ended Object Detection
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust)
mPLUG-HalOwl: Multimodal Hallucination Evaluation and Mitigation
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
Undergraduate Dissertation from Guilin University of Electronic Technology