Repositories under the mllm topic:
Agent S: an open agentic framework that uses computers like a human
Mobile-Agent: The Powerful GUI Agent Family
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
🚀🚀🚀 A collection of awesome public YOLO object detection series projects and related object detection datasets.
Official Repo For "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos"
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model.
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
🚀🚀🚀 A collection of awesome public projects on Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA) models, AI-Generated Content (AIGC), and related datasets and applications.
[NeurIPS 2025] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
[NeurIPS 2025] 4KAgent: Agentic Any Image to 4K Super-Resolution. An intelligent computer vision agent that can magically restore any image to pristine 4K!
Fully Open Framework for Democratized Multimodal Training
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
A repository listing papers on scene graph generation and its applications.
Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-style MLLM on an RTX 3090/4090 with 24 GB.
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
This project is the official implementation of 'LLMGA: Multimodal Large Language Model based Generation Assistant', ECCV 2024 Oral
EVE Series: Encoder-Free Vision-Language Models from BAAI
Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLM). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks