There are 33 repositories under the vision-language-model topic.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
The official repo of Qwen-VL (通义千问-VL), the chat and pretrained large vision-language model proposed by Alibaba Cloud.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Align Anything: Training All-modality Models with Feedback
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Collection of AWESOME vision-language models for vision tasks
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on Linear Attention.
🚀 Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour! 🌏
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents in tackling any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX (see the usage sketch after this list).
日本語LLMまとめ - Overview of Japanese LLMs
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Famous Vision Language Models and Their Architectures
A curated list of 3D Vision papers related to the Robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites.
Highly Performant, Modular and Production-ready Inference, Ingestion and Indexing built in Rust 🦀
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.
The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
A curated list of awesome knowledge-driven autonomous driving (continually updated)
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.
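As a taste of how one of these packages is typically used, here is a minimal inference sketch for MLX-VLM (listed above). It assumes the `mlx_vlm` Python package exposes `load` and `generate` helpers and that a quantized community checkpoint such as `mlx-community/Qwen2-VL-2B-Instruct-4bit` is available; treat the exact model name, argument order, and keyword names as assumptions that may differ across package versions.

```python
# Minimal MLX-VLM inference sketch (assumption: the `load`/`generate` helpers,
# their keyword arguments, and the checkpoint name below may vary by version).
from mlx_vlm import load, generate

# Hypothetical example checkpoint converted for MLX; substitute any supported VLM.
MODEL_PATH = "mlx-community/Qwen2-VL-2B-Instruct-4bit"

# Load the model weights and the matching processor (tokenizer + image preprocessor).
model, processor = load(MODEL_PATH)

# Run image-conditioned text generation: one image plus a text prompt.
output = generate(
    model,
    processor,
    prompt="Describe this image in one sentence.",
    image=["example.jpg"],  # placeholder local path; a URL may also work
    max_tokens=128,
)
print(output)
```

The same pattern (load a checkpoint plus processor, then call a single generation helper with an image and a prompt) applies, with different APIs, to several of the other inference-oriented repositories in this list.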