Repositories under the large-vision-language-models topic:
✨✨ Latest Advances on Multimodal Large Language Models
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"
[ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation (a minimal sketch of the decomposition follows this list)
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D, and audio).
A paper list on large multi-modality models (perception, generation, unification), parameter-efficient fine-tuning, vision-language pretraining, and conventional image-text matching, for preliminary insight.
Curated papers on Large Language Models in the healthcare and medical domain
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
A curated list of recent and past chart understanding work based on our IEEE TKDE survey paper: From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models.
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"
An up-to-date curated list of state-of-the-art research, papers, and resources on hallucinations in large vision-language models
A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.
GeoPixel: a pixel-grounding large multimodal model developed specifically for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
[ACM Multimedia 2025] The official repo for Debiasing Large Visual Language Models, including a post-hoc debiasing method and a Visual Debias Decoding strategy.
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
✨ A curated list of papers on uncertainty in multi-modal large language models (MLLMs).
This repository is the codebase for "TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy"
A benchmark for evaluating the capabilities of large vision-language models (LVLMs)
Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Official PyTorch implementation of the paper "AnyAnomaly"
Awesome Large Vision-Language Model: A Curated List of Large Vision-Language Models
🚀 Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
[NeurIPS 2024] Official Repository of Multi-Object Hallucination in Vision-Language Models
[CVPR 2025] Implementation of "Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models"
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models" (a generic token-reduction sketch follows this list)
Code and data for the ACL 2024 Findings paper "Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning"
LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently (ICML 2025 Oral; see the initialization sketch after this list)
Latest Advances on Modality Priors in Multimodal Large Language Models
[CVPR 2024 CVinW] Multi-Agent VQA: Exploring Multi-Agent Foundation Models on Zero-Shot Visual Question Answering
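For the DoRA entry above: the name describes decomposing a pretrained weight into a magnitude vector and a direction matrix, with the low-rank update applied only to the direction. Below is a minimal PyTorch sketch of that decomposition; the class and attribute names are illustrative and not the official repository's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinearSketch(nn.Module):
    """Minimal sketch of weight-decomposed low-rank adaptation (DoRA).

    W' = m * (W0 + B @ A) / ||W0 + B @ A||_col, where only m, A, and B are trained.
    Names and initialization constants are illustrative, not the official code.
    """

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        out_f, in_f = base.weight.shape
        # Frozen pretrained weight W0 and its (kept as-is) bias.
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.bias = base.bias
        # Magnitude vector m: one scalar per column of W0, initialized to the column norms.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=0))
        # Low-rank direction update; B starts at zero so the initial weight equals W0.
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        directional = self.weight + self.lora_B @ self.lora_A
        # Normalize each column, then rescale by the learned magnitude.
        column_norm = directional.norm(p=2, dim=0, keepdim=True)
        adapted = self.magnitude * directional / column_norm
        return F.linear(x, adapted, self.bias)
```

Only the magnitude vector and the two low-rank factors receive gradients; because `lora_B` starts at zero, the adapted weight equals the frozen base weight at initialization.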
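For the compact-vision-tokens entry above: the general motivation is that a large multimodal model spends much of its compute on the long sequence of vision tokens fed into the LLM, so reducing that sequence speeds up inference. The snippet below only illustrates that broader idea with simple average pooling over adjacent tokens; it is not the paper's method, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pool_vision_tokens(tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Generic vision-token reduction by average-pooling adjacent tokens.

    tokens: (batch, num_tokens, hidden) output of a vision encoder/projector.
    Returns roughly num_tokens / stride tokens. Illustrative only; a learned
    compression module would differ.
    """
    x = tokens.transpose(1, 2)                              # (batch, hidden, num_tokens)
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride, ceil_mode=True)
    return x.transpose(1, 2)                                # (batch, ~num_tokens/stride, hidden)

# Example: 576 patch tokens reduced to 144 before they reach the language model.
vision_tokens = torch.randn(1, 576, 1024)
print(pool_vision_tokens(vision_tokens).shape)              # torch.Size([1, 144, 1024])
```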
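For the LoRA-One entry above: the title suggests that a single full-gradient step carries enough information to set up the low-rank adapter. The sketch below shows one way to use that signal, initializing LoRA factors from a truncated SVD of a one-step full gradient of the frozen weight; treat it as a reading of the title rather than the repository's exact algorithm, and note that the function name, factor split, and scaling are assumptions.

```python
import torch

def spectral_init_from_gradient(grad: torch.Tensor, rank: int, scale: float = 1.0):
    """Hypothetical sketch: initialize low-rank factors (B, A) from the top-r
    singular subspace of a one-step full gradient G of a frozen weight W0.

    Generic spectral initialization consistent with the LoRA-One title,
    not necessarily the repository's procedure.
    """
    # Truncated SVD of the full gradient (out_features x in_features).
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
    # Split the singular values between the factors so that B @ A ≈ -scale * G_r,
    # i.e. the adapter starts out along a rank-r gradient-descent direction.
    B = -scale * U_r * S_r.sqrt()           # (out_features, rank)
    A = S_r.sqrt().unsqueeze(1) * Vh_r      # (rank, in_features)
    return B, A
```

In this sketch, `grad` would be the full-batch gradient of the frozen weight after one forward/backward pass over the fine-tuning data, and the returned factors would replace the usual random/zero LoRA initialization.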