There are 35 repositories listed under the vision-language-models topic.
EVE Series: Encoder-Free Vision-Language Models from BAAI
Official Implementation for "MyVLM: Personalizing VLMs for User-Specific Queries" (ECCV 2024)
This repo is a live list of papers on game playing and large multimodal models, accompanying "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges".
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Up-to-date curated list of state-of-the-art research work, papers, and resources on hallucinations in large vision-language models.
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing, developed specifically for high-resolution remote sensing image analysis with advanced multi-target pixel grounding capabilities.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
[ICASSP 2025] Open-source code for the paper "Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification"
[ICLR 2024 Spotlight 🔥] [Best Paper Award, SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
[CVPR 2025 Highlight] Official PyTorch codebase for the paper "Assessing and Learning Alignment of Unimodal Vision and Language Models"
[ICML 2024] Official code repo for the ICML 2024 paper "Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data"
This is an official repository for "Harnessing Vision Models for Time Series Analysis: A Survey".
[NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs
Official code for "Can We Talk Models Into Seeing the World Differently?" (ICLR 2025).
Awesome Vision-Language Compositionality: a comprehensive curation of research papers from the literature.
[EMNLP 2024] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
This is a curated list of "Continual Learning with Pretrained Models" research.
This is an official implementation of our work, Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models, accepted to ECCV'24
[CVPR 2025] Official implementation of the paper "Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis"
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
[ICLR 2025 Oral] Official Implementation for "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities"
Official repo of the paper "Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models"
PicQ: Demo for MiniCPM-o 2.6 to answer questions about images using natural language.
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition – EMNLP 2024 (Findings)
Streamlit App Combining Vision, Language, and Audio AI Models
TOCFL-MultiBench: A multimodal benchmark for evaluating Chinese language proficiency using text, audio, and visual data with deep learning. Features Selective Token Constraint Mechanism (STCM) for enhanced decoding stability.
VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.
Code for "Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-Training" [IJCV 2024] and "Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation" [ICCV 2023].
VidiQA: Demo for MiniCPM-V 2.6 to answer questions about videos using natural language.
This project explores the use of large foundation vision-language models in reinforcement learning, where the models function as agents, reward functions, or reward-function code generators in unseen environments, given a state and a goal (see the reward-function sketch after this list).
ScreenGPT is a project that leverages an LLM to understand screen content. It generates responses based on user-defined prompts and the current screen content. You need an OpenAI-compatible API key to use this software (see the screen-capture sketch after this list).
Code release for THRONE, a CVPR 2024 paper on measuring object hallucinations in LVLM-generated text.
An innovative mixed reality (MR) pipeline that integrates real-time instance segmentation and speech-guided natural language interaction. It aims to create a more intuitive and immersive experience for users interacting with virtual and real-world environments.
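For the reinforcement-learning project above, the following is a minimal sketch of the "VLM as reward function" pattern. The environment interface and the query_vlm helper are illustrative assumptions, not the project's actual APIs; any model or endpoint could be plugged in.

```python
# Minimal sketch (assumed interfaces, not the project's code): a wrapper that
# replaces an environment's native reward with a score from a vision-language
# model judging progress toward a natural-language goal.
from typing import Any, Callable, Tuple


def query_vlm(observation: Any, goal: str) -> float:
    """Placeholder for a real VLM call (API or local model) that returns a
    scalar rating of how well the observation matches the goal."""
    return 0.0  # neutral placeholder score


class VLMRewardWrapper:
    """Gym-style wrapper: discards the native reward and uses the VLM score."""

    def __init__(self, env: Any, goal: str,
                 score_fn: Callable[[Any, str], float] = query_vlm) -> None:
        self.env = env
        self.goal = goal
        self.score_fn = score_fn

    def reset(self):
        return self.env.reset()

    def step(self, action) -> Tuple[Any, float, bool, dict]:
        obs, _native_reward, done, info = self.env.step(action)
        reward = self.score_fn(obs, self.goal)  # the VLM acts as the reward function
        return obs, reward, done, info
```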
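For ScreenGPT-style screen understanding, the sketch below shows the general pattern of sending a screenshot plus a user-defined prompt to an OpenAI-compatible chat endpoint. The model name, base URL, and helper function are assumptions for illustration; the repository's own implementation may differ.

```python
# Hypothetical sketch (not ScreenGPT's actual code): capture the screen and ask
# an OpenAI-compatible vision model about it with a user-defined prompt.
import base64
import io

from openai import OpenAI   # pip install openai
from PIL import ImageGrab   # pip install pillow

# Point the client at any OpenAI-compatible server; values here are placeholders.
client = OpenAI(base_url="https://api.openai.com/v1", api_key="YOUR_API_KEY")


def describe_screen(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Capture the screen, encode it as base64, and query the model."""
    screenshot = ImageGrab.grab()          # full-screen capture
    buffer = io.BytesIO()
    screenshot.save(buffer, format="PNG")
    image_b64 = base64.b64encode(buffer.getvalue()).decode("ascii")

    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(describe_screen("Summarize what is currently on my screen."))
```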