LLM-in-Vision

Recent LLM (Large Language Model)-based CV and multi-modal works. Welcome to comment/contribute!

2023.11

  • (arXiv 2023.11) DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback, [Paper], [Code]

  • (arXiv 2023.11) GAIA: A Benchmark for General AI Assistants, [Paper], [Project]

  • (arXiv 2023.11) PG-Video-LLaVA: Pixel Grounding Large Video-Language Models, [Paper], [Code]

  • (arXiv 2023.11) Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge, [Paper]

  • (arXiv 2023.11) AN EMBODIED GENERALIST AGENT IN 3D WORLD, [Paper], [Project]

  • (arXiv 2023.11) ShareGPT4V: Improving Large Multi-Modal Models with Better Captions, [Paper], [Project]

  • (arXiv 2023.11) KNVQA: A Benchmark for evaluation knowledge-based VQA, [Paper]

  • (arXiv 2023.11) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning, [Paper], [Project]

  • (arXiv 2023.11) Boosting Audio-visual Zero-shot Learning with Large Language Models, [Paper], [Code]

  • (arXiv 2023.11) Few-Shot Classification & Segmentation Using Large Language Models Agent, [Paper]

  • (arXiv 2023.11) Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents, [Paper], [Code]

  • (arXiv 2023.11) VLM-Eval: A General Evaluation on Video Large Language Models, [Paper]

  • (arXiv 2023.11) LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions, [Paper], [Code]

  • (arXiv 2023.11) LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge, [Paper], [Project]

  • (arXiv 2023.11) Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding, [Paper], [Code]

  • (arXiv 2023.11) How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model, [Paper]

  • (arXiv 2023.11) Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models, [Paper], [Code]

  • (arXiv 2023.11) Towards Open-Ended Visual Recognition with Large Language Model, [Paper], [Code]

  • (arXiv 2023.11) Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models, [Paper], [Code]

  • (arXiv 2023.11) VILMA: A ZERO-SHOT BENCHMARK FOR LINGUISTIC AND TEMPORAL GROUNDING IN VIDEO-LANGUAGE MODELS, [Paper], [Project]

  • (arXiv 2023.11) VOLCANO: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision, [Paper], [Code]

  • (arXiv 2023.11) AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, [Paper], [Code]

  • (arXiv 2023.11) Analyzing Modular Approaches for Visual Question Decomposition, [Paper], [Code]

  • (arXiv 2023.11) LayoutPrompter: Awaken the Design Ability of Large Language Models, [Paper], [Code]

  • (arXiv 2023.11) PerceptionGPT: Effectively Fusing Visual Perception into LLM, [Paper]

  • (arXiv 2023.11) InfMLLM: A Unified Framework for Visual-Language Tasks, [Paper], [Code]

  • (arXiv 2023.11) WHAT LARGE LANGUAGE MODELS BRING TO TEXT-RICH VQA? [Paper]

  • (arXiv 2023.11) Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text, [Paper], [Project]

  • (arXiv 2023.11) GPT-4V(ision) as A Social Media Analysis Engine, [Paper], [Code]

  • (arXiv 2023.11) GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation, [Paper], [Code]

  • (arXiv 2023.11) To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning, [Paper], [Code]

  • (arXiv 2023.11) SPHINX: THE JOINT MIXING OF WEIGHTS, TASKS, AND VISUAL EMBEDDINGS FOR MULTI-MODAL LARGE LANGUAGE MODELS, [Paper], [Code]

  • (arXiv 2023.11) ADAPT: As-Needed Decomposition and Planning with Language Models, [Paper], [Project]

  • (arXiv 2023.11) JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models, [Paper], [Project]

  • (arXiv 2023.11) Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, [Paper]

  • (arXiv 2023.11) Multitask Multimodal Prompted Training for Interactive Embodied Task Completion, [Paper], [Code]

  • (arXiv 2023.11) TEAL: TOKENIZE AND EMBED ALL FOR MULTIMODAL LARGE LANGUAGE MODELS, [Paper]

  • (arXiv 2023.11) u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model, [Paper]

  • (arXiv 2023.11) LLAVA-PLUS: LEARNING TO USE TOOLS FOR CREATING MULTIMODAL AGENTS, [Paper], [Project]

  • (arXiv 2023.11) Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models, [Paper], [Code]

  • (arXiv 2023.11) OtterHD: A High-Resolution Multi-modality Model, [Paper], [Code]

  • (arXiv 2023.11) NExT-Chat: An LMM for Chat, Detection and Segmentation, [Paper], [Project]

  • (arXiv 2023.11) GENOME: GENERATIVE NEURO-SYMBOLIC VISUAL REASONING BY GROWING AND REUSING MODULES, [Paper], [Project]

  • (arXiv 2023.11) MAKE A DONUT: LANGUAGE-GUIDED HIERARCHICAL EMD-SPACE PLANNING FOR ZERO-SHOT DEFORMABLE OBJECT MANIPULATION, [Paper]

  • (arXiv 2023.11) Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs, [Paper], [Code]

  • (arXiv 2023.11) Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models, [Paper]

  • (arXiv 2023.11) ROBOGEN: TOWARDS UNLEASHING INFINITE DATA FOR AUTOMATED ROBOT LEARNING VIA GENERATIVE SIMULATION, [Paper]

2023.10

  • (arXiv 2023.10) MINIGPT-5: INTERLEAVED VISION-AND-LANGUAGE GENERATION VIA GENERATIVE VOKENS, [Paper], [Code]

  • (arXiv 2023.10) What’s “up” with vision-language models? Investigating their struggle with spatial reasoning, [Paper], [Code]

  • (arXiv 2023.10) APOLLO: ZERO-SHOT MULTIMODAL REASONING WITH MULTIPLE EXPERTS, [Paper], [Code]

  • (arXiv 2023.10) ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense, [Paper]

  • (arXiv 2023.10) Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models, [Paper], [Project]

  • (arXiv 2023.10) LARGE LANGUAGE MODELS AS GENERALIZABLE POLICIES FOR EMBODIED TASKS, [Paper], [Project]

  • (arXiv 2023.10) Humanoid Agents: Platform for Simulating Human-like Generative Agents, [Paper], [Project]

  • (arXiv 2023.10) REVO-LION: EVALUATING AND REFINING VISION-LANGUAGE INSTRUCTION TUNING DATASETS, [Paper], [Code]

  • (arXiv 2023.10) How (not) to ensemble LVLMs for VQA, [Paper]

  • (arXiv 2023.10) What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models, [Paper], [Code]

  • (arXiv 2023.10) Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement, [Paper], [Code]

  • (arXiv 2023.10) GameGPT: Multi-agent Collaborative Framework for Game Development, [Paper]

  • (arXiv 2023.10) STEVE-EYE: EQUIPPING LLM-BASED EMBODIED AGENTS WITH VISUAL PERCEPTION IN OPEN WORLDS, [Paper]

  • (arXiv 2023.10) BENCHMARKING SEQUENTIAL VISUAL INPUT REASONING AND PREDICTION IN MULTIMODAL LARGE LANGUAGE MODELS, [Paper], [Code]

  • (arXiv 2023.10) A Simple Baseline for Knowledge-Based Visual Question Answering, [Paper], [Code]

  • (arXiv 2023.10) Interactive Robot Learning from Verbal Correction, [Paper], [Project]

  • (arXiv 2023.10) Exploring Question Decomposition for Zero-Shot VQA, [Paper], [Project]

  • (arXiv 2023.10) RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments, [Paper], [Project]

  • (arXiv 2023.10) An Early Evaluation of GPT-4V(ision), [Paper], [Code]

  • (arXiv 2023.10) DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models, [Paper], [Project]

  • (arXiv 2023.10) CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images, [Paper], [Code]

  • (arXiv 2023.10) VIDEOPROMPTER: AN ENSEMBLE OF FOUNDATIONAL MODELS FOR ZERO-SHOT VIDEO UNDERSTANDING, [Paper]

  • (arXiv 2023.10) Inject Semantic Concepts into Image Tagging for Open-Set Recognition, [Paper], [Code]

  • (arXiv 2023.10) Woodpecker: Hallucination Correction for Multimodal Large Language Models, [Paper], [Code]

  • (arXiv 2023.10) Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models, [Paper], [Code]

  • (arXiv 2023.10) Large Language Models are Temporal and Causal Reasoners for Video Question Answering, [Paper], [Code]

  • (arXiv 2023.10) What’s Left? Concept Grounding with Logic-Enhanced Foundation Models, [Paper]

  • (arXiv 2023.10) Evaluating Spatial Understanding of Large Language Models, [Paper]

  • (arXiv 2023.10) Learning Reward for Physical Skills using Large Language Model, [Paper]

  • (arXiv 2023.10) CREATIVE ROBOT TOOL USE WITH LARGE LANGUAGE MODELS, [Paper], [Project]

  • (arXiv 2023.10) Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models, [Paper], [Project]

  • (arXiv 2023.10) Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning, [Paper], [Project]

  • (arXiv 2023.10) LARGE LANGUAGE MODELS CAN SHARE IMAGES, TOO! [Paper], [Code]

  • (arXiv 2023.10) Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond, [Paper]

  • (arXiv 2023.10) HALLUSIONBENCH: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models, [Paper], [Code]

  • (arXiv 2023.10) Can Language Models Laugh at YouTube Short-form Videos? [Paper], [Code]

  • (arXiv 2023.10) Large Language Models are Visual Reasoning Coordinators, [Paper], [Code]

  • (arXiv 2023.10) Language Models as Zero-Shot Trajectory Generators, [Paper], [Project]

  • (arXiv 2023.10) Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge, [Paper], [Code]

  • (arXiv 2023.10) Multimodal Large Language Model for Visual Navigation, [Paper]

  • (arXiv 2023.10) MAKING MULTIMODAL GENERATION EASIER: WHEN DIFFUSION MODELS MEET LLMS, [Paper], [Code]

  • (arXiv 2023.10) Open X-Embodiment: Robotic Learning Datasets and RT-X Models, [Paper], [Project]

  • (arXiv 2023.10) Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT, [Paper], [Code]

  • (arXiv 2023.10) Lost in Translation: When GPT-4V(ision) Can’t See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond, [Paper]

  • (arXiv 2023.10) Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models, [Paper]

  • (arXiv 2023.10) VLIS: Unimodal Language Models Guide Multimodal Language Generation, [Paper], [Code]

  • (arXiv 2023.10) CLIN: A CONTINUALLY LEARNING LANGUAGE AGENT FOR RAPID TASK ADAPTATION AND GENERALIZATION, [Paper], [Project]

  • (arXiv 2023.10) Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning, [Paper]

  • (arXiv 2023.10) FROZEN TRANSFORMERS IN LANGUAGE MODELS ARE EFFECTIVE VISUAL ENCODER LAYERS, [Paper], [Code]

  • (arXiv 2023.10) CLAIR: Evaluating Image Captions with Large Language Models, [Paper], [Project]

  • (arXiv 2023.10) 3D-GPT: PROCEDURAL 3D MODELING WITH LARGE LANGUAGE MODELS, [Paper], [Project]

  • (arXiv 2023.10) Automated Natural Language Explanation of Deep Visual Neurons with Large Models, [Paper]

  • (arXiv 2023.10) Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, [Paper], [Project]

  • (arXiv 2023.10) EvalCrafter: Benchmarking and Evaluating Large Video Generation Models, [Paper], [Project]

  • (arXiv 2023.10) MISAR: A MULTIMODAL INSTRUCTIONAL SYSTEM WITH AUGMENTED REALITY, [Paper], [Code]

  • (arXiv 2023.10) NON-INTRUSIVE ADAPTATION: INPUT-CENTRIC PARAMETER-EFFICIENT FINE-TUNING FOR VERSATILE MULTIMODAL MODELING, [Paper]

  • (arXiv 2023.10) LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation, [Paper], [Project]

  • (arXiv 2023.10) ChatGPT-guided Semantics for Zero-shot Learning, [Paper]

  • (arXiv 2023.10) On the Benefit of Generative Foundation Models for Human Activity Recognition, [Paper]

  • (arXiv 2023.10) DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning, [Paper], [Project]

  • (arXiv 2023.10) Interactive Task Planning with Language Models, [Paper], [Project]

  • (arXiv 2023.10) Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance, [Paper], [Project]

  • (arXiv 2023.10) Penetrative AI: Making LLMs Comprehend the Physical World, [Paper]

  • (arXiv 2023.10) BONGARD-OPENWORLD: FEW-SHOT REASONING FOR FREE-FORM VISUAL CONCEPTS IN THE REAL WORLD, [Paper], [Project]

  • (arXiv 2023.10) ViPE: Visualise Pretty-much Everything, [Paper]

  • (arXiv 2023.10) MINIGPT-V2: LARGE LANGUAGE MODEL AS A UNIFIED INTERFACE FOR VISION-LANGUAGE MULTITASK LEARNING, [Paper], [Project]

  • (arXiv 2023.10) MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations, [Paper]

  • (arXiv 2023.10) LLM BLUEPRINT: ENABLING TEXT-TO-IMAGE GENERATION WITH COMPLEX AND DETAILED PROMPTS, [Paper]

  • (arXiv 2023.10) VIDEO LANGUAGE PLANNING, [Paper], [Project]

  • (arXiv 2023.10) Dobby: A Conversational Service Robot Driven by GPT-4, [Paper]

  • (arXiv 2023.10) CoPAL: Corrective Planning of Robot Actions with Large Language Models, [Paper]

  • (arXiv 2023.10) Forgetful Large Language Models: Lessons Learned from Using LLMs in Robot Programming, [Paper]

  • (arXiv 2023.10) TREE-PLANNER: EFFICIENT CLOSE-LOOP TASK PLANNING WITH LARGE LANGUAGE MODELS, [Paper], [Project]

  • (arXiv 2023.10) TOWARDS ROBUST MULTI-MODAL REASONING VIA MODEL SELECTION, [Paper], [Code]

  • (arXiv 2023.10) FERRET: REFER AND GROUND ANYTHING ANYWHERE AT ANY GRANULARITY, [Paper], [Code]

  • (arXiv 2023.10) FROM SCARCITY TO EFFICIENCY: IMPROVING CLIP TRAINING VIA VISUAL-ENRICHED CAPTIONS, [Paper]

  • (arXiv 2023.10) OPENLEAF: OPEN-DOMAIN INTERLEAVED IMAGE-TEXT GENERATION AND EVALUATION, [Paper]

  • (arXiv 2023.10) Can We Edit Multimodal Large Language Models? [Paper], [Code]

  • (arXiv 2023.10) VISUAL DATA-TYPE UNDERSTANDING DOES NOT EMERGE FROM SCALING VISION-LANGUAGE MODELS, [Paper], [Code]

  • (arXiv 2023.10) Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation, [Paper], [Project]

  • (arXiv 2023.10) OCTOPUS: EMBODIED VISION-LANGUAGE PROGRAMMER FROM ENVIRONMENTAL FEEDBACK, [Paper], [Project]

2023.9

  • (arXiv 2023.9) DynaCon: Dynamic Robot Planner with Contextual Awareness via LLMs, [Paper], [Project]

  • (arXiv 2023.9) AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model, [Paper]

  • (arXiv 2023.9) ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning, [Paper], [Project]

  • (arXiv 2023.9) LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement, [Paper], [Code]

  • (arXiv 2023.9) ONE FOR ALL: VIDEO CONVERSATION IS FEASIBLE WITHOUT VIDEO INSTRUCTION TUNING, [Paper]

  • (arXiv 2023.9) Verifiable Learned Behaviors via Motion Primitive Composition: Applications to Scooping of Granular Media, [Paper]

  • (arXiv 2023.9) Human-Assisted Continual Robot Learning with Foundation Models, [Paper], [Project]

  • (arXiv 2023.9) InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition, [Paper], [Code]

  • (arXiv 2023.9) VIDEODIRECTORGPT: CONSISTENT MULTI-SCENE VIDEO GENERATION VIA LLM-GUIDED PLANNING, [Paper], [Project]

  • (arXiv 2023.9) Text-to-Image Generation for Abstract Concepts, [Paper]

  • (arXiv 2023.9) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator, [Paper], [Code]

  • (arXiv 2023.9) ALIGNING LARGE MULTIMODAL MODELS WITH FACTUALLY AUGMENTED RLHF, [Paper], [Project]

  • (arXiv 2023.9) Self-Recovery Prompting: Promptable General Purpose Service Robot System with Foundation Models and Self-Recovery, [Paper], [Project]

  • (arXiv 2023.9) Q-BENCH: A BENCHMARK FOR GENERAL-PURPOSE FOUNDATION MODELS ON LOW-LEVEL VISION, [Paper]

  • (arXiv 2023.9) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, [Paper], [Code]

  • (arXiv 2023.9) LMC: Large Model Collaboration with Cross-assessment for Training-Free Open-Set Object Recognition, [Paper], [Code]

  • (arXiv 2023.9) LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent, [Paper], [Project]

  • (arXiv 2023.9) Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, [Paper], [Code]

  • (arXiv 2023.9) STRUCTCHART: PERCEPTION, STRUCTURING, REASONING FOR VISUAL CHART UNDERSTANDING, [Paper]

  • (arXiv 2023.9) DREAMLLM: SYNERGISTIC MULTIMODAL COMPREHENSION AND CREATION, [Paper], [Project]

  • (arXiv 2023.9) A LARGE-SCALE DATASET FOR AUDIO-LANGUAGE REPRESENTATION LEARNING, [Paper], [Project]

  • (arXiv 2023.9) YOU ONLY LOOK AT SCREENS: MULTIMODAL CHAIN-OF-ACTION AGENTS, [Paper], [Code]

  • (arXiv 2023.9) SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models, [Paper], [Project]

  • (arXiv 2023.9) Conformal Temporal Logic Planning using Large Language Models: Knowing When to Do What and When to Ask for Help, [Paper], [Project]

  • (arXiv 2023.9) Investigating the Catastrophic Forgetting in Multimodal Large Language Models, [Paper]

  • (arXiv 2023.9) Specification-Driven Video Search via Foundation Models and Formal Verification, [Paper]

  • (arXiv 2023.9) Language as the Medium: Multimodal Video Classification through text only, [Paper]

  • (arXiv 2023.9) Multimodal Foundation Models: From Specialists to General-Purpose Assistants, [Paper]

  • (arXiv 2023.9) TEXTBIND: Multi-turn Interleaved Multimodal Instruction-following, [Paper], [Project]

  • (arXiv 2023.9) Prompt a Robot to Walk with Large Language Models, [Paper], [Project]

  • (arXiv 2023.9) Grasp-Anything: Large-scale Grasp Dataset from Foundation Models, [Paper], [Project]

  • (arXiv 2023.9) MMICL: EMPOWERING VISION-LANGUAGE MODEL WITH MULTI-MODAL IN-CONTEXT LEARNING, [Paper], [Code]

  • (arXiv 2023.9) SwitchGPT: Adapting Large Language Models for Non-Text Outputs, [Paper], [Code]

  • (arXiv 2023.9) UNIFIED HUMAN-SCENE INTERACTION VIA PROMPTED CHAIN-OF-CONTACTS, [Paper], [Code]

  • (arXiv 2023.9) Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models, [Paper]

  • (arXiv 2023.9) NExT-GPT: Any-to-Any Multimodal LLM, [Paper], [Project]

  • (arXiv 2023.9) Multi3DRefer: Grounding Text Description to Multiple 3D Objects, [Paper], [Project]

  • (arXiv 2023.9) Language Models as Black-Box Optimizers for Vision-Language Models, [Paper]

  • (arXiv 2023.9) Evaluation and Mitigation of Agnosia in Multimodal Large Language Models, [Paper]

  • (arXiv 2023.9) Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models, [Paper], [Code]

  • (arXiv 2023.9) Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment, [Paper]

  • (arXiv 2023.9) ImageBind-LLM: Multi-modality Instruction Tuning, [Paper], [Code]

  • (arXiv 2023.9) Developmental Scaffolding with Large Language Models, [Paper]

  • (arXiv 2023.9) Gesture-Informed Robot Assistance via Foundation Models, [Paper], [Project]

  • (arXiv 2023.9) Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging, [Paper]

  • (arXiv 2023.9) Large AI Model Empowered Multimodal Semantic Communications, [Paper]

  • (arXiv 2023.9) CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection, [Paper], [Project]

  • (arXiv 2023.9) Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning, [Paper]

  • (arXiv 2023.9) CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning, [Paper]

  • (arXiv 2023.9) Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following, [Paper], [Code]

2023.8

  • (arXiv 2023.8) Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images, [Paper], [Project]

  • (arXiv 2023.8) Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis, [Paper]

  • (arXiv 2023.8) Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models, [Paper], [Code]

  • (arXiv 2023.8) PointLLM: Empowering Large Language Models to Understand Point Clouds, [Paper], [Project]

  • (arXiv 2023.8) TouchStone: Evaluating Vision-Language Models by Language Models, [Paper], [Code]

  • (arXiv 2023.8) WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model, [Paper]

  • (arXiv 2023.8) ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning, [Paper], [Code]

  • (arXiv 2023.8) LLM-Based Human-Robot Collaboration Framework for Manipulation Tasks, [Paper]

  • (arXiv 2023.8) Evaluation and Analysis of Hallucination in Large Vision-Language Models, [Paper]

  • (arXiv 2023.8) MLLM-DataEngine: An Iterative Refinement Approach for MLLM, [Paper]

  • (arXiv 2023.8) Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models, [Paper]

  • (arXiv 2023.8) Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [Paper], [Code]

  • (arXiv 2023.8) VIGC: Visual Instruction Generation and Correction, [Paper]

  • (arXiv 2023.8) Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment, [Paper]

  • (arXiv 2023.8) Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities, [Paper], [Code]

  • (arXiv 2023.8) DIFFUSION LANGUAGE MODELS CAN PERFORM MANY TASKS WITH SCALING AND INSTRUCTION-FINETUNING, [Paper], [Code]

  • (arXiv 2023.8) CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images, [Paper], [Project]

  • (arXiv 2023.8) ProAgent: Building Proactive Cooperative AI with Large Language Models, [Paper], [Project]

  • (arXiv 2023.8) ROSGPT_Vision: Commanding Robots Using Only Language Models’ Prompts, [Paper], [Code]

  • (arXiv 2023.8) StoryBench: A Multifaceted Benchmark for Continuous Story Visualization, [Paper], [Code]

  • (arXiv 2023.8) Tackling Vision Language Tasks Through Learning Inner Monologues, [Paper]

  • (arXiv 2023.8) ExpeL: LLM Agents Are Experiential Learners, [Paper]

  • (arXiv 2023.8) On the Adversarial Robustness of Multi-Modal Foundation Models, [Paper]

  • (arXiv 2023.8) WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models, [Paper], [Project]

  • (arXiv 2023.8) March in Chat: Interactive Prompting for Remote Embodied Referring Expression, [Paper], [Code]

  • (arXiv 2023.8) BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions, [Paper], [Code]

  • (arXiv 2023.8) VIT-LENS: Towards Omni-modal Representations, [Paper], [Code]

  • (arXiv 2023.8) StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data, [Paper], [Project]

  • (arXiv 2023.8) PUMGPT: A Large Vision-Language Model for Product Understanding, [Paper]

  • (arXiv 2023.8) Link-Context Learning for Multimodal LLMs, [Paper], [Code]

  • (arXiv 2023.8) Detecting and Preventing Hallucinations in Large Vision Language Models, [Paper]

  • (arXiv 2023.8) VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use, [Paper], [Project]

  • (arXiv 2023.8) Foundation Model based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots, [Paper]

  • (arXiv 2023.8) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation, [Paper], [Project]

  • (arXiv 2023.8) OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation, [Paper]

  • (arXiv 2023.8) EMPOWERING VISION-LANGUAGE MODELS TO FOLLOW INTERLEAVED VISION-LANGUAGE INSTRUCTIONS, [Paper], [Code]

  • (arXiv 2023.8) 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment, [Paper], [Project]

  • (arXiv 2023.8) Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs, [Paper], [Project]

  • (arXiv 2023.8) AgentBench: Evaluating LLMs as Agents, [Paper], [Project]

  • (arXiv 2023.8) Learning Concise and Descriptive Attributes for Visual Recognition, [Paper]

  • (arXiv 2023.8) Tiny LVLM-eHub: Early Multimodal Experiments with Bard, [Paper], [Project]

  • (arXiv 2023.8) MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities, [Paper], [Code]

  • (arXiv 2023.8) RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension, [Paper], [Code]

  • (arXiv 2023.8) Learning to Model the World with Language, [Paper], [Project]

  • (arXiv 2023.8) The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World, [Paper], [Code]

  • (arXiv 2023.8) Multimodal Neurons in Pretrained Text-Only Transformers, [Paper], [Project]

  • (arXiv 2023.8) LISA: REASONING SEGMENTATION VIA LARGE LANGUAGE MODEL, [Paper], [Code]

2023.7

  • (arXiv 2023.7) Caption Anything: Interactive Image Description with Diverse Multimodal Controls, [Paper], [Code]

  • (arXiv 2023.7) DesCo: Learning Object Recognition with Rich Language Descriptions, [Paper]

  • (arXiv 2023.7) KOSMOS-2: Grounding Multimodal Large Language Models to the World, [Paper], [Project]

  • (arXiv 2023.7) MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models, [Paper], [Code]

  • (arXiv 2023.7) Evaluating ChatGPT and GPT-4 for Visual Programming, [Paper]

  • (arXiv 2023.7) SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension, [Paper], [Code]

  • (arXiv 2023.7) AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? [Paper], [Project]

  • (arXiv 2023.7) Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks, [Paper]

  • (arXiv 2023.7) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding, [Paper], [Project]

  • (arXiv 2023.7) Large Language Models as General Pattern Machines, [Paper], [Project]

  • (arXiv 2023.7) How Good is Google Bard’s Visual Understanding? An Empirical Study on Open Challenges, [Paper], [Project]

  • (arXiv 2023.7) RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, [Paper], [Project]

  • (arXiv 2023.7) Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition, [Paper], [Project]

  • (arXiv 2023.7) GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping, [Paper], [Project]

  • (arXiv 2023.7) CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots, [Paper]

  • (arXiv 2023.7) 3D-LLM: Injecting the 3D World into Large Language Models, [Paper], [Project]

  • (arXiv 2023.7) Generative Pretraining in Multimodality, [Paper], [Code]

  • (arXiv 2023.7) VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, [Paper], [Project]

  • (arXiv 2023.7) VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View, [Paper]

  • (arXiv 2023.7) SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning, [Paper], [Project]

  • (arXiv 2023.7) Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts, [Paper]

  • (arXiv 2023.7) InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, [Paper], [Data]

  • (arXiv 2023.7) MBLIP: EFFICIENT BOOTSTRAPPING OF MULTILINGUAL VISION-LLMS, [Paper], [Code]

  • (arXiv 2023.7) Bootstrapping Vision-Language Learning with Decoupled Language Pre-training, [Paper]

  • (arXiv 2023.7) BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs, [Paper], [Project]

  • (arXiv 2023.7) ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning, [Paper], [Project]

  • (arXiv 2023.7) TOWARDS A UNIFIED AGENT WITH FOUNDATION MODELS, [Paper]

  • (arXiv 2023.7) Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners, [Paper], [Project]

  • (arXiv 2023.7) Building Cooperative Embodied Agents Modularly with Large Language Models, [Paper], [Project]

  • (arXiv 2023.7) Embodied Task Planning with Large Language Models, [Paper], [Project]

  • (arXiv 2023.7) What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? [Paper], [Project]

  • (arXiv 2023.7) GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest, [Paper], [Code]

  • (arXiv 2023.7) JourneyDB: A Benchmark for Generative Image Understanding, [Paper], [Code]

  • (arXiv 2023.7) DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment, [Paper], [Project]

  • (arXiv 2023.7) Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset, [Paper], [Code]

  • (arXiv 2023.7) Visual Instruction Tuning with Polite Flamingo, [Paper], [Code]

  • (arXiv 2023.7) SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions, [Paper]

  • (arXiv 2023.7) SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs, [Paper], [Code]

  • (arXiv 2023.7) KITE: Keypoint-Conditioned Policies for Semantic Manipulation, [Paper], [Project]

2023.6

  • (arXiv 2023.6) LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark, [Paper], [Code]

  • (arXiv 2023.6) Scalable 3D Captioning with Pretrained Models, [Paper], [Code]

  • (arXiv 2023.6) AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers, [Paper], [Code]

  • (arXiv 2023.6) VALLEY: VIDEO ASSISTANT WITH LARGE LANGUAGE MODEL ENHANCED ABILITY, [Paper], [Code]

  • (arXiv 2023.6) Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots, [Paper]

  • (arXiv 2023.6) LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models, [Paper]

  • (arXiv 2023.6) AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn, [Paper], [Project]

  • (arXiv 2023.6) Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models, [Paper]

  • (arXiv 2023.6) MACAW-LLM: MULTI-MODAL LANGUAGE MODELING WITH IMAGE, AUDIO, VIDEO, AND TEXT INTEGRATION, [Paper], [Code]

  • (arXiv 2023.6) Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering, [Paper]

  • (arXiv 2023.6) Language to Rewards for Robotic Skill Synthesis, [Paper], [Project]

  • (arXiv 2023.6) Toward Grounded Social Reasoning, [Paper], [Code]

  • (arXiv 2023.6) Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion, [Paper], [Code]

  • (arXiv 2023.6) RM-PRT: Realistic Robotic Manipulation Simulator and Benchmark with Progressive Reasoning Tasks, [Paper], [Code]

  • (arXiv 2023.6) Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, [Paper], [Project]

  • (arXiv 2023.6) Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language, [Paper], [Code]

  • (arXiv 2023.6) LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding, [Paper], [Project]

  • (arXiv 2023.6) Statler: State-Maintaining Language Models for Embodied Reasoning, [Paper], [Project]

  • (arXiv 2023.6) CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents, [Paper]

  • (arXiv 2023.6) Mass-Producing Failures of Multimodal Systems with Language Models, [Paper], [Code]

  • (arXiv 2023.6) SoftGPT: Learn Goal-oriented Soft Object Manipulation Skills by Generative Pre-trained Heterogeneous Graph Transformer, [Paper]

  • (arXiv 2023.6) SPRINT: SCALABLE POLICY PRE-TRAINING VIA LANGUAGE INSTRUCTION RELABELING, [Paper], [Project]

  • (arXiv 2023.6) MotionGPT: Finetuned LLMs are General-Purpose Motion Generators, [Paper], [Project]

  • (arXiv 2023.6) MIMIC-IT: Multi-Modal In-Context Instruction Tuning, [Paper], [Code]

  • (arXiv 2023.6) Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models, [Paper]

2023.5

  • (arXiv 2023.5) Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering, [Paper], [Code]

  • (arXiv 2023.5) VIMA: General Robot Manipulation with Multimodal Prompts, [Paper], [Project]

  • (arXiv 2023.5) TidyBot: Personalized Robot Assistance with Large Language Models, [Paper], [Project]

  • (arXiv 2023.5) Training Diffusion Models with Reinforcement Learning, [Paper], [Project]

  • (arXiv 2023.5) EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought, [Paper], [Project]

  • (arXiv 2023.5) ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4, [Paper], [Code]

  • (arXiv 2023.5) Evaluating Object Hallucination in Large Vision-Language Models, [Paper], [Code]

  • (arXiv 2023.5) LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation, [Paper], [Code]

  • (arXiv 2023.5) VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, [Paper], [Code]

  • (arXiv 2023.5) OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding, [Paper], [Project]

  • (arXiv 2023.5) Towards A Foundation Model for Generalist Robots: Diverse Skill Learning at Scale via Automated Task and Scene Generation, [Paper]

  • (arXiv 2023.5) An Android Robot Head as Embodied Conversational Agent, [Paper]

  • (arXiv 2023.5) Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, [Paper], [Code]

  • (arXiv 2023.5) Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision, [Paper], [Project]

  • (arXiv 2023.5) Multimodal Procedural Planning via Dual Text-Image Prompting, [Paper], [Code]

  • (arXiv 2023.5) ArK: Augmented Reality with Knowledge Interactive Emergent Ability, [Paper]

2023.4

  • (arXiv 2023.4) LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model, [Paper], [Code]

  • (arXiv 2023.4) Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning, [Paper]

  • (arXiv 2023.4) mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, [Paper], [Code]

  • (arXiv 2023.4) ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System, [Paper], [Project]

  • (arXiv 2023.4) ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT, [Paper]

  • (arXiv 2023.4) Robot-Enabled Construction Assembly with Automated Sequence Planning based on ChatGPT: RoboGPT, [Paper]

  • (arXiv 2023.4) Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT, [Paper], [Code]

  • (arXiv 2023.4) Can GPT-4 Perform Neural Architecture Search? [Paper], [Code]

  • (arXiv 2023.4) MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, [Paper], [Project]

  • (arXiv 2023.4) SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation, [Paper], [Project]

  • (arXiv 2023.4) LLM as A Robotic Brain: Unifying Egocentric Memory and Control, [Paper]

  • (arXiv 2023.4) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models, [Paper], [Project]

  • (arXiv 2023.4) Visual Instruction Tuning, [Paper], [Project]

  • (arXiv 2023.4) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment, [Paper], [Code]

  • (arXiv 2023.4) Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text, [Paper], [Code]

  • (arXiv 2023.4) ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance, [Paper], [Code]

  • (arXiv 2023.4) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, [Paper], [Code]

  • (arXiv 2023.4) ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks, [Paper], [Code]

  • (arXiv 2023.4) Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT, [Paper]

  • (arXiv 2023.4) ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application, [Paper], [Code]

  • (arXiv 2023.4) OpenAGI: When LLM Meets Domain Experts, [Paper], [Code]

  • (arXiv 2023.4) Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions, [Paper], [Code]

2023.3

  • (arXiv 2023.3) Open-World Object Manipulation using Pre-Trained Vision-Language Models, [Paper], [Project]

  • (arXiv 2023.3) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control, [Paper], [Project]

  • (arXiv 2023.3) Task and Motion Planning with Large Language Models for Object Rearrangement, [Paper], [Project]

  • (arXiv 2023.3) RE-MOVE: An Adaptive Policy Design Approach for Dynamic Environments via Language-Based Feedback, [Paper], [Project]

  • (arXiv 2023.3) Chat with the Environment: Interactive Multimodal Perception using Large Language Models, [Paper]

  • (arXiv 2023.3) MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge, [Paper], [Code]

  • (arXiv 2023.3) DialogPaint: A Dialog-based Image Editing Model, [Paper]

  • (arXiv 2023.3) MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, [Paper], [Project]

  • (arXiv 2023.3) eP-ALM: Efficient Perceptual Augmentation of Language Models, [Paper], [Code]

  • (arXiv 2023.3) Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting, [Paper], [Project]

  • (arXiv 2023.3) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, [Paper], [Code]

  • (arXiv 2023.3) MULTIMODAL ANALOGICAL REASONING OVER KNOWLEDGE GRAPHS, [Paper], [Code]

  • (arXiv 2023.3) CAN LARGE LANGUAGE MODELS DESIGN A ROBOT? [Paper]

  • (arXiv 2023.3) Learning video embedding space with Natural Language Supervision, [Paper]

  • (arXiv 2023.3) Audio Visual Language Maps for Robot Navigation, [Paper], [Project]

  • (arXiv 2023.3) ViperGPT: Visual Inference via Python Execution for Reasoning, [Paper]

  • (arXiv 2023.3) ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions, [Paper], [Code]

  • (arXiv 2023.3) Can an Embodied Agent Find Your “Cat-shaped Mug”? LLM-Based Zero-Shot Object Navigation, [Paper], [Project]

  • (arXiv 2023.3) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, [Paper], [Code]

  • (arXiv 2023.3) PaLM-E: An Embodied Multimodal Language Model, [Paper], [Project]

  • (arXiv 2023.3) Language Is Not All You Need: Aligning Perception with Language Models, [Paper], [Code]

2023.2

  • (arXiv 2023.2) ChatGPT for Robotics: Design Principles and Model Abilities, [Paper], [Code]

  • (arXiv 2023.2) Internet Explorer: Targeted Representation Learning on the Open Web, [Paper], [Project]

2022.11

  • (arXiv 2022.11) Visual Programming: Compositional visual reasoning without training, [Paper], [Project]

2022.7

  • (arXiv 2022.7) Language Models are General-Purpose Interfaces, [Paper], [Code]

  • (arXiv 2022.7) LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, [Paper], [Project]
