jun0wanan / awesome-large-multimodal-agents

Awesome Large Multimodal Agents

Last update: 03/16/2024

Table of Contents


Papers

Taxonomy

Type Ⅰ

  • CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

  • CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

  • ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning

  • HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  • Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  • Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

  • M3 - Towards Robust Multi-Modal Reasoning via Model Selection

  • VisProg - Visual Programming: Compositional visual reasoning without training

  • DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

  • GPT-Driver - GPT-Driver: Learning to Drive with GPT

  • LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

  • MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

  • AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  • DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android

  • GRID - GRID: A Platform for General Robot Intelligence Development

  • DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

  • MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

  • MuLan - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

  • Mobile-Agent - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Type Ⅱ

  • STEVE - See and Think: Embodied Agent in Virtual Environment

  • EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

  • MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

  • LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

  • GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

  • Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents

Type Ⅲ

  • DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

  • ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Type Ⅳ

  • JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models

  • AppAgent - AppAgent: Multimodal Agents as Smartphone Users

  • MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

  • Loop Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

  • WavJourney - WavJourney: Compositional Audio Creation with Large Language Models

  • DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

  • Cradle - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Multi-Agent

  • MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

  • MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

  • AVIS - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

  • Agent Smith - Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

Application

💡 Complex Visual Reasoning Tasks

  • ViperGPT - ViperGPT: Visual Inference via Python Execution for Reasoning

  • HuggingGPT - HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

  • Chameleon - Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

  • Visual ChatGPT - Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

  • LLaVA-Plus - LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

  • GPT4Tools - GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

  • MLLM-Tool - MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

  • M3 - Towards Robust Multi-Modal Reasoning via Model Selection

  • VisProg - Visual Programming: Compositional visual reasoning without training

  • DDCoT - DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

  • AVIS - AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

  • CLOVA - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

  • CRAFT - CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets

  • MuLan - MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

🎵 Audio Editing & Generation

  • Loop Copilot - Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

  • MusicAgent - MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

  • AudioGPT - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

  • WavJourney - WavJourney: Compositional Audio Creation with Large Language Models

🤖 Embodied AI & Robotics

  • JARVIS-1 - JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models

  • DEPS - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

  • Octopus - Octopus: Embodied Vision-Language Programmer from Environmental Feedback

  • GRID - GRID: A Platform for General Robot Intelligence Development

  • MP5 - MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

  • STEVE - See and Think: Embodied Agent in Virtual Environment

  • EMMA - Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

  • MEIA - Multimodal Embodied Interactive Agent for Cafe Scene

🖱️💻 UI-assistants

  • AppAgent - AppAgent: Multimodal Agents as Smartphone Users

  • DroidBot-GPT - DroidBot-GPT: GPT-powered UI Automation for Android

  • WebWISE - WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

  • Auto-UI - You Only Look at Screens: Multimodal Chain-of-Action Agents

  • MemoDroid - Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

  • ASSISTGUI - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

  • MM-Navigator - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

  • AutoDroid - Empowering LLM to use Smartphone for Intelligent Task Automation

  • GPT-4V-Act - GPT-4V-Act: Chromium Copilot

  • Mobile-Agent - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

🎨 Visual Generation & Editing

  • LLaVA-Interactive - LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

  • MM-REACT - MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

🎥 Video Understanding

  • DoraemonGPT - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

  • ChatVideo - ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

  • AssistGPT - AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

🚗 Autonomous Driving

  • GPT-Driver - GPT-Driver: Learning to Drive with GPT

  • DLAH - Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

🎮 Game-developer

  • SmartPlay - SmartPlay: A Benchmark for LLMs as Intelligent Agents

  • VisualWebArena - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

  • Cradle - Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Other

  • FinAgent - A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist

  • VisionGPT - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Benchmark

  • SmartPlay - SmartPlay: A Benchmark for LLMs as Intelligent Agents

  • VisualWebArena - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

  • GAIA - GAIA: a benchmark for General AI Assistants

  • OmniACT - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
