🛠️ Awesome LMs with Tools

Language models (LMs) are powerful, yet they are mostly limited to text-generation tasks. Tools substantially enhance their performance on tasks that require complex skills.

Based on our recent survey about LM-used tools, "What Are Tools Anyway? A Survey from the Language Model Perspective", we provide a structured list of literature relevant to tool-augmented LMs.

Tool basics ($\S2$)
Tool use paradigm ($\S3$)
Scenarios ($\S4$)
Advanced methods ($\S5$)
Evaluation ($\S6$)

If you find our paper or code useful, please cite the paper:

@article{wang2022what,
  title={What Are Tools Anyway? A Survey from the Language Model Perspective},
  author={Zhiruo Wang and Zhoujun Cheng and Hao Zhu and Daniel Fried and Graham Neubig},
  journal={arXiv preprint arXiv:2403.15452},
  year={2024}
}

$\S2$ Tool Basics

$\S2.1$ What are tools? 🛠️

Definition and discussion of animal-used tools

Animal tool behavior: the use and manufacture of tools by animals Shumaker, Robert W., Kristina R. Walkup, and Benjamin B. Beck. 2011 [Book]
Early discussions on LM-used tools

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
A survey on augmented LMs, including tool augmentation

Augmented Language Models: a Survey Mialon, Grégoire, et al. 2023.02 [Paper]

$\S2.3$ Tools and "Agents" 🤖

Definition of agents

Artificial Intelligence: A Modern Approach Russell, Stuart J., and Peter Norvig. 2016 [Book]
Survey about agents that perceive and act in the environment

The Rise and Potential of Large Language Model Based Agents: A Survey Xi, Zhiheng, et al. 2023.09 [Preprint]
Survey about the cognitive architectures for language agents

Cognitive Architectures for Language Agents Sumers, Theodore R., et al. 2023.09 [Paper]

$\S3$ The basic tool use paradigm

Early works that set up the commonly used tooling paradigm

Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
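As a rough illustration of the paradigm these works set up, the LM emits a tool call inside its generated text, the call is executed, and the result is spliced back into the text. The sketch below is a toy version under stated assumptions: the `[ToolName(argument)]` call syntax, the `TOOLS` registry, and the `calculator` helper are all hypothetical and are not taken from any of the listed papers.

```python
import re


def calculator(expression: str) -> str:
    """Hypothetical tool: evaluate a plain arithmetic expression."""
    # Only digits, whitespace, and basic operators are accepted.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression, {"__builtins__": {}}, {}))


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"Calculator": calculator}

# Assumed call syntax for this sketch: [ToolName(argument)]
CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")


def execute_tool_calls(generated_text: str) -> str:
    """Replace each recognized tool call in the text with the tool's output."""

    def run(match: re.Match) -> str:
        name, argument = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            # Unknown tools are left untouched rather than guessed at.
            return match.group(0)
        return tool(argument)

    return CALL_PATTERN.sub(run, generated_text)


if __name__ == "__main__":
    print(execute_tool_calls("The total is [Calculator(12*7)] apples."))
```

In a real system the text would come from an LM that pauses decoding at the call and resumes after the result is inserted; here a fixed string stands in for the model output.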

Inference-time prompting

Provide in-context examples of tool use on visual programming problems

Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]
Tool learning via in-context examples on reasoning problems involving text or multi-modal inputs

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models Lu, Pan, et al. 2024 [Paper]
In-context-learning-based tool use for reasoning problems in BigBench and MMLU

ART: Automatic multi-step reasoning and tool-use for large language models Paranjape, Bhargavi, et al. 2023.03 [Preprint]
Providing tool documentation for in-context tool learning

Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models Hsieh, Cheng-Yu, et al. 2023.08 [Preprint]

Learning by training

Training on human-annotated examples of (NL input, tool-using solution output) pairs

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs Li, Minghao, et al. 2023.12 [Paper]

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems Kadlčík, Marek, et al. 2023 [Paper]
Training on model-synthesized examples

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use Huang, Yue, et al. 2023.10 [Paper]

Making Language Models Better Tool Learners with Execution Feedback Qiao, Shuofei, et al. 2023.05 [Preprint]

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error Wang, Boshi, et al. 2024.03 [Preprint]
Self-training with bootstrapped examples

Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]

$\S4$ Scenarios

Knowledge access 📚

Collect data from structured knowledge sources, e.g., databases, knowledge graphs, etc.

LaMDA: Language Models for Dialog Applications Thoppilan, Romal, et al. 2022.01 [Paper]

TALM: Tool Augmented Language Models Parisi, Aaron, Yao Zhao, and Noah Fiedel. 2022.05 [Preprint]

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings Hao, Shibo, et al. 2024 [Paper]

ToolQA: A Dataset for LLM Question Answering with External Tools Zhuang, Yuchen, et al. 2024 [Paper]

Middleware for LLMs: Tools are Instrumental for Language Agents in Complex Environments Gu, Yu, et al. 2024 [Paper]

GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information Jin, Qiao, et al. 2024 [Paper]
Search for information on the web

Internet-augmented language models through few-shot prompting for open-domain question answering Lazaridou, Angeliki, et al. 2022.03 [Paper]

Internet-Augmented Dialogue Generation Komeili, Mojtaba, Kurt Shuster, and Jason Weston. 2022 [Paper]
Viewing retrieval models as tools under the retrieval-augmented generation context

Retrieval-based Language Models and Applications Asai, Akari, et al. 2023 [Tutorial]

Augmented Language Models: a Survey Mialon, Grégoire, et al. 2023.02 [Paper]

Computation activities 🔣

Using a calculator for math calculations

Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems Kadlčík, Marek, et al. 2023 [Paper]
Using programs/Python interpreter to perform more complex operations

Pal: Program-aided language models Gao, Luyu, et al. 2023 [Paper]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks Chen, Wenhu, et al. 2022.11 [Paper]

Mint: Evaluating llms in multi-turn interaction with tools and language feedback Wang, Xingyao, et al. 2023.09 [Paper]

MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning Das, Debrup, et al. 2024 [Preprint]

ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving Gou, Zhibin, et al. 2023.09 [Paper]
Tools for more advanced business activities, e.g., financial, medical, education, etc.

On the Tool Manipulation Capability of Open-source Large Language Models Xu, Qiantong, et al. 2023.05 [Paper]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]

Mint: Evaluating llms in multi-turn interaction with tools and language feedback Wang, Xingyao, et al. 2023.09 [Paper]

AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning Jin, Qiao, et al. 2024.02 [Paper]

Interaction with the world 🌐

Access real-time or real-world information such as weather, location, etc.

On the Tool Manipulation Capability of Open-source Large Language Models Xu, Qiantong, et al. 2023.05 [Paper]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
Managing personal events such as calendar or emails

Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
Tools in embodied environments, e.g., the Minecraft world

Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, Guanzhi, et al. 2023.05 [Paper]
Tools interacting with the physical world

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models Singh, Ishika, et al. 2023 [Paper]

Alfred: A benchmark for interpreting grounded instructions for everyday tasks Shridhar, Mohit, et al. 2020 [Paper]

Autonomous chemical research with large language models Boiko, Daniil A., et al. 2023 [Paper]

Non-textual modalities 🎞️

Tools providing access to information in non-textual modalities

Vipergpt: Visual inference via python execution for reasoning Surís, Dídac, Sachit Menon, and Carl Vondrick. 2023 [Paper]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Yang, Zhengyuan, et al. 2023.03 [Preprint]

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Gao, Difei, et al. 2023.06 [Preprint]
Tools that can answer questions about data in other modalities

Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]

Special-skilled models 🤗

Text-generation models that can perform specific tasks, e.g., question answering, machine translation

Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]

ART: Automatic multi-step reasoning and tool-use for large language models Paranjape, Bhargavi, et al. 2023.03 [Preprint]
Integration of available models on Huggingface, TorchHub, TensorHub, etc.

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Yongliang, et al. 2024 [Paper]

Gorilla: Large language model connected with massive apis Patil, Shishir G., et al. 2023.05 [Paper]

Taskbench: Benchmarking large language models for task automation Shen, Yongliang, et al. 2023.11 [Paper]

$\S5$ Advanced methods

$\S5.1$ Complex tool selection and usage 🧐

Train retrievers that map natural language instructions to tool documentation

DocPrompting: Generating Code by Retrieving the Docs Zhou, Shuyan, et al. 2022.07 [Paper]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
Ask LMs to write hypothetical tool descriptions and search relevant tools

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets Yuan, Lifan, et al. 2023.09 [Paper]
Complex tool usage, e.g., parallel calls

Function Calling and Other API Updates Eleti, Atty, et al. 2023.06 [Blog]
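To make the retrieval idea in this subsection concrete, the sketch below ranks tools by how well their documentation matches a natural language instruction. It is a minimal stand-in, not the method of any listed paper: those works train retrievers, whereas this example uses untrained bag-of-words cosine similarity, and the tool names and descriptions in `TOOL_DOCS` are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical tool documentation; names and descriptions are made up.
TOOL_DOCS = {
    "get_weather": "Return the current weather forecast and temperature for a city.",
    "send_email": "Send an email message to a recipient address with a subject.",
    "convert_currency": "Convert an amount of money from one currency to another.",
}


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(instruction: str, k: int = 1) -> list:
    """Return the names of the k tools whose docs best match the instruction."""
    query = bag_of_words(instruction)
    ranked = sorted(
        TOOL_DOCS,
        key=lambda name: cosine(query, bag_of_words(TOOL_DOCS[name])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(retrieve_tools("What is the weather in Paris?"))
```

A trained dense retriever would replace `bag_of_words` and `cosine` with learned embeddings; the ranking step stays the same.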

$\S5.2$ Tools in programmatic contexts 👩‍💻

Domain-specific logical forms to query structured data

Semantic parsing on freebase from question-answer pairs Berant, Jonathan, et al. 2013 [Paper]

Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task Yu, Tao, et al. 2018.09 [Paper]

Break It Down: A Question Understanding Benchmark Wolfson, Tomer, et al. 2020 [Paper]
Domain-specific actions for agentic tasks such as web navigation

Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration Liu, Evan Zheran, et al. 2018.02 [Paper]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents Yao, Shunyu, et al. 2022.07 [Paper]

Webarena: A realistic web environment for building autonomous agents Zhou, Shuyan, et al. 2023.07 [Paper]
Using external Python libraries as tools

ToolCoder: Teach Code Generation Models to use API search tools Zhang, Kechi, et al. 2023.05 [Paper]
Using expert-designed functions as tools to answer questions about images

Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]

Vipergpt: Visual inference via python execution for reasoning Surís, Dídac, Sachit Menon, and Carl Vondrick. 2023 [Paper]
Using GPT as a tool to query external Wikipedia knowledge for table-based question answering

Binding Language Models in Symbolic Languages Cheng, Zhoujun, et al. 2022.10 [Paper]
Incorporate QA API and operation APIs to assist table-based question answering

API-Assisted Code Generation for Question Answering on Varied Table Structures Cao, Yihan, et al. 2023.12 [Paper]

$\S5.3$ Tool creation and reuse 👩‍🔬

Approaches to abstract libraries for domain-specific logical forms from a large corpus

DreamCoder: growing generalizable, interpretable knowledge with wake--sleep Bayesian program learning Ellis, Kevin, et al. 2020.06 [Paper]

Leveraging Language to Learn Program Abstractions and Search Heuristics Wong, Catherine, et al. 2021 [Paper]

Top-Down Synthesis for Library Learning Bowers, Matthew, et al. 2023 [Paper]

LILO: Learning Interpretable Libraries by Compressing and Documenting Code Grand, Gabriel, et al. 2023.10 [Paper]
Make and learn skills (JavaScript programs) in the embodied Minecraft world

Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, Guanzhi, et al. 2023.05 [Paper]
Leverage LMs as tool makers on BigBench tasks

Large Language Models as Tool Makers Cai, Tianle, et al. 2023.05 [Preprint]
Create tools for math and table QA tasks by example-wise tool making

CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation Qian, Cheng, et al. 2023.05 [Paper]
Make tools via heuristic-based training and tool deduplication

CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets Yuan, Lifan, et al. 2023.09 [Paper]
Learning tools by refactoring a small number of programs

ReGAL: Refactoring Programs to Discover Generalizable Abstractions Stengel-Eskin, Elias, Archiki Prasad, and Mohit Bansal. 2024.01 [Preprint]
A training-free approach to make tools via execution consistency

🎁 TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks Wang, Zhiruo, Daniel Fried, and Graham Neubig. 2024.01 [Preprint]
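One way to picture selection by execution consistency, the idea named in the last entry above, is to run several candidate programs and keep one whose output agrees with the majority. The sketch below is a simplified illustration and not the implementation from any listed paper: the `solve()` entry point, the candidate strings, and the shortest-program tie-break are assumptions made for this example.

```python
from collections import Counter


def run_candidate(source: str):
    """Execute a candidate program and return solve()'s result, or None on failure."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["solve"]()
    except Exception:
        return None


def select_by_execution_consistency(candidates):
    """Pick a program whose execution result matches the most common result.

    Results must be hashable. Among agreeing programs, the shortest one is
    returned (an assumption of this sketch).
    """
    results = [run_candidate(source) for source in candidates]
    votes = Counter(result for result in results if result is not None)
    if not votes:
        return None, None
    majority_result, _ = votes.most_common(1)[0]
    agreeing = [
        source for source, result in zip(candidates, results)
        if result == majority_result
    ]
    return min(agreeing, key=len), majority_result


if __name__ == "__main__":
    # Hypothetical candidates, standing in for programs sampled from an LM.
    samples = [
        "def solve():\n    return sum(range(1, 5))",
        "def solve():\n    return 1 + 2 + 3 + 4",
        "def solve():\n    return 1 + 2 + 3",
        "def solve():\n    return undefined_name",
    ]
    program, answer = select_by_execution_consistency(samples)
    print(answer)
```

No labeled answers are needed here: agreement among independently sampled programs serves as the selection signal, which is what makes the approach training-free.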

$\S6$ Evaluation: Testbeds

$\S6.1.1$ Repurposed existing datasets

Datasets that require reasoning over texts

Measuring Mathematical Problem Solving With the MATH Dataset Hendrycks, Dan, et al. 2021.03 [Paper]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models Srivastava, Aarohi, et al. 2022.06 [Paper]
Datasets that require reasoning over structured data, e.g., tables

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning Lu, Pan, et al. 2022.09 [Paper]

Compositional Semantic Parsing on Semi-Structured Tables Pasupat, Panupong, and Percy Liang. 2015 [Paper]

HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation Cheng, Zhoujun, et al. 2022 [Paper]
Datasets that require reasoning over other modalities, e.g., images and image pairs

Gqa: A new dataset for real-world visual reasoning and compositional question answering Hudson, Drew A., and Christopher D. Manning. 2019.02 [Paper]

A Corpus for Reasoning about Natural Language Grounded in Photographs Suhr, Alane, et al. 2019 [Paper]
Example datasets that require a retriever model (tool) to solve

Natural Questions: A Benchmark for Question Answering Research Kwiatkowski, Tom, et al. 2019 [Paper]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension Joshi, Mandar, et al. 2017 [Paper]

$\S6.1.2$ Aggregated API benchmarks

Collect RapidAPIs and use models to synthesize examples for evaluation

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
Collect APIs from PublicAPIs and use models to synthesize examples

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
Collect APIs from PublicAPIs and manually annotate examples for evaluation

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs Li, Minghao, et al. 2023.12 [Paper]
Collect APIs from OpenAI plugin list and use models to synthesize examples

MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use Huang, Yue, et al. 2023.10 [Paper]
Collect neural model tools from Huggingface hub, TorchHub, and TensorHub

Gorilla: Large language model connected with massive apis Patil, Shishir G., et al. 2023.05 [Paper]
Collect neural model tools from Huggingface

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Yongliang, et al. 2024 [Paper]
Collect tools from Huggingface and PublicAPIs

Taskbench: Benchmarking large language models for task automation Shen, Yongliang, et al. 2023.11 [Paper]
