yc1999 / ToolQA

ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels (easy/hard) across eight real-life scenarios.

Home Page:https://arxiv.org/pdf/2306.13304.pdf

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

🛠️ToolQA

🛠️ The official repository for code and data of ToolQA dataset. ToolQA is a open-source dataset specifically designed for evaluations on tool-augmented large language models (LLMs). This repo provides the dataset, the corresponding data generation code, and the implementations of baselines on our dataset.

Features

  • Our questions are selected and guaranteed that LLMs have little chance to memorize and answer correctly within their internal knowledge;
  • The majority of the questions in ToolQA require compositional use of multiple tools. According to the length of toolchains, we offer two different difficult levels of dataset: Easy and Hard.
  • We apply a thorough diagnosis and analysis of in-context tool-augmented LLMs in our paper.
  • ToolQA is created via collaboration between humans and AI, adaptable to new data and questions with automation.

Data Download

We offer the download link for all the data involved in ToolQA. We offer two categories of data for download and use. The first category is external corpus. This sort of data have already been pre-processed by us and they are used for external tools to interact, (e.g., retrieve, database operations, etc.). The second category of data is the raw data, which cannot be used as external knowledge of ToolQA to interact. We offer this part of data just for users if they want to generate more questions and answers for model tuning or thorough evaluation.

External Corpus Download

The external corpus can be downloaded through this link. After downloading and unzipping, users need to place it under the directory /<YOUR_OWN_PATH>/ToolQA/data/external_corpus/.

Raw Data Download

All the data sources and download guidance are listed below:

  • Flight: You can download the raw flight data from the Download Link. Please place the Combined_Flights_2022.csv file under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/flights/.
  • Coffee: You can download the raw coffee data from the Download Link. Please place the coffee.csv file under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/coffee/.
  • Yelp: You can download the raw Yelp data from the Download Link. Please place the yelp_academic_dataset_business.json file under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/yelp/.
  • Airbnb: You can download the raw Airbnb data from the Download Link. Please place the Airbnb_Open_data.csv file under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/airbnb/.
  • DBLP: You can download the raw DBLP data from the Download Link. Please place the DBLP-Citation-network V14.zip file under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/dblp/.
  • GSM8K: You can download the raw GSM8K data from the Download Link. Please run ChatGPT vanilla on the raw questions and place the result file gsm.chat.jsonl under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/gsm8k/.
  • SciREX: You can download the raw SciREX data from the Download Link. Please place the dataset files train.jsonl, val.jsonl, and test.jsonl under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/scirex/.
  • Agenda: You can download the raw data from our prepared Download Link. Please place the file agenda_events.jsonl under the directory /<YOUR_OWN_PATH>/ToolQA/data/raw_data/agenda/.

Generate New Questions

You can also use the ToolQA to generate new questions under our templates for tuning and new sets of evalations. We offer the data generation code in /dataset_generation/ directory. The only thing to do is to modify the paths in the notebooks.

Tool Implementation

We offer a list of implemented tools in each of the baselines in the benchmark, like ./benchmark/ReAct/code/tools. Please note that the questions are intentionally designed to be open-ended. This reflects our belief that these questions pose sufficient challenges, and we don't wish to limit the tools suggested in our paper. We welcome experiments with more advanced tools (like a superior retriever) to enhance performance or devising a more effective planning module for better compositional usage of our defined tools. Therefore, we are excited to see diverse implementations in response to all our questions.

Retriever

We implement the retriever with Langchain package and the Chroma vector database. We have uploaded the pre-processed chroma vectorbase in the Download Link. Please download the file under the directory /<YOUR_OWN_PATH>/ToolQA/data/chroma_db/.

SQL Interpreter

To interprete SQL commands, the user may need to load the database into the mysql database first. You can run the following commands for database creation (the entire process may take hours):

python ./benchmark/ReAct/code/tools/table/mysql_db_create.py

Math Calculator

To use the calculator in the implementation. You first need to sign up an account through the official Wolframalpha developer portal.

Current Progress

The data and code are in the final stage of cleaning and will be public gradually in a very short period. We offer the detailed progress of the final examination in the TODO list part.

To Do List -- Easy Questions

  • Flight-easy questions and code;
  • Coffee-easy questions and code;
  • Yelp-easy questions and code;
  • Airbnb-easy questions and code;
  • SciREX-easy questions and code;
  • Agenda-easy questions and code;
  • GSM8K-easy questions and code;
  • DBLP-easy questions and code;

To Do List -- Hard Questions

  • Flight-hard questions and code;
  • Coffee-hard questions and code;
  • Yelp-hard questions and code;
  • Airbnb-hard questions and code;
  • SciREX-hard questions and code;
  • Agenda-hard questions and code;
  • DBLP-hard questions and code;

Benchmark Code Release

  • Tool Implementations;
  • ChatGPT Vanilla;
  • ReAct;
  • Chameleon;

Questions?

If you have any questions, feel free to reach out to yczhuang at gatech.edu. Please try to specify the problem with details so we can help you better and quicker!

Citation

If you find this repository valuable for your research, we kindly request that you acknowledge our paper by citing the following paper. We appreciate your consideration.

@misc{zhuang2023toolqa,
      title={ToolQA: A Dataset for LLM Question Answering with External Tools}, 
      author={Yuchen Zhuang and Yue Yu and Kuan Wang and Haotian Sun and Chao Zhang},
      year={2023},
      eprint={2306.13304},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels (easy/hard) across eight real-life scenarios.

https://arxiv.org/pdf/2306.13304.pdf

License:Apache License 2.0


Languages

Language:Jupyter Notebook 72.2%Language:Python 27.8%