llm-planning-eval

Code and data for our paper When is Tree Search Useful for LLM Planning? It Depends on the Discriminator.

Updates:

02/21/24: We have made the initial release of our code and data. Please feel free to open an issue if you run into any problems. Our release includes:
- Experimental setup with third-party resources
- Data and preprocessing
- Code for framework and implementation
- Scripts for intrinsic and end-to-end evaluation

Installation
Experimental Setup
Data and Preprocessing
Evaluation
Citation

1 Installation

Please run the following commands to create a conda environment:

conda env create -f environment.yml
conda activate llm-planning
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
conda install cudatoolkit==11.8.0

You may also create two additional folders to avoid potential OS errors:

mkdir log
mkdir results

2 Experimental Setup

To run our text-to-SQL parsing experiments, we need to set up some third-party resources.

Download the original Spider dataset from this link.
Clone the Spider evaluation code from this GitHub repository:

git clone https://github.com/taoyds/test-suite-sql-eval.git

Download the original Bird dataset from this website and the evaluation code. Then, reorganize the resources in a folder:

├───  bird-sql
│    ├───  databases
│        ├───  ...
│    ├───  dev
│        ├───  ...
│    ├───  train
│        ├───  ...
│    ├───  evaluation.py

Note that we merge all train/dev databases into a single databases folder, instead of under the dev or train folder as in the original distribution.

Finally, we put all the directories in parallel:

├───  bird-sql
│    ├───  ...
├───  llm-planning-eval
│    ├───  ...
├───  spider
│    ├───  ...
├───  test-suite-sql-eval
│    ├───  ...

3 Data and Preprocessing

You can find all our preprocessed data at this link and unzip it inside this repository.

├───  llm-planning-eval
│    ├───  data
│        ├───  ...
│    ├───  evaluation_configs
│        ├───  ...
│    ├───  ...

To preprocess text-to-SQL datasets by yourself, you may refer to the example commands in scripts/preproc/. It may take some time depending on how fast your machine can process (large) databases.

For GSM8K, we simply extracted the numerical answers at the end of each annotated solution without any other preprocessing.

4 Evaluation

We include example scripts for all our experiments in scripts/. The LoRA weights of our fine-tuned LLMs can be accessed here.

Extension to Other LLMs

If you would like to extend our framework to other LLMs, you may implement a new generator/evaluator class or a new planning method under the corresponding directories and import them in the functions select_models/select_method. You may start your implementation by copying one of the provided source code files and modify the classes/funtions accordingly.

5 Citation

Please cite our paper with the following bibtex:

@misc{chen2024tree,
      title={When is Tree Search Useful for LLM Planning? It Depends on the Discriminator}, 
      author={Ziru Chen and Michael White and Raymond Mooney and Ali Payani and Yu Su and Huan Sun},
      year={2024},
      eprint={2402.10890},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If used, please also cite the original datasets and evaluation methods accordingly:

@inproceedings{yu-etal-2018-spider,
    title = "{S}pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-{SQL} Task",
    author = "Yu, Tao  and
      Zhang, Rui  and
      Yang, Kai  and
      Yasunaga, Michihiro  and
      Wang, Dongxu  and
      Li, Zifan  and
      Ma, James  and
      Li, Irene  and
      Yao, Qingning  and
      Roman, Shanelle  and
      Zhang, Zilin  and
      Radev, Dragomir",
    editor = "Riloff, Ellen  and
      Chiang, David  and
      Hockenmaier, Julia  and
      Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1425",
    doi = "10.18653/v1/D18-1425",
    pages = "3911--3921",
}

@inproceedings{zhong-etal-2020-semantic,
    title = "Semantic Evaluation for Text-to-{SQL} with Distilled Test Suites",
    author = "Zhong, Ruiqi  and
      Yu, Tao  and
      Klein, Dan",
    editor = "Webber, Bonnie  and
      Cohn, Trevor  and
      He, Yulan  and
      Liu, Yang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.29",
    doi = "10.18653/v1/2020.emnlp-main.29",
    pages = "396--411",
}

@inproceedings{li2023can,
title={Can {LLM} Already Serve as A Database Interface? A {BI}g Bench for Large-Scale Database Grounded Text-to-{SQL}s},
author={Jinyang Li and Binyuan Hui and GE QU and Jiaxi Yang and Binhua Li and Bowen Li and Bailin Wang and Bowen Qin and Ruiying Geng and Nan Huo and Xuanhe Zhou and Chenhao Ma and Guoliang Li and Kevin Chang and Fei Huang and Reynold Cheng and Yongbin Li},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023},
url={https://openreview.net/forum?id=dI4wzAE6uV}
}

@misc{cobbe2021gsm8k,
      title={Training Verifiers to Solve Math Word Problems}, 
      author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman},
      year={2021},
      eprint={2110.14168},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

OSU-NLP-Group / llm-planning-eval