PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

Requirements

Linux
Python 3.6+
- Install required packages with pip install -r requirements.txt
Fast Downward
1. Download from here
2. Set the environment variable FAST_DOWNWARD to the path of the folder: FAST_DOWNWARD=/path/to/fast_downward
VAL
1. Use the version in planner_tools or download from here
2. Set the environment variable VAL to the path of the folder: VAL=/path/to/val
PR2Plan
1. Use the version in planner_tools or download and compile obs-compiler from here
2. Set the environment variable PR2 to the path of the folder: PR2=/path/to/pr2plan
LLM access/setup (currently OpenAI and BLOOM)
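
Before running the pipeline, it can help to confirm that the three planner tool variables listed above are set. The following is a minimal sketch, not part of PlanBench: the helper name missing_planner_paths is hypothetical, and it only checks that each variable points to an existing folder.

```python
import os

# Variable names are taken from the requirements list above.
REQUIRED_VARS = ("FAST_DOWNWARD", "VAL", "PR2")


def missing_planner_paths(env=None):
    """Return the required variables that are unset or do not name a directory."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_planner_paths()
    if problems:
        print("Set these variables to existing folders:", ", ".join(problems))
    else:
        print("All planner tool paths are set.")
```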

Usage

Run the following command to run the entire pipeline:

python3 llm_plan_pipeline.py --task TASK --config CONFIG --engine ENGINE [--ignore_existing] [--run_till_completion RUN-TILL-COMPLETION] [--specific_instances SPECIFIC-INSTANCES] [--random_example RANDOM-EXAMPLE] [--verbose VERBOSE] [--seed SEED]

Required arguments:

--task: The task to run. Refer to the list of tasks below.
--config: The name of the config file to use. The config file must be a YAML file present in the configs folder.
--engine: The name of the engine to use. Refer to the list of engines below.

Optional arguments:

--ignore_existing: If added to the command, the pipeline ignores previously completed instances and reruns the entire pipeline. If omitted, the pipeline does not redo completed instances. Default is False.
--run_till_completion: If set to True, the pipeline reruns the task until it completes successfully. If set to False, the task is run once; any failure while retrieving a response from the model is ignored and the response is recorded as empty. Default is False.
--specific_instances: If a list of instance ids is provided, the pipeline will only run the task on those instances. If not provided, the pipeline will run the task on all instances between the start and end provided in the config file. Default is None. For example, --specific_instances 1 2 3 4 5
--random_example: If set to True, the example instance for each task will be randomly chosen from the set of instances. If set to False, the previous instance id will be used for the example prompt. Default is False.
--verbose: If set to True, the pipeline will print the prompts, responses and evaluation. Default is False.
--seed: The seed to use for randomization. Default is 42.
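
When the pipeline is launched from another script, the documented flags can be assembled programmatically. The sketch below is an illustration only. The function name build_pipeline_command is hypothetical, the config name "blocksworld" in the usage note is a placeholder, and passing the literal string "True" for the boolean options is an assumption based on the "If set to True" wording above. Options left at their defaults are omitted so that the script's own defaults apply.

```python
def build_pipeline_command(task, config, engine, *, ignore_existing=False,
                           run_till_completion=False, specific_instances=None,
                           random_example=False, verbose=False, seed=None):
    """Assemble the argument list for llm_plan_pipeline.py from the documented flags."""
    cmd = ["python3", "llm_plan_pipeline.py",
           "--task", task, "--config", config, "--engine", engine]
    if ignore_existing:
        cmd.append("--ignore_existing")  # bare flag, takes no value
    # Assumption: boolean options accept the literal value "True".
    for flag, enabled in (("--run_till_completion", run_till_completion),
                          ("--random_example", random_example),
                          ("--verbose", verbose)):
        if enabled:
            cmd += [flag, "True"]
    if specific_instances:
        cmd += ["--specific_instances", *map(str, specific_instances)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

For example, build_pipeline_command("t1", "blocksworld", "gpt-4_chat", specific_instances=[1, 2, 3]) yields a list that can be passed to subprocess.run.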

Run the following command to only run prompt generation:

python3 prompt_generation.py --task TASK --config CONFIG [--ignore_existing] [--specific_instances SPECIFIC-INSTANCES] [--random_example RANDOM-EXAMPLE] [--verbose VERBOSE] [--seed SEED]

This will generate the prompts for the given task and store them in the prompts folder as json files.

Run the following command to only run response generation (PROMPT JSONS MUST BE GENERATED FIRST):

python3 response_generation.py --task TASK --config CONFIG --engine ENGINE [--ignore_existing] [--run_till_completion RUN-TILL-COMPLETION]

This will generate the responses for the given task using the generated prompts. The generated responses are appended to the prompt jsons and are stored in the responses folder.

Run the following command to only run evaluation (RESPONSE JSONS MUST BE GENERATED FIRST):

python3 response_evaluation.py --task TASK --config CONFIG --engine ENGINE [--ignore_existing] [--verbose VERBOSE]

This will evaluate the raw responses generated by the model. The evaluation is appended to the response jsons and the final results are stored in the results folder.
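
The three stages depend on each other: prompts must exist before responses, and responses before evaluation. A small wrapper can enforce that order and stop at the first failure. This is a sketch under assumptions: stage_commands and run_stages are hypothetical helpers, and only the required arguments from the commands above are passed.

```python
import subprocess


def stage_commands(task, config, engine):
    """Return the three stage commands in the order the stages must run."""
    common = ["--task", task, "--config", config]
    return [
        ["python3", "prompt_generation.py", *common],
        ["python3", "response_generation.py", *common, "--engine", engine],
        ["python3", "response_evaluation.py", *common, "--engine", engine],
    ]


def run_stages(task, config, engine, runner=subprocess.run):
    """Run each stage in order; check=True raises on the first failing stage."""
    for cmd in stage_commands(task, config, engine):
        runner(cmd, check=True)
```

The runner parameter exists only so the ordering can be exercised without invoking the real scripts.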

List of tasks:

t1 = Plan Generation
t2 = Optimal Planning
t3 = Plan Verification
t4 = Plan Reuse
t5 = Plan Generalization
t6 = Replanning
t7 = Reasoning about Plan Execution
t8_1 = Goal Reformulation (Goal shuffling)
t8_2 = Goal Reformulation (Full -> Partial)
t8_3 = Goal Reformulation (Partial -> Full)
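
For scripts that accept a task id from the user, the table above can be mirrored in a dictionary so that unknown ids are rejected early. This is a convenience sketch; the names TASKS and describe_task are hypothetical and not part of PlanBench.

```python
# Task ids and names copied from the list above.
TASKS = {
    "t1": "Plan Generation",
    "t2": "Optimal Planning",
    "t3": "Plan Verification",
    "t4": "Plan Reuse",
    "t5": "Plan Generalization",
    "t6": "Replanning",
    "t7": "Reasoning about Plan Execution",
    "t8_1": "Goal Reformulation (Goal shuffling)",
    "t8_2": "Goal Reformulation (Full -> Partial)",
    "t8_3": "Goal Reformulation (Partial -> Full)",
}


def describe_task(task_id):
    """Return the task name for an id, or raise ValueError for an unknown id."""
    try:
        return TASKS[task_id]
    except KeyError:
        raise ValueError(
            f"Unknown task {task_id!r}; expected one of {sorted(TASKS)}"
        ) from None
```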

List of engines:

The engines are not limited to the ones listed below.

OpenAI models:
- For completion models in OpenAI just specify the model name.
  - ada, davinci, text-davinci-002 etc.
- For chat models in OpenAI add the suffix '_chat' to the model name
  - gpt-3.5-turbo_chat, gpt-4_chat etc.
- For fine-tuned models in OpenAI add the prefix 'finetuned:' to the model name
  - finetuned:davinci:2022-05-03-00-00-00 etc.
Other LLMs (currently supported: BLOOM):
- Just specify the LLM name
  - bloom etc.

For BLOOM:

Set the environment variable BLOOM_CACHE_DIR to the model's cache directory: BLOOM_CACHE_DIR=/path/to/bloom/cache/dir
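
The engine naming rules above can be illustrated with a small classifier. This is a sketch of the convention only, not PlanBench's internal dispatch: the function name classify_engine and the kind labels are hypothetical, and treating every other name as an OpenAI completion model is an assumption.

```python
def classify_engine(engine):
    """Split an engine string into (kind, model_name) using the naming rules above."""
    if engine.startswith("finetuned:"):
        return "openai_finetuned", engine[len("finetuned:"):]
    if engine.endswith("_chat"):
        return "openai_chat", engine[: -len("_chat")]
    if engine == "bloom":
        return "bloom", engine
    # Assumption: any other name is treated as an OpenAI completion model.
    return "openai_completion", engine
```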

Problem Generators

We also provide the problem generators used in PlanBench. The problems in the main dataset are produced by generators from the IPC competitions (GitHub repo), to which we have added filters. In addition, we provide our own problem generators for the Plan Generalization task. Both kinds of generators are in problem_generators.py and can be used as follows:

python3 problem_generators.py --config CONFIG [--is_generalization] [--n_instances INSTANCES] [--max_blocks MAX-BLOCKS]

Required arguments:

--config: The name of the config file to use. The config file must be a YAML file present in the configs folder.

Optional arguments:

--is_generalization: If added as part of the command, the generator will generate problems for the Plan Generalization Task. If not added, the generator will generate problems for the other tasks.
--n_instances: The number of instances to generate. Default is 0, in which case the n_instances value in the config file is used.
--max_blocks (ONLY FOR BLOCKSWORLD DOMAIN): The maximum number of blocks in the generated problems as part of the main dataset. Default is 5.

Obfuscation of domains (Deceptive or Randomized)

We have included a set of obfuscations to generate mystery versions of each domain. The obfuscations are in the mystery folder under each instance's folder. We have also provided a way to generate arbitrary obfuscated versions (deceptive or randomized) for any domain. The obfuscated versions can be generated as follows:

python3 obfuscator.py --config CONFIG [--randomized_obfuscation] [--words_filename WORDS-FILENAME] [--seed SEED] [--output_filename OUTPUT-FILENAME]

Required arguments:

--config: The name of the config file to use. The config file must be a YAML file present in the configs folder.

Optional arguments:

--randomized_obfuscation: If added as part of the command, the obfuscator will generate a randomized obfuscation. If not added, the obfuscator will generate a deceptive obfuscation.
--words_filename: The name of the file containing the words to use for obfuscation. The file must be a text file with each word in a new line. Default is obfuscate/random_words_1.txt.
--seed: The seed to use for randomization. Default is 0.
--output_filename: The name of the file to store the obfuscated domain config file. Default is configs/obfuscated_[TYPE-OF-OBFUSCATION]_[DOMAIN-NAME].yaml.
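
To make the idea of a randomized obfuscation concrete, the sketch below maps each domain name to a distinct word drawn from a word list, reproducibly for a given seed. It is an illustration of the concept and not the algorithm in obfuscator.py. The function names and the example names in the usage note are hypothetical.

```python
import random


def load_words(path):
    """Read one word per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def randomized_mapping(names, words, seed=0):
    """Map each name to a distinct random word; the same seed gives the same mapping."""
    names = list(names)
    words = list(words)
    if len(words) < len(names):
        raise ValueError("need at least as many words as names")
    rng = random.Random(seed)
    chosen = rng.sample(words, len(names))  # sampling without replacement
    return dict(zip(names, chosen))
```

For example, randomized_mapping(["pick-up", "stack"], ["alpha", "beta", "gamma"], seed=0) assigns two different words to the two action names, and repeating the call with the same seed returns the same assignment.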

Adding a new LLM into PlanBench

If the LLM is loaded locally
1. Add the LLM querying code in the send_query function in utils/llm_utils.py based on the engine name.
2. Load the model by adding a function in ResponseGenerator class in response_generation.py and call it in the __init__ function based on the engine name.
If the LLM is loaded remotely
1. Either replace the send_query function in utils/llm_utils.py with the required LLM querying code or add a new function and call it in the send_query function in utils/llm_utils.py based on the engine name.
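
Dispatching on the engine name, as described in the steps above, might look like the following. This is a hypothetical sketch: the actual signature of send_query in utils/llm_utils.py is not shown in this README, so the parameters (query, engine, max_tokens), the registry, and the engine name "my_local_llm" are all assumptions made for illustration.

```python
# Hypothetical registry mapping engine names to query functions.
QUERY_HANDLERS = {}


def register_engine(name):
    """Decorator that registers a query function under an engine name."""
    def decorator(func):
        QUERY_HANDLERS[name] = func
        return func
    return decorator


@register_engine("my_local_llm")
def query_my_local_llm(query, max_tokens):
    # Placeholder: a real handler would call the loaded model here.
    return f"[stub response to {len(query)} characters, max_tokens={max_tokens}]"


def send_query(query, engine, max_tokens):
    """Route the query to the handler registered for the engine (assumed signature)."""
    handler = QUERY_HANDLERS.get(engine)
    if handler is None:
        raise ValueError(f"No query handler registered for engine {engine!r}")
    return handler(query, max_tokens)
```

A registry keeps each new LLM's code in its own function, so adding a model does not require editing a growing if/elif chain.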

Adding a new IPC domain in PlanBench

Generate a set of instances and add them in a separate folder for the domain in the instances folder
Add a .yaml file in the configs folder containing the specifics of the domain
- The .yaml file should contain the following:
  - domain_name: The name of the domain
  - domain_file: The path to the domain file
  - instance_dir: The path to the directory containing the instances
  - generalized_instance_dir: The path to the directory containing the instances for the Plan Generalization task
  - instance_template: The template for the instance file names
  - n_instances: The number of instances in the domain
  - domain_intro: The translated domain description
  - domain_intro_cost: The translated domain description with costs for Optimal Planning task
  - actions: A dictionary that maps action names to their translated descriptions. The parameters of the action should be represented as {}. For example, "move": "move {} from {} to {}" for an action "move" with parameters object, location1 and location2.
  - predicates: A dictionary that maps predicate names to their translated descriptions. The parameters of the predicate should be represented as {}. For example, "at": "object {} is at location {}" for a predicate "at" with parameters object and location.
  - encoded_objects: A dictionary that maps object names to their encoded names. For example, "a":"red block" or "p": "package_{}".
  - predicate_mapping (optional): A dictionary that maps predicate names to the crucial part of the translated description, used for reverse translation. For example, for a predicate ontable with parameter object and the description "the {} is on the table", the crucial part is "on the table", without the parameters.
Add domain specific translations in various functions in utils/.
- Look for # ADD SPECIFIC TRANSLATION FOR EACH DOMAIN HERE in pddl_to_text.py, text_to_pddl.py and task_to_text.py and add the specific translations for the domain.
For the Plan Generalization task, you have to generate a specific set of instances and add the path of that directory in the yaml file.
For the Replanning task, if you want to perform a specific type of replanning, add it to the replanning_domain_specific function in Executor/__init__.py. Make sure that the domain name matches the domain name in the yaml file.
Voila! Run PlanBench on the new domain.
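
The {} placeholders in the actions, predicates and encoded_objects entries described above behave like Python format fields. The sketch below shows how such templates could be filled; it is not the code in utils/. The helper names are hypothetical, the dictionary values are the examples from the field descriptions, and splitting an object name such as "p3" into a one-letter prefix and a suffix is an assumption.

```python
# Example values taken from the field descriptions above.
actions = {"move": "move {} from {} to {}"}
predicates = {"at": "object {} is at location {}"}
encoded_objects = {"a": "red block", "p": "package_{}"}


def encode_object(name):
    """Translate an object name such as 'a' or 'p3' using encoded_objects."""
    if name in encoded_objects:
        return encoded_objects[name]
    # Assumption: names like 'p3' are a one-letter prefix plus a suffix.
    prefix, suffix = name[0], name[1:]
    return encoded_objects[prefix].format(suffix)


def translate(template_map, name, args):
    """Fill the {} placeholders of the named template with the given strings."""
    return template_map[name].format(*args)


if __name__ == "__main__":
    print(translate(actions, "move", [encode_object("a"), "the table", "the shelf"]))
    print(translate(predicates, "at", [encode_object("p3"), "the depot"]))
```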