Official repository for ODQA experiments from Decomposed Prompting: A Modular Approach for Solving Complex Tasks, ICLR23

Paper: https://arxiv.org/abs/2210.02406


Decomposed Prompting:
A Modular Approach for Solving Complex Tasks

This is the official repository for open-domain QA experiments from our ICLR 2023 paper "Decomposed Prompting: A Modular Approach for Solving Complex Tasks". Check out the main repository for the other experiments.

Installation

conda create -n decomp-odqa python=3.8.0 -y && conda activate decomp-odqa
pip install -r requirements.txt
python -m spacy download en_core_web_sm

Prepare Data

You can download all our processed data by running

./download/processed_data.sh

The data will be downloaded to processed_data/{dataset_name}/. If you're just looking for the dev/test data we used in the paper, it's in processed_data/{dataset_name}/{dev|test}_subsampled.jsonl.
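Each processed file is in JSON Lines format (one JSON object per line). A minimal loading sketch follows; the path is just an example from the layout above, and no particular record fields are assumed:

```python
import json

def load_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# e.g. instances = load_jsonl("processed_data/hotpotqa/dev_subsampled.jsonl")
```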

Follow these steps if you want to regenerate all processed data from scratch.
# 1. Download raw data:
## raw data will be in raw_data/{dataset_name}/
./download/raw_data.sh

# 2. Process raw data files into a single standard format
## processed data will be in processed_data/{dataset_name}/
python processing_scripts/process_hotpotqa.py
python processing_scripts/process_2wikimultihopqa.py
python processing_scripts/process_musique.py

# 3. Subsample the processed datasets.
## Note (i) dev processing has to be done before test.
## (ii) because of randomness it may create different samples than the ones we used,
## so consider using the released data if the goal is reproduction.
## (iii) sampled data will be in processed_data/{dataset_name}/{dev|test}_subsampled.jsonl
python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa dev
python processing_scripts/subsample_dataset_and_remap_paras.py hotpotqa test
python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa dev
python processing_scripts/subsample_dataset_and_remap_paras.py 2wikimultihopqa test
python processing_scripts/subsample_dataset_and_remap_paras.py musique dev
python processing_scripts/subsample_dataset_and_remap_paras.py musique test

# 4. Attach reasoning steps and supporting paragraph annotations
## to the preprocessed (train) data files.
## To do this, you'll need to set up an Elasticsearch server and index all dataset corpora.
## See the 'Prepare Retriever and LLM Servers' section in this readme.
python prompt_generator/attach_data_annotations.py hotpotqa
python prompt_generator/attach_data_annotations.py 2wikimultihopqa
python prompt_generator/attach_data_annotations.py musique

You'll also need raw_data if you want to build the Elasticsearch indices and run the retriever or ODQA systems.

./download/raw_data.sh

The data will be downloaded to raw_data/{dataset_name}/.

Prepare Prompts

All our prompts are available in the prompts/ directory. If you're using these prompts outside of this codebase, note that the # METADATA: ... lines need to be stripped from them at runtime.
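Stripping those lines is a one-liner; a small sketch, assuming the metadata lines start exactly with "# METADATA:":

```python
def strip_metadata(prompt_text):
    """Drop '# METADATA: ...' lines from a prompt file's contents."""
    kept = [line for line in prompt_text.split("\n")
            if not line.startswith("# METADATA:")]
    return "\n".join(kept)
```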

If you want to generate them from scratch, run

python prompt_generator/generate_prompts.py {dataset_name} # hotpotqa, 2wikimultihopqa, musique

Note, though, that because of the random sampling used to select distractors, some of the regenerated prompts may differ. So if your goal is to reproduce the experiments, use the released ones.

Prepare Retriever and LLM Servers

First, install Elasticsearch 7.10.

Install on Mac (option 1)

# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/brew.html
brew tap elastic/tap
brew install elastic/tap/elasticsearch-full # if it doesn't work: try 'brew untap elastic/tap' first: untap>tap>install.
brew services start elastic/tap/elasticsearch-full # to start the server
brew services stop elastic/tap/elasticsearch-full # to stop the server

Install on Mac (option 2)

# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-darwin-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-darwin-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server

Install on Linux

# source: https://www.elastic.co/guide/en/elasticsearch/reference/8.1/targz.html
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
pkill -f elasticsearch # to stop the server

Check out the referenced sources if you run into problems installing it.

Start the Elasticsearch server on port 9200 (the default), and then start the retriever server as shown below. You can change the Elasticsearch port in retriever_server/serve.py if needed.

uvicorn serve:app --port 8000 --app-dir retriever_server

Next, index the Wikipedia corpora for the datasets. Make sure you've downloaded raw_data and processed_data first.

python retriever_server/build_index.py {dataset_name} # hotpotqa, iirc, 2wikimultihopqa, musique

After indexing, you can check the number of documents in each index by running curl localhost:9200/_cat/indices. You should have one index per dataset, named {dataset}-wikipedia. Make sure the counts match the statistics given in the paper: HotpotQA (5,233,329), 2WikiMultihopQA (430,225), MuSiQue (139,416).
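The check can also be scripted. A sketch using Elasticsearch's cat API (the `h=index,docs.count` parameter restricts the output to those two columns), assuming the server is on localhost:9200; the expected sizes are the ones from the paper:

```python
import urllib.request

EXPECTED = {  # document counts reported in the paper
    "hotpotqa-wikipedia": 5_233_329,
    "2wikimultihopqa-wikipedia": 430_225,
    "musique-wikipedia": 139_416,
}

def parse_cat_indices(text):
    """Parse 'index docs.count' pairs from GET /_cat/indices?h=index,docs.count."""
    counts = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:
            counts[parts[0]] = int(parts[1])
    return counts

def check_indices(host="http://localhost:9200"):
    with urllib.request.urlopen(f"{host}/_cat/indices?h=index,docs.count") as r:
        counts = parse_cat_indices(r.read().decode())
    for index, expected in EXPECTED.items():
        actual = counts.get(index)
        status = "OK" if actual == expected else "MISMATCH"
        print(f"{index}: {actual} (expected {expected}) {status}")
```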

Next, if you want to use flan-t5-* models, start the llm_server by running:

MODEL_NAME={model_name} uvicorn serve:app --port 8010 --app-dir llm_server # model_name: flan-t5-xxl, flan-t5-xl, flan-t5-large

If you want to use OpenAI models (e.g., codex in our experiments), you don't need to start the LLM server. In that case, you just need to set the OPENAI_API_KEY environment variable.

If you start the retriever and/or LLM server on a different host or port, update .retriever_address.jsonnet and .llm_server_address.jsonnet accordingly before running the retrieval/ODQA systems.
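These address files are plain Jsonnet. The field names below are purely illustrative, not the actual schema; mirror whatever the released .retriever_address.jsonnet file uses:

```jsonnet
// Hypothetical sketch only; check the released file for the exact field names.
{
    "host": "http://localhost",
    "port": 8000
}
```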

Run Retrieval and ODQA Systems

First, download dataset repositories for official evaluation: ./download/official_eval.sh.

Next, set the variables:

  • SYSTEM: choose from (decomp_context, no_decomp_context, no_context)
  • READER: choose from (direct, cot)
  • MODEL: choose from (codex, flan-t5-xxl, flan-t5-xl, flan-t5-large)
  • DATASET: choose from (hotpotqa, 2wikimultihopqa, musique)

decomp_context is our proposed system; the other two are baselines. It works better with the cot reader than the direct reader, and it works best with the codex model. Note that for the flan-t5-* models, we still used codex for question decomposition in our structured format, as flan-t5-* was not good at it; the non-decomposition modules used flan-t5-* variants in those settings.

You can run the code using:

./reproduce.sh $SYSTEM $READER $MODEL $DATASET

This script runs several steps in sequence: instantiating experiment configs with hyperparameters (HPs), running predictions for them on the dev set, picking the best HP, writing an experiment config with the best HP, running it on the test set, and summarizing the results with mean and std.
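For example, a small shell sketch that sweeps the proposed system over all three datasets, shown as a dry run (remove the echo to actually launch the runs):

```shell
# Dry-run sweep of our proposed system (decomp_context + cot + codex)
# over all three datasets.
SYSTEM=decomp_context
READER=cot
MODEL=codex
for DATASET in hotpotqa 2wikimultihopqa musique; do
  echo ./reproduce.sh "$SYSTEM" "$READER" "$MODEL" "$DATASET"
done
```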

If you prefer to have more control, you can also run it step-by-step as follows:

# Instantiate experiment configs with different HPs and write them in files.
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 1
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 2
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 3
## if you make a change to base_configs, the above steps need to be rerun to
## regenerate instantiated experiment configs (with HPs populated)

# Run experiments for different HPs on dev set
python runner.py $SYSTEM $READER $MODEL $DATASET predict --prompt_set 1
## predict command runs evaluation at the end by default. If you want to run evaluation
## separately after prediction, you can replace predict with evaluate here.

# Show results for experiments with different HPs
python runner.py $SYSTEM $READER $MODEL $DATASET summarize --prompt_set 1
## Not strictly necessary; it just shows you the results for the different HPs in a table.

# Pick the best HP and save the config with that HP.
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 1 --best
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 2 --best
python runner.py $SYSTEM $READER $MODEL $DATASET write --prompt_set 3 --best

# Run the experiment with best HP on test set
python runner.py $SYSTEM $READER $MODEL $DATASET predict --prompt_set 1 --best --eval_test --official
python runner.py $SYSTEM $READER $MODEL $DATASET predict --prompt_set 2 --best --eval_test --official
python runner.py $SYSTEM $READER $MODEL $DATASET predict --prompt_set 3 --best --eval_test --official
## predict command runs evaluation at the end by default. If you want to run evaluation
## separately after prediction, you can replace predict with evaluate here.

# Summarize the best test results for the individual prompt sets and their aggregate (mean +- std)
python runner.py $SYSTEM $READER $MODEL $DATASET summarize --prompt_set 1 --best --eval_test --official
python runner.py $SYSTEM $READER $MODEL $DATASET summarize --prompt_set 2 --best --eval_test --official
python runner.py $SYSTEM $READER $MODEL $DATASET summarize --prompt_set 3 --best --eval_test --official
python runner.py $SYSTEM $READER $MODEL $DATASET summarize --prompt_set aggregate --best --eval_test --official
## The mean and std in the final command is what we reported in the paper.

DISCLAIMER: Please note that all our experiments rely on codex, which was deprecated after our submission. You can run these experiments with other OpenAI completion models, or with other open/commercial models (see notes below). But keep track of the cost, as it can add up quickly.

Running DecomP (ODQA) using a Different Dataset or LLM

Each experiment (system, model, data combination) in this project corresponds to an experiment config in base_configs/...jsonnet. Find the experiment closest to your use case and change the model, dataset, and related information in it as per your needs.

If you've changed the dataset, you'll need to ensure that an Elasticsearch index of that name is available (see the data processing and retriever setup sections above).

If you've changed the model, you'll need to ensure a model of that name is implemented and available in the code. If you want to try out a different OpenAI completion model, it'd just involve configuring the engine variable and setting the model_tokens_limit. The chat-based API isn't readily supported yet, but adding it shouldn't be much work if you're interested. If you're interested in open LLMs, like Llama, MPT, etc., you can set up an OpenAI-compatible FastChat server, make the necessary changes in base_configs/, and you should be good to go.

If you're stuck anywhere in this process, open an issue with your specific choice of data/model, and I can help you get there.

Acknowledgement

This code is heavily based on CommaQA, which provides a way to build complex/multi-step systems involving agents. All modeling-related code for this project is in commaqa/inference/odqa.py, and all experiment configs (without HPs instantiated) are in base_configs/.

Citation

If you find this work useful, consider citing it:

@inproceedings{
    khot2023decomposed,
    title={Decomposed Prompting: A Modular Approach for Solving Complex Tasks},
    author={Tushar Khot and Harsh Trivedi and Matthew Finlayson and Yao Fu and Kyle Richardson and Peter Clark and Ashish Sabharwal},
    booktitle={The Eleventh International Conference on Learning Representations},
    year={2023},
    url={https://openreview.net/forum?id=_nGgzQjzaRy}
}

License: Apache License 2.0